Everyone's AI

Ch.07

Vision Models: Local CNN vs Global ViT

CNNs slide a kernel to build local features; ViT turns patches into tokens and applies global self-attention. Window attention is a common way to split that cost.

CNN vs ViT — background photo: Unsplash

Learning flow

CNN reads neighborhoods like a moving magnifier; ViT runs a town-hall meeting among patches.

ViT slices an image into patches (like tokens / words) and runs dense self-attention so, even in early blocks, every patch can relate to every other patch—global mixing across the frame. CNNs slide small kernels with weight sharing; each layer typically sees only a local neighborhood while the receptive field widens indirectly as depth grows (not the whole image at once in a single conv).
Hierarchical window attention sits between the two: window attention + shift + merge approximates global context hierarchically, without paying the full $\mathcal{O}(N^2)$ price of swallowing global context in one bite as the token count $N$ grows.
This chapter lines up CNN vs ViT on local vs global, inductive bias vs flexibility, and compute / memory / data scale so you can choose backbones with intent.

How to read the key formulas (CNN vs ViT)

1) CNN local mixing: At location $i$, one neuron mixes a small neighborhood with a kernel $W$:
$$y_i = \sum_{\delta \in \mathcal{N}_k} W_\delta \, x_{i+\delta}$$
(up to padding/boundary details). Shared $W$ across locations is the CNN hallmark.
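As a minimal sketch of this formula (the helper name `local_mix` and the averaging kernel are ours, purely for illustration), sliding one shared 3×3 kernel over a 5×5 input looks like:

```python
import numpy as np

def local_mix(x, W):
    """Apply one shared k-by-k kernel W at every valid location of 2D input x.

    This is y_i = sum_delta W_delta * x_{i+delta} (cross-correlation, no
    padding); the same W is reused at every location (weight sharing).
    """
    k = W.shape[0]
    H, W_in = x.shape
    out = np.zeros((H - k + 1, W_in - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(W * x[i:i + k, j:j + k])  # mix a local patch only
    return out

x = np.arange(25, dtype=float).reshape(5, 5)
W = np.ones((3, 3)) / 9.0       # a simple averaging kernel
y = local_mix(x, W)
print(y.shape)                  # (3, 3): each output sees only a 3x3 neighborhood
```

Note that no output element ever sees the whole input; that is exactly the locality the chapter contrasts with ViT.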
The diagram contrasts CNN receptive-field growth with one-block global mixing in ViT.
2) ViT global mixing in one block: Build $Q, K, V$ from patch tokens and apply
$$\mathrm{Attn}(X) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$
The $QK^\top$ term is naturally token×token, so cost tends to scale as $\mathcal{O}(N^2)$ as $N$ grows (the usual self-attention cost picture).
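A minimal single-head sketch of this block (the names `attn` and `Wq/Wk/Wv` are ours; real ViTs add multiple heads, output projections, and residual connections):

```python
import numpy as np

def attn(X, Wq, Wk, Wv):
    """softmax(Q K^T / sqrt(d_k)) V over N patch tokens, single head."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])   # (N, N): every token vs every token
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)        # row-wise softmax
    return w @ V                              # (N, d): globally mixed tokens

rng = np.random.default_rng(0)
N, d = 196, 64                  # 224x224 input, 16x16 patches -> 14*14 = 196 tokens
X = rng.standard_normal((N, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
out = attn(X, Wq, Wk, Wv)
print(out.shape)                # (196, 64); the scores matrix alone was 196 x 196
```

The `(N, N)` `scores` array is exactly the term that makes dense attention $\mathcal{O}(N^2)$.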
3) Two ways to grow "how far you see":
① CNN: stack layers / pooling / stride so the receptive field widens indirectly.
② ViT (dense): compare all positions in one attention block for direct long-range mixing (with higher cost).
4) Hierarchical windows to split the $N^2$ pain:
① Cost: large $N$ makes ViT's $N^2$ painful first.
② Compromise: window attention costs roughly $\mathcal{O}(N M^2)$, shifts mix information across window boundaries, and merging shrinks $N$.
③ One-liner: not "global in one gulp," but compose small globals hierarchically.
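A back-of-envelope comparison of the two cost pictures (the 7×7 window and 56×56 grid are illustrative assumptions in the spirit of Swin-style designs; constants are ignored):

```python
# Rough score-matrix cost: dense attention vs window attention.
def dense_cost(N):
    return N * N                 # one N x N score matrix

def window_cost(N, M):
    # N / M^2 windows, each with an (M^2 x M^2) score matrix -> N * M^2 total
    return N * M * M

N = 56 * 56                      # 3136 tokens on a 56x56 feature grid
M = 7                            # 7x7 windows

print(dense_cost(N))             # 9834496 score entries
print(window_cost(N, M))         # 153664 -> 64x cheaper at this setting
```

Same idea in one line: windows trade one giant $N \times N$ meeting for many small $M^2 \times M^2$ meetings, and shifting plus merging recovers global context over depth.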

CNN and ViT: two grammars for the same pixels

1. CNN: slide a magnifying glass (locality)
Intuition: A kernel looks only at a small $k \times k$ neighborhood. The same weights slide across locations (weight sharing), which bakes in strong priors like approximate translation equivariance.
One-line math: Each output location mixes a local patch with learned weights (convolution / cross-correlation up to padding). Going deeper grows the receptive field indirectly.
2. ViT: patches as words, global debate in one block
Intuition: Patches become tokens. Dense self-attention lets every token attend to every other token in one block—powerful for long-range structure, but the score matrix grows like $N \times N$.
Scale reminder: Attention often scales as $\mathcal{O}(N^2 d)$ up to constants (heads, projections, implementation details).
3. A hierarchical compromise: not swallowing "global" in one bite
Connection: Attend inside windows, shift windows to mix across boundaries, and merge tokens to reduce $N$. That buys global context hierarchically instead of one giant global meeting every layer.
4. Why compare CNN vs ViT in one chapter?
Data & resolution: CNNs can be strong with modest data; ViTs often shine with large-scale pretraining—but costs rise with $N$.
Tasks: Detection/segmentation still lean on multi-scale designs (pyramids, hierarchical windows, ConvNeXt, FPN).
Bridge: Efficient attention and window/merge hierarchies are the engineering knobs behind the same $N^2$ anxiety.
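The "going deeper grows the receptive field indirectly" claim in point 1 can be made concrete with the standard receptive-field recursion (the helper name is ours):

```python
# Standard receptive-field recursion for a plain conv stack:
#   rf_l = rf_{l-1} + (k_l - 1) * jump_{l-1},   jump_l = jump_{l-1} * s_l
def receptive_field(layers):
    """layers: list of (kernel_size, stride) pairs, input side first."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

print(receptive_field([(3, 1)] * 5))   # 11: five 3x3 stride-1 convs see only 11 px
print(receptive_field([(3, 2)] * 5))   # 63: stride-2 downsampling widens RF fast
```

This is the quantitative version of "debugging always returns to how far one layer can look": stacking and striding grow context gradually, while one dense attention block sees everything at once.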

Why it matters

Model choice is a table of assumptions, not a leaderboard row
CNNs encode locality + sharing + hierarchy—strong inductive biases that can stabilize small-data regimes. ViTs relax locality in exchange for flexible global mixing, often trading in more data and compute.
Hybrids are the new normal
Modern stacks mix patch embeddings + CNN stems, window attention + conv downsampling, and ConvNeXt-style depthwise conv blocks. Still, debugging always returns to how far one layer can look.
What you take away from this chapter
This chapter places CNNs (locality, receptive field, weight sharing) next to ViTs (patch tokens, dense self-attention, $N$ and $N^2$) on the same axis. After it, you can reason in plain language about tokens / patches / resolution when only ViTs OOM, and about inductive bias vs data when choosing a first baseline on small labels.
The goal here is not memorizing trendy backbone names, but carrying a decision frame: how much local vs global you buy for your task, GPU budget, and label scale.

How it is used

1. Where CNNs still rule: on-device, edge, and real-time
Because they are light and fast, CNNs dominate workloads that must run immediately on the device without round-tripping a heavy cloud model.
* Phone camera filters & face unlock (Face ID): You need low battery and ~0.1s to find facial contours—local cues—so lightweight CNNs like MobileNet are the default.
* Factory conveyor defect detection: When dozens of parts pass every second, real-time scratch/dent checks fit CNN detectors (e.g., YOLO family) extremely well.
* Dashcams & low-latency ADAS: If a pedestrian jumps out, ~10ms perception-to-brake favors a fast CNN stack over a transformer that might stall under tight latency.
2. Where ViTs shine: huge AI, generative models, multimodal
ViTs sit where teams can spend serious server/GPU budget and need deep, holistic understanding. They share the transformer recipe with LLMs, so pairing text + vision is natural.
* ChatGPT vision (GPT-4o / GPT-4V-style): Upload a receipt photo and it reads/summarizes—language and pixels live in one transformer space. A ViT-style image encoder is the “eyes.”
* Midjourney, DALL·E, etc.: Prompt like *“an astronaut cat smoking a Marlboro”* flows through diffusion + transformer backbones (e.g., DiT) where ViT-like tokens keep global composition coherent.
* Medical & satellite analytics: Metastasis patterns or terrain change over a wide tile often need macro context beyond a single pixel—global mixing helps.
3. Production hybrids (CNN + ViT): the practical sweet spot
Most product teams blend both, not pick a purity trophy.
* Pattern: Feed high-res frames through CNN early stages to downsample quickly and extract edges/textures, then attach ViT/transformer blocks deeper to reason about long-range context on a smaller token grid.
* Examples: Google CoAtNet and Apple MobileViT popularized this recipe to ship transformer benefits on mobile-class hardware.

Summary

CNNs stack local kernels with weight sharing, widening the receptive field layer by layer so long-range context is built indirectly. ViTs turn images into patch tokens and use dense self-attention for global mixing in a block, but cost tends to blow up as $N^2$ when $N$ grows.
Window attention with shift/merge is a common hierarchical way to split that $N^2$ pain. When picking a backbone, weigh data & pretraining scale, resolution / patch size / $N$, whether you need dense prediction (detection/segmentation), and latency / GPU memory together.

Notes for problem solving

Start by reading like this
- First classify the axis: CNN locality & receptive field / ViT patches & global attention / data & inductive bias
- For calculations, fix units: tokens $N$, input $H \times W$, patch $P$ → grid $\approx (H/P)(W/P)$
- When you see $N^2$, think dense token×token score matrices
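The grid rule above in two lines (the helper name is ours):

```python
def num_tokens(H, W, P):
    """Patch-token count for an H x W image with P x P patches (grid only)."""
    return (H // P) * (W // P)

print(num_tokens(224, 224, 16))   # 196 (the classic ViT setting)
print(num_tokens(448, 448, 16))   # 784: 2x resolution -> 4x N -> 16x the N^2 cost
```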

Example (concept)
"In CNNs, applying the same kernel at every spatial location is best described as?"
① global attention
② weight sharing
③ patch merging
④ CLS token
Answer 2
Why? Weight sharing is the CNN default—different axis from ViT patch tokens.

Example (T/F)
"A typical first CNN conv directly sees the whole image"
Answer: false
Why? First layers are local; wide context emerges deeper.

Example (calc)
"If $N=10$, what is $N^2$?" → 100
Example (scenario)
"ViT OOM at high resolution—first move?"
① increase batch only
② change tokens (resolution/patch/batch)
③ add only global blocks
Answer 2
Why? Reduce memory pressure directly.

Example (numeric MC)
"$224 \times 224$ input, $16 \times 16$ patches: patch token count (grid only)?"
① 14
② 196
③ 224
④ 3136
Answer 2 (value 196)
Why? $(224/16)^2 = 14^2 = 196$.

Example (concept pick)
"Small labels, need a quick stable baseline?"
① huge ViT scratch
② small CNN + augmentation
③ train with no data
Answer 2
Why? Locality bias stabilizes learning.

Example (integrated)
"Same aug/res, only ViT OOMs—next check?"
① optimizer
② patch/sequence/checkpointing
③ log encoding
Answer 2