3:["$","$L18",null,{"formats":"$undefined","locale":"en","messages":{"meta":{"title":"Everyone's AI","description":"Free AI, deep learning & machine learning courses. Learn basic math, neural networks, backpropagation, KNN, regression, ensemble step by step with quizzes. AI for beginners—start here.","keywords":"deep learning, machine learning, AI course, free AI course, deep learning for beginners, machine learning tutorial, AI learning, neural network, backpropagation, KNN, linear regression, free course, AI education","learnTitle":"Learn","learnPageSeoTitle":"Basic Deep Learning | Learn","learnDescription":"Free basic deep learning course: dot product, matrix multiplication, linear layer, activation, backpropagation. Learn by chapter with problems and mini neural network playground.","learnKeywords":"basic deep learning, deep learning, dot product, matrix multiplication, neural network, backpropagation, linear layer, activation function, softmax, free deep learning course","learnMathTitle":"Basic Math and AI | Learn","learnMathDescription":"Free basic math for AI and deep learning: functions, vectors, matrices, exponents, logarithms, uniform and normal distributions. AI math foundations.","learnMathKeywords":"basic math, functions, vectors, matrices, AI math, normal distribution, deep learning math","learnMlTitle":"Basic Machine Learning | Learn","learnMlDescription":"Free basic machine learning course: KNN, linear and logistic regression, decision trees, ensemble, K-means, cross-validation, recommendation systems. 
ML for beginners.","learnMlKeywords":"basic machine learning, machine learning, KNN, linear regression, logistic regression, decision tree, ensemble, K-means, cross-validation, recommendation system, free ML course","learnMidMlTitle":"Intermediate Machine Learning | Learn","learnMidMlDescription":"Data preprocessing (scaling, encoding, imputation), PCA, SVM, boosting basics, DBSCAN, GMM, pipelines, and hyperparameter tuning for real-world ML.","learnMidMlKeywords":"intermediate ML, scaling, encoding, imputation, PCA, SVM, boosting, AdaBoost, GBM, DBSCAN, GMM, pipeline, Optuna","learnAdvDlTitle":"Advanced Deep Learning | Learn","learnAdvDlDescription":"Transformer, BERT, GPT, FlashAttention, ViT, self-supervised learning, prompt engineering, LoRA, QLoRA, RLHF, DPO, RAG, LLM agents, GNN, XAI, autoencoder, VAE, GAN, diffusion, VLM, speech, knowledge distillation, deployment: large models and generative AI in chapters.","learnAdvDlKeywords":"advanced deep learning, Transformer, BERT, GPT, FlashAttention, ViT, LoRA, QLoRA, RLHF, DPO, RAG, LLM agents, GNN, Grad-CAM, VAE, GAN, diffusion model, Stable Diffusion, CLIP, Whisper, knowledge distillation, TensorRT, vLLM","learnMidDlTitle":"Intermediate Deep Learning | Learn","learnMidDlDescription":"Weight initialization, Adam, learning rate scheduling, regularization, batch/layer norm, data augmentation, CNN, pooling, ResNet, efficient convolution, transfer learning, object detection, image segmentation, tokenization, word embedding, 1D CNN, RNN, LSTM, GRU, encoder-decoder, attention: stable training and unstructured data in chapters.","learnMidDlKeywords":"intermediate deep learning, weight initialization, Xavier, He, Adam, RMSprop, learning rate scheduling, regularization, dropout, batch norm, layer norm, data augmentation, CNN, pooling, ResNet, MobileNet, transfer learning, YOLO, SSD, U-Net, tokenization, BPE, Word2Vec, GloVe, RNN, LSTM, GRU, attention","learnMidMathTitle":"Intermediate Math | 
Learn","learnMidMathDescription":"Vectors, matrices, linear transformation, eigenvalues, gradient, Jacobian, Hessian, Taylor series, convex optimization, conditional probability, Bayes, covariance, multivariate normal, MLE, entropy, cross-entropy: intermediate math for multivariable and uncertainty, chapter by chapter.","learnMidMathKeywords":"intermediate math, vector space, dot product, matrix, linear transformation, inverse, determinant, rank, eigenvalue, eigenvector, gradient, Jacobian, Hessian, Taylor series, convex optimization, conditional probability, Bayes theorem, covariance, MLE, entropy, KL divergence","learnAdvMathTitle":"Advanced Math | Learn","learnAdvMathDescription":"SVD, tensor algebra, Lagrange, Markov, Monte Carlo, MCMC, EM, MAP, variational inference, Wasserstein, MDP, Fourier, graph Laplacian, SDE, Langevin, information geometry: advanced math for generative models and optimization, chapter by chapter.","learnAdvMathKeywords":"advanced math, SVD, pseudoinverse, tensor, Lagrange, KKT, Markov, Monte Carlo, MCMC, EM, MAP, variational inference, Wasserstein, MDP, Bellman, Fourier, graph Laplacian, SDE, Langevin, score matching, information geometry","learnAdvMlTitle":"Advanced Machine Learning | Learn","learnAdvMlDescription":"Feature engineering, PCA, t-SNE, SVM, kernels, boosting, XGBoost, imbalanced data, anomaly detection, DBSCAN, GMM, hyperparameter tuning, cross-validation, XAI, SHAP, time series, recommender systems: advanced ML for nonlinear problems, complex data, optimization, and interpretability.","learnAdvMlKeywords":"advanced machine learning, feature engineering, PCA, t-SNE, UMAP, SVM, kernel, boosting, XGBoost, LightGBM, SMOTE, anomaly detection, DBSCAN, GMM, hyperparameter, Optuna, XAI, SHAP, LIME, time series, ARIMA, Prophet, matrix factorization, FM","learnPaperReviewHubTitle":"Paper review | Learn","learnPaperReviewHubDescription":"A hub for AI and deep learning papers organized by topic: theory, optimization, architecture, 
tabular data, vision, NLP, XAI, data-centric AI, edge/web, and domain applications.","learnPaperReviewHubKeywords":"paper review, AI papers, deep learning, machine learning, Learn","playgroundTitle":"Mini Neural Network Playground","playgroundDescription":"Draw and explore neural network structures in your browser.","communityTitle":"IT News","communityDescription":"Stay up to date with the latest AI and IT news and development trends. New posts are added regularly; find them via search.","communityKeywords":"IT news, AI news, artificial intelligence news, machine learning, deep learning, LLM, AI development trends, tech news, AI updates","studiesTitle":"Studies","studiesDescription":"Find deep learning study groups and learning resources.","curriculumTitle":"Book reading","curriculumDescription":"Create and share book-based learning roadmaps.","supportTitle":"Support & Contact","supportDescription":"How to use Everyone's AI, Chrome extension, and support for Learn and community.","privacyTitle":"Privacy Policy","privacyDescription":"How Everyone's AI collects, uses, and stores personal information.","termsTitle":"Terms of Service","termsDescription":"Terms of service for Everyone's AI.","refundTitle":"Refund Policy","refundDescription":"Refund policy for Learn paid subscription.","aboutTitle":"What is Everyone's AI?","aboutDescription":"An AI education platform built by an AI researcher. Learn basic math and deep learning step by step. Based on experience from K-League AI Competition 3rd place, Financial AI Challenge 22nd, and more."},"support":{"title":"Support & Contact","intro":"For how to use Everyone's AI (mdooai.com), error reports, and suggestions, please refer to the following.","serviceTitle":"Service introduction","serviceContent":"Everyone's AI is an education platform that helps you understand deep learning and AI from the ground up. 
It offers Learn (chapter-by-chapter visuals, some chapters paid subscription), Book reading (book-based roadmaps), Community (learning material sharing), and a Chrome extension (open the learning page in a new tab).","extensionTitle":"Chrome extension","extensionContent":"Clicking the toolbar icon opens the learning page (https://mdooai.com/learn) in a new tab. For installation or usage questions, contact us via this support page or the extension's Chrome Web Store listing.","extensionInstallCta":"Install from Chrome Web Store","contactTitle":"Contact us","contactContent":"For general inquiries, error reports, or suggestions, please use the contact option on mdooai.com or the published contact details. We will respond as soon as possible.","linksTitle":"Related links","learnLink":"Learn","privacyLink":"Privacy Policy","termsLink":"Terms of Service","refundLink":"Refund Policy","supportUrlLabel":"Support URL"},"about":{"title":"What is Everyone's AI?","intro1":"Hello. This is Everyone's AI. We focus on machine learning and deep learning.","intro2":"I have participated in various AI competitions and developed models used in industry. Through that, I learned one important lesson: while technique matters, what really determines the difference in performance is understanding the fundamentals. These days you can implement models quickly with vibe coding, but when performance doesn't meet expectations, analyzing the cause and improving is still not easy. Without an understanding of the mathematical foundations and AI principles, it's difficult to structurally identify where bottlenecks occur.","intro3":"This site is designed to reduce that trial and error by helping you learn concepts and calculations together.","intro4Before":"So I developed and released this learning platform based on what I've studied and organized. 
If you'd like lectures or training, feel free to reach out at ","intro4After":" and I'll be glad to help.","approachTitle":"Learning Approach","approachContent":"Rather than listing concept summaries, the content follows the flow of computation step by step so you understand 'why it works this way.' It's centered on visualization and interaction, with immediate AI coach feedback to correct misconceptions.","roadmapTitle":"Future Plans","roadmapContent":"We plan to continuously expand with more AI education content, including machine learning. If you're interested, feel free to contact us at ","roadmapContactAfter":" anytime.","feedbackNote":"It's still an early version, but we're improving it continuously. Your feedback is welcome and will be actively incorporated.","ctaLearn":"Start Learning","ctaDeveloper":"View Developer Profile","chromeExtensionTitle":"Add to Chrome Web Store","chromeExtensionDesc":"Install the Chrome extension to open the learning page in a new tab."},"terms":{"title":"Terms of Service","effectiveDate":"Effective: March 2, 2026 (updates will be announced on this page).","intro":"Thank you for using Everyone's AI (mdooai.com). These terms apply to your use of our services.","section1Title":"1. Applicable scope","section1Content":"These terms apply to the Everyone's AI website and related services (Learn, Book reading, Community, etc.). Only Learn has paid subscription content (some chapters). By using the service, you agree to these terms.","section2Title":"2. Service use","section2Content":"You may use the service after signing up or logging in. Learn has free and paid subscription chapters; Book reading, Community, and other services are free. Payment and refund terms for Learn are on the relevant policy pages.","section3Title":"3. Prohibited acts","section3Content":"You may not misuse accounts, disrupt the service, violate laws, or copy content for commercial use without permission. 
Violations may result in restricted access.","section4Title":"4. Terms changes","section4Content":"We may update these terms and will announce changes on this page. Significant changes will state an effective date. Continued use after changes constitutes acceptance.","section5Title":"5. Inquiries","section5Content":"For questions about these terms or the service, please contact us via mdooai.com or the support/contact options on the site.","termsUrlLabel":"Terms of Service URL"},"refund":{"title":"Refund Policy","effectiveDate":"Effective: March 2, 2026 (updates will be announced on this page).","intro":"Everyone's AI Learn (paid subscription) is billed monthly at 4 USD per month. This policy covers payment and refunds.","section1Title":"1. Subscription fee and payment","section1Content":"The Learn paid subscription fee is 4 USD per month and is automatically renewed and charged each month from your payment date. Payment is processed by Paddle or another payment provider; you will be charged according to the amount, currency, and billing cycle shown at checkout.","section2Title":"2. Refunds","section2Content":"If you are not satisfied with the service, you may request a full refund within 7 days of your first payment date. After 7 days or from the second payment onward, no refund is given for the current month’s period. Request refunds via the site’s support/contact or Paddle customer support.","section3Title":"3. Cancellation","section3Content":"You may cancel your subscription at any time. After cancellation, access to paid Learn chapters continues until the end of the current billing period; you will not be charged from the next billing date. No refund is given for the current month already paid.","section4Title":"4. Applicability and contact","section4Content":"Refund and cancellation procedures follow the policy in effect at the time of payment and Paddle’s policy. 
For refund, payment, or cancellation questions, use the mdooai.com support page or Paddle customer support.","refundUrlLabel":"Refund Policy URL"},"privacy":{"title":"Privacy Policy","effectiveDate":"Effective: March 2, 2026 (updates will be announced on this page).","section1Title":"1. Scope","section1Content":"This Privacy Policy applies to the Everyone's AI (mdooai.com) website and related services (Learn, Book reading, Community, Chrome extension, etc.). Only Learn has paid subscription chapters.","section2Title":"2. Information we collect","section2Intro":"The following information may be collected and used when you use our services.","section2List1":"Account information: email, password, display name, etc. when you sign up or log in.","section2List2":"Usage data: learning progress, community posts and comments, book reading roadmaps, etc.","section2List3":"Device and environment: browser, access logs, etc. (for service improvement and security).","section2List4":"Learn payment and subscription: payment is processed by Paddle or another payment provider; we do not store card details. Learn subscription and purchase information may be used to provide access to paid chapters and handle refunds or cancellation.","section2Extension":"The Chrome extension does not collect or transmit user data. It only opens the learning page in a new tab when you click the icon.","section3Title":"3. How we use the information","section3Content":"Collected information is used to provide and improve the service, respond to inquiries, ensure security and prevent abuse, and comply with applicable laws.","section4Title":"4. Retention and deletion","section4Content":"Personal information is securely deleted after the purpose of use is fulfilled or after the legally required retention period. We also delete data in accordance with our procedures when you request deletion or account closure.","section5Title":"5. 
Third-party sharing","section5Content":"We do not sell or provide your personal information to third parties without your consent. We may share information only when required by law or with your consent.","section6Title":"6. Policy changes","section6Content":"We will update this page when the Privacy Policy changes. Significant changes will be announced with an effective date.","section7Title":"7. Contact","section7Content":"For questions about how we handle personal information, please contact us via mdooai.com or the contact option on the site.","privacyUrlLabel":"Privacy Policy URL"},"common":{"appName":"Everyone's AI","headerBrand":"Everyone's AI","loading":"Loading...","close":"Close","back":"Back","backToHome":"← Home","chapterSelect":"Select chapter","chapterSearchNoResults":"No results found.","chapterListEmpty":"No chapters.","chapters":"Learn","curriculum":"Book reading","community":"Community","itNews":"IT News","language":"Language","openMenu":"Open menu","closeMenu":"Close menu","menu":"Menu","communityComingSoon":"Community section is coming soon.","searchPlaceholder":"Search chapters, concepts…","globalSearchPlaceholder":"Search all chapters…","globalSearchNoResults":"No results found.","answer":"Answer","wrongAnswerGuideButton":"Why was it wrong?","mcTfFalse":"False","mcTfTrue":"True","mcCircled1":"①","mcCircled2":"②","mcCircled3":"③","mcCircled4":"④","signIn":"Sign in","signUp":"Sign up","myAccount":"My Account","signOut":"Sign out","aboutLink":"What is Everyone's AI?","myAchievements":"My achievements","moreServices":"More","allServices":"All services","saving":"Saving…"},"community":{"title":"IT News","subtitle":"Stay up to date with the latest AI and IT news and development trends.","allPosts":"All posts","viewFullCommunity":"View full community","sortNewest":"Newest","sortOldest":"Oldest","newPost":"New post","createPost":"Create post","uploadMaterial":"Upload 
material","uploadTitle":"Title","category":"Category","categoryAll":"All","categoryPlaceholder":"Select category","category_ai_news":"AI News","category_ai_basics":"AI Basics","category_machine_learning":"Machine Learning","category_deep_learning":"Deep Learning","category_nlp":"Natural Language Processing","category_computer_vision":"Computer Vision","category_llm":"Large Language Models","category_prompt_engineering":"Prompt Engineering","category_ai_ethics":"AI Ethics","category_ai_tools":"AI Tools","category_study_material":"Study Materials","priceTypeFree":"Free","priceTypePaid":"Paid","price":"Price","pricePlaceholder":"e.g. 10,000 KRW","uploadTitlePlaceholder":"e.g. Dot product practice sheet","uploadDescription":"Description","uploadDescriptionPlaceholder":"Describe the material and how to use it...","uploadFile":"Attach file (optional)","uploadSubmit":"Publish","uploading":"Publishing...","download":"Download","postedAt":"posted","noPosts":"No posts yet. Be the first to share!","searchPlaceholder":"Search title or description","prevPage":"Previous","nextPage":"Next","pageOf":"Page {current} of {total}","scrollToTop":"Scroll to top","signInToPost":"Sign in to upload materials.","errorLoad":"Failed to load posts.","errorPublish":"Failed to publish. 
Try again.","errorPriceRequired":"Please enter the price for paid posts.","backToFeed":"Back to feed","postedAnUpdate":"posted an update","postLabel":"Post","inThisPost":"In this post","replyPlaceholder":"Reply to {name}'s post","replyComingSoon":"Replies are coming soon.","errorPostNotFound":"Post not found.","deletePost":"Delete post","deleteConfirm":"Delete this post?","errorDelete":"Failed to delete.","editPost":"Edit post","comments":"Comments","commentPlaceholder":"Write a comment","commentSubmit":"Post","commentSubmitting":"Posting…","commentEdit":"Edit","commentDelete":"Delete","commentDeleteConfirm":"Delete this comment?","commentCancel":"Cancel","commentSave":"Save","noComments":"No comments yet.","errorComment":"Failed to post comment.","errorCommentEdit":"Failed to update.","errorCommentDelete":"Failed to delete.","removeFile":"Remove","editForbidden":"You don't have permission to edit.","backToPost":"Back to post","currentFile":"Current","removeFileLabel":"Remove attachment"},"curriculum":{"title":"Book reading","listTitle":"Book reading","listSubtitle":"Create and share book-based learning roadmaps. Browse recommended book reading.","createNew":"New book reading","newTitle":"Create book reading","subtitle":"Search for a textbook and get a learning roadmap so you can follow the track to reach your learning goal.","searchBooks":"Search books","autocompleteLabel":"Autocomplete","searchResults":"Select from search results","searchResultsEmpty":"Search for books to see results here.","requiredBookTitle":"Please enter the book title. (Required)","aiAutoLabel":"AI auto-generate","generateHint":"After entering the book title, click the button and AI will generate a learning roadmap.","generateWithAI":"Generate book reading with AI","fillRequiredToGenerate":"Enter a book title to enable this button.","resultEmptyHint":"Click \"Generate book reading with AI\" above to fill this area. 
You can edit and save.","requiredToSave":"Please enter both book title and book reading content to save.","searchPlaceholder":"Search by book title, author, or topic…","searchButton":"Search","searching":"Searching…","noBooks":"No results. Try a different search.","selectBook":"Create book reading from this book","editBookInfo":"Book info (editable)","searchOrManualHint":"Search for a book to select it, or enter the details below. You can create a book reading with just a title if the book is not in the catalog.","bookTitle":"Book title","bookTitlePlaceholder":"e.g. Introduction to Deep Learning","bookImageUrl":"Book cover image URL","isbnPubdate":"ISBN / Publication date","bookInfo":"Book information","bookDescription":"Book description","isbn":"ISBN","pubdate":"Publication date","generating":"Generating book reading…","generateError":"Failed to generate book reading. Please try again.","searchError":"Book search failed.","optionalRequest":"Additional request (optional)","optionalRequestPlaceholder":"e.g. For beginners, 2-week course, focus on understanding ML…","resultTitle":"Generated learning roadmap","shortDescription":"Short description (shown in list)","shortDescriptionPlaceholder":"e.g. Step-by-step learning roadmap from basics to advanced","shortDescriptionHint":"Shown as preview on the list. Leave empty to use content summary.","editCurriculum":"Edit the content below if needed, then save.","save":"Save","saving":"Saving…","saveSuccess":"Saved.","saveError":"Failed to save.","signInToSave":"Sign in to save.","author":"Author","publisher":"Publisher","sortNewest":"Newest","sortOldest":"Oldest","sortPopular":"Popular","curriculaSearchPlaceholder":"Search title or summary","prevPage":"Previous","nextPage":"Next","pageOf":"Page {current} of {total}","scrollToTop":"Scroll to top","noCurricula":"No saved book reading yet. 
Create one!","notFound":"Book reading not found.","like":"Recommend","likes":"Recommends","createdBy":"Created by","anonymous":"Anonymous","edit":"Edit","delete":"Delete","deleteConfirm":"Delete this book reading?","editCurriculumMenu":"Menu","editTitle":"Edit book reading","cancel":"Cancel","backToCurriculum":"Back to book reading","backToDetail":"Back to detail","editForbidden":"Only the author can edit."},"auth":{"loading":"Loading...","signIn":{"title":"Sign in","subtitle":"Enter your email or username and password.","identifierLabel":"Email or username","identifierPlaceholder":"Enter email or username","passwordLabel":"Password","passwordPlaceholder":"Enter password","submit":"Continue","submitting":"Signing in...","noAccount":"Don't have an account?","signUpLink":"Sign up"},"signUp":{"title":"Create your account","subtitle":"Please fill in the details below to get started.","usernameLabel":"Username","usernamePlaceholder":"4–64 characters, letters and numbers","usernameRules":"4–64 characters, Latin letters only. Special characters ^ $ ! . ` # + ~ are not allowed.","emailLabel":"Email address","emailPlaceholder":"Enter your email address","passwordLabel":"Password","passwordPlaceholder":"Enter your password","submit":"Continue","submitting":"Processing...","hasAccount":"Already have an account?","signInLink":"Sign in"},"verifyEmail":{"title":"Verify your email","subtitleSignIn":"Enter the verification code sent to your email.","subtitleSignUp":"Enter the verification code sent to your email address.","codeLabel":"Verification code","codePlaceholder":"Enter verification code","submit":"Verify","submitting":"Verifying...","verifyButton":"Verify","back":"Back","backSignIn":"Sign in another way"},"errors":{"generic":"Something went wrong. Please try again.","username_length":"Username must be between 4 and 64 characters.","username_non_number":"Username must contain at least one non-numeric character (e.g. 
a letter).","username_latin_only":"Usernames can only use Latin letters (e.g. English). You can set a display name in your preferred language after sign-up.","password_length":"Please check the password length requirements.","form_identifier_exists":"This email or username is already in use.","form_identifier_not_found":"No account found with this identifier.","form_password_incorrect":"Incorrect password.","form_code_incorrect":"Invalid verification code.","form_password_compromised":"A security issue was detected with your password. Please sign in using another method, such as email verification.","user_locked":"Sign-in is temporarily locked. Please try again later.","display_name_min_length":"Display name must be at least 4 characters.","second_factor_not_supported":"This app only supports password sign-in. If multi-factor authentication (MFA) is required, turn it off in the Clerk Dashboard (instance MFA policy or the user’s security settings), then try again."}},"paperReview":{"title":"AI Papers","navTitle":"AI Papers","hubTitle":"AI Papers","hubDescription":"Papers grouped by theme. Choose a category below.","hubFlatListTitle":"Published AI papers","hubFlatListLead":"Open a conference hub or go straight to each paper page.","hubFlatListCount":"{count} papers","hubFlatListPaperLabel":"Paper","scopeHeading":"Scope","keywordsHeading":"Keywords","seoTitleSuffix":"CPAL 2026 paper review | Everyone's AI","categories":{"theoreticalFoundations":{"sidebarTitle":"Theory & math","headline":"Theoretical AI & Mathematical Foundations","scope":"Papers on mathematical proofs for AI algorithms, optimization theory, functional analysis, and linear-algebraic approaches (e.g. 
work on influence functions).","keywords":"Formal proofs, optimization, algorithmic foundations, statistical learning theory"},"modelOptimization":{"sidebarTitle":"Optimization & efficiency","headline":"Model Optimization & Efficient AI","scope":"Model compression and acceleration: low-rank approximation, LoRA, quantization, pruning, and related methods.","keywords":"Compression, parameter efficiency, inference speed, memory optimization"},"coreArchitecture":{"sidebarTitle":"Architecture & algorithms","headline":"Core Architecture & Algorithms","scope":"New neural backbones and training methodology: Transformer variants, CNNs, GNNs, loss functions, optimizers, and related contributions.","keywords":"Model structure, deep learning architecture, learning algorithms"},"predictiveTabular":{"sidebarTitle":"Tabular & prediction","headline":"Predictive Modeling & Tabular Data","scope":"Tree-based models, tabular classification/regression, churn, sports analytics, Kaggle-style and business forecasting papers.","keywords":"Machine learning, time series, tabular data, predictive modeling"},"automatedMl":{"sidebarTitle":"AutoML & ML pipelines","headline":"Automated ML & End-to-End ML Pipelines","scope":"AutoML, neural architecture search, hyperparameter/model search, meta-learning, and automation that ties preprocessing, training, evaluation, and deployment—including natural-language-driven tooling.","keywords":"AutoML, HPO, NAS, meta-learning, MLOps, pipeline automation"},"visionMultimodal":{"sidebarTitle":"Vision & multimodal","headline":"Computer Vision & Multimodal","scope":"Face analysis, object detection, segmentation, and multimodal models combining vision and language.","keywords":"Vision, image analysis, multimodal deep learning"},"nlpLlm":{"sidebarTitle":"NLP & LLMs","headline":"NLP & Large Language Models","scope":"Language modeling, text classification, translation, multilingual NLP, prompting, RAG, and other text-centric AI work.","keywords":"LLM, NLU/NLG, text 
mining"},"trustworthyXai":{"sidebarTitle":"Trust & XAI","headline":"Trustworthy AI & XAI","scope":"Interpretability, robustness to outliers, data attribution, AI ethics and safety.","keywords":"Explainability, robustness, model diagnostics, trustworthy AI"},"dataCentricFeatures":{"sidebarTitle":"Data-centric & features","headline":"Data-Centric AI & Feature Engineering","scope":"Improving performance via data quality, feature design, augmentation, and noisy-label handling rather than only model structure.","keywords":"Preprocessing, feature engineering, data augmentation"},"edgeWebServices":{"sidebarTitle":"Edge & web AI","headline":"AI Services & Edge/Web Computing","scope":"Browser inference (e.g. TensorFlow.js), on-device and mobile deployment, extensions, and edge serving.","keywords":"On-device AI, web AI, deployment optimization"},"domainApplications":{"sidebarTitle":"Domain applications","headline":"Domain-Specific Applications","scope":"Applied deep learning in education, coaching, recommendation, healthcare, personalization, and similar domains.","keywords":"Educational AI, recommender systems, healthcare, personalization"}},"papers":{"sidebarYear2025":"2025","sidebarYear2026":"2026","sidebarVenueCpal":"CPAL","sidebarVenueIcml":"ICML","sidebarVenueIclr":"ICLR","cpal2026":{"sidebarLabel":"CPAL2026","hubTitle":"CPAL2026","hubDescription":"CPAL 2026 papers under Theoretical AI & Mathematical Foundations.","metaTitle":"CPAL2026","metaDescription":"CPAL 2026 paper hub under the theory & mathematical foundations category."},"nlpCpal2026":{"hubTitle":"CPAL2026","hubDescription":"CPAL 2026 papers under NLP & Large Language Models.","metaTitle":"CPAL2026","metaDescription":"CPAL 2026 paper hub under the NLP & large language models category."},"influenceKernelVonMises":{"sidebarTitle":"Kernel von Mises Formula of the Influence Function","title":"Kernel von Mises Formula of the Influence Function","placeholder":"Review content coming soon.","metaTitle":"Kernel von 
Mises Formula of the Influence Function","metaKeywords":"influence function, kernel von Mises, CPAL 2026, paper review, statistical learning, robust statistics","metaDescription":"CPAL 2026 review: Kernel von Mises Formula of the Influence Function—influence functions, kernels, and key takeaways."},"curseDepthLlm":{"sidebarTitle":"The Curse of Depth in Large Language Models","title":"The Curse of Depth in Large Language Models","placeholder":"Review content coming soon.","metaTitle":"The Curse of Depth in Large Language Models","metaKeywords":"LLM, curse of depth, LayerNorm scaling, CPAL 2026, large language models, transformers","metaDescription":"CPAL 2026 review: The Curse of Depth in Large Language Models—depth pathology, LayerNorm scaling, and training fixes."},"polarQuant":{"sidebarTitle":"PolarQuant: Quantizing KV Caches with Polar Transformation","title":"PolarQuant: Quantizing KV Caches with Polar Transformation","description":"A deep dive into PolarQuant, which removes normalization overhead by quantizing polar angles of KV caches after random preconditioning.","placeholder":"Review content coming soon.","viewOriginalPdf":"View original PDF","metaTitle":"PolarQuant paper review | KV cache quantization (arXiv 2502.02617)","metaKeywords":"PolarQuant, arXiv 2502.02617, KV cache, KV cache quantization, LLM inference, long-context LLM, attention, VRAM, polar transformation, random preconditioning, angle quantization, INT4, FP16, ModuDL","metaDescription":"arXiv 2502.02617 PolarQuant explained: random preconditioning + polar angles for 4.2x+ KV-cache compression, LLM inference memory, formulas and benchmarks."},"coreCpal2026":{"hubTitle":"CPAL2026","hubDescription":"CPAL 2026 papers in the Core Architecture & Algorithms category.","metaTitle":"CPAL2026","metaDescription":"CPAL 2026 paper hub for Core Architecture & Algorithms."},"alphaFormerEndToEnd":{"sidebarTitle":"AlphaFormer: End-to-End Symbolic Regression of Alpha Factors with 
Transformers","title":"AlphaFormer: End-to-End Symbolic Regression of Alpha Factors with Transformers","description":"Deep dive into AlphaFormer: synthetic pre-training, linear alpha pools, IC metrics, and PPO-stabilized generation of interpretable symbolic factors via Transformers.","placeholder":"Review content coming soon.","viewOriginalPdf":"View original PDF","metaTitle":"AlphaFormer: End-to-End Symbolic Regression of Alpha Factors with Transformers","metaKeywords":"AlphaFormer, alpha factors, symbolic regression, transformers, CPAL 2026, quantitative finance, PPO","metaDescription":"CPAL 2026 AlphaFormer review: end-to-end symbolic regression of alpha factors—pools, IC, PPO, formulas, and intuition."},"icml2025":{"sidebarLabel":"ICML 2025"},"iclr2025":{"sidebarLabel":"ICLR 2025","hubTitle":"ICLR 2025","hubDescription":"ICLR 2025 papers in the Automated ML & ML Pipelines category.","metaTitle":"ICLR 2025","metaDescription":"ICLR 2025 paper hub under Automated ML & ML Pipelines."},"autoMlIcml2025":{"hubTitle":"ICML 2025","hubDescription":"ICML 2025 papers in the Automated ML & ML Pipelines category.","metaTitle":"ICML 2025","metaDescription":"ICML 2025 paper hub under Automated ML & ML Pipelines."},"automlAgent":{"sidebarTitle":"AutoML-Agent: A Multi-Agent LLM Framework for Full-Pipeline AutoML","title":"AutoML-Agent: A Multi-Agent LLM Framework for Full-Pipeline AutoML","authors":"Patara Trirat, Wonyong Jeong, Sung Ju Hwang","venue":"ICML 2025","abstractHeading":"Abstract","abstract":"$19","placeholder":"Review content coming soon.","metaTitle":"AutoML-Agent: A Multi-Agent LLM Framework for Full-Pipeline AutoML","metaKeywords":"AutoML, multi-agent systems, LLM, ICML 2025, full pipeline, retrieval-augmented planning, AutoML-Agent","metaDescription":"ICML 2025 AutoML-Agent review: a multi-agent LLM framework for end-to-end AutoML from data retrieval to deployment—planning, parallel agents, and verification."},"sela":{"sidebarTitle":"SELA: Tree-Search Enhanced 
LLM Agents for Automated Machine Learning","sidebarLabel":"ICLR 2025","title":"SELA: Tree-Search Enhanced LLM Agents for Automated Machine Learning","authors":"Yizhou Chi, Yizhang Lin, Sirui Hong, Duyi Pan, Yaying Fei, Guanghao Mei, Bangbang Liu, Tianqi Pang, Jacky Kwok, Ceyao Zhang, Bang Liu, Chenglin Wu","venue":"ICLR 2025 · arXiv:2410.17238","metaTitle":"SELA: Tree-Search Enhanced LLM Agents for Automated Machine Learning","metaKeywords":"SELA, MCTS, AutoML, LLM agent, UCT-DP, tree search, ICLR 2025, arXiv:2410.17238","metaDescription":"SELA paper review: MCTS tree search for LLM-based AutoML, UCT-DP prioritization, and normalized scores (NS)—formulas and intuition."}}},"landing":{"heroTitle":"Where you learn AI the easy way","heroSubtext":"Learn step by step, the right way.","heroTagline":"The place where everyone learns AI.","forEveryone":"The platform for learning AI from the ground up—concepts, computation, and instant feedback.","heroCurriculum":"Create and share book-based learning roadmaps with other learners.","heroCommunity":"Share and download AI learning materials in the community.","ctaAbout":"What is Everyone's AI?","ctaExplore":"Deep learning","ctaMath":"Math","ctaMl":"Machine learning","ctaPaperReview":"AI papers","ctaBrowse":"Browse book reading","ctaBrowseCommunity":"Browse community","trendingLabel":"Quick access","recentChaptersSectionLabel":"New lessons","recentChaptersTitle":"Recently added chapters","recentChaptersSubtitle":"See what’s new and jump straight into the latest material.","recentChaptersCardCta":"Open chapter","recentChaptersRecentTooltip":"Published within the last 5 days","homeOfTitle":"The Home of AI Learning","homeOfSubtitle":"Discover step by step, practice hands-on, and learn with AI feedback.","featurePlatformTitle":"Learning platform","featurePlatformDesc":"Learn foundations, deep learning, and machine learning chapter by chapter, with no limits.","featureFasterTitle":"Move faster","featureFasterDesc":"Concepts, practice 
problems, and instant AI feedback to level up.","featureExploreTitle":"Explore all levels","featureExploreDesc":"Foundations, deep learning, and ML—step by step. We improve continuously with your feedback.","featureBadgeTitle":"Achievements & certificate","featureBadgeDesc":"Complete chapters to earn achievements and receive a certificate of completion.","featurePortfolioTitle":"Grow together","featurePortfolioDesc":"Share your learning, get the latest development news, and connect with fellow learners.","signUpCta":"Sign Up","problemTitle":"Why you need to do the math yourself","problemBody":"If you only use APIs, it's hard to explain why a model produced a given result.\n\nDot products, matrix multiplication, gradients—unless you work through these calculations yourself, it's difficult to grasp why performance dropped or where things went wrong.\n\nMost courses show only results and formulas, and don't give you enough opportunity to check the computation step by step.","solutionSectionLabel":"How it works","solutionTitle":"Learn concepts easily and solve problems. When stuck, just ask the AI","solutionIntro":"From dot product to gradient—core deep learning math, structured across 12 chapters.","solutionList":"Every chapter has concept overviews and practice problems. When wrong or stuck, you can ask the AI.","solutionBody":"When you're curious or got it wrong, you can ask the AI coach.","ctaStartLearning":"Start learning deep learning","globalPlatform":"KO · EN · JA · ZH","learnShortDesc":"Basic deep learning: 12 chapters from dot product to gradient—concepts, problems, and instant grading.","heroImageAlt":"AI learning background","dlCardTitle":"Basic Deep Learning","advMathCardTitle":"Advanced Math","learnAdvMathShortDesc":"SVD, tensors, Markov, MCMC, variational inference, Wasserstein, SDE, information geometry. 
Advanced math for generative models and optimization, chapter by chapter.","ctaAdvMath":"Advanced Math","advMlCardTitle":"Advanced Machine Learning","learnAdvMlShortDesc":"Feature engineering, PCA, SVM, boosting, XGBoost, imbalanced data, anomaly detection, DBSCAN, XAI, SHAP, time series, recommender systems. Chapter-by-chapter advanced ML.","ctaAdvMl":"Advanced ML","mlCardTitle":"Basic Machine Learning","learnMlShortDesc":"From data and features to KNN, linear and logistic regression, and recommendation systems. Learn basic machine learning chapter by chapter.","learnPaperReviewShortDesc":"Curated AI and deep learning papers by theme. Read reviews in categories such as theory, architecture, NLP, and vision.","midDlCardTitle":"Intermediate deep learning","learnMidDlShortDesc":"Weight init, Adam, regularization, CNN, ResNet, transfer learning, object detection, tokenization, RNN, LSTM, attention. Stable training and unstructured data in chapters.","ctaMidDl":"Intermediate deep learning","advDlCardTitle":"Advanced deep learning","learnAdvDlShortDesc":"Transformer, BERT, GPT, LoRA, QLoRA, RLHF, RAG, agents, GAN, diffusion, VLM, knowledge distillation, deployment. Large models and generative AI in chapters.","ctaAdvDl":"Advanced deep learning","learnMathShortDesc":"From functions, vectors, and matrices to uniform and normal distributions. Build the foundations for understanding AI.","mathCardTitle":"Basic math","midMathCardTitle":"Intermediate math","learnMidMathShortDesc":"Vectors, matrices, linear transformation, eigenvalues, gradient, Jacobian, Hessian, convex optimization, Bayes, MLE, entropy. 
Learn multivariable and uncertainty math chapter by chapter.","ctaMidMath":"Intermediate math","quickAccessTitle":"Math · Deep learning · Machine learning · AI papers","curriculumShortDesc":"Design your own book-based learning roadmap and grow together with other learners.","communityShortDesc":"Share AI and deep learning materials, get the latest development news, and connect with fellow learners.","itNews":"IT News","itNewsShortDesc":"Stay up to date with the latest AI and IT news and development trends.","coupangBannerText":"Discover a wide range of products on Coupang"},"adminPopup":{"title":"Session overview","languageNote":"The session will be conducted in Korean.","meetLinkNote":"We will send you the Google Meet link before the seminar.","freeSeminarNote":"This seminar is free.","seminarDateLabel":"Date & time","seminarDateTime":"Friday, March 27, 2026, 8:00–9:00 PM (KST)","competitionLinkLabel":"Competition","applyCta":"Apply","speakerTitle":"About the speaker","speakerPara1":"A working professional majoring in AI at Yonsei University, with hands-on experience in data-driven ML problem-solving and model improvement through AI competitions.","speakerPara2":"This session shares practical approaches and decision criteria around problem definition, analysis, and model design as required in competitions.","sessionTitle":"About this session","sessionPara1":"This session covers how to interpret and define ML problems using competition data, and how to improve models and strategies based on analysis.","sessionPara2":"Rather than listing algorithms or techniques, it focuses on how data was re-analyzed when performance fell short of expectations, and how those insights were reflected in model structure and inference strategy.","sessionPara3":"The goal is to share the strategies and thinking adopted under the constraints of an AI competition.","mainContentTitle":"Main topics","mainContent1":"Problem definition from competition data","mainContent2":"Criteria for 
connecting analysis to model design","mainContent3":"Strategy adjustments when improvement stalled","mainContent4":"Generalization and approach in a competition setting","recommendTitle":"Recommended for","recommend1":"Those unsure how to approach AI competition problems","recommend2":"Those who want to understand data analysis and model design flow","recommend3":"Those needing direction when performance improvement stalls","recommend4":"Those who want to systematize ML strategy for competitions","recommend5":"Developers who want to grow through AI competitions","dismissCheckboxLabel":"Don’t show for 3 days"},"home":{"introButton":"About the service","intro":"An AI-powered learning platform that helps beginners not get stuck on concepts and formulas. Practice calculations, get feedback from an AI coach to fix misconceptions, and understand step by step how AI learns and reasons.","problem":"Problem","advDlAskProblemContext":"Advanced Deep Learning — {chapterTitle}. Current problem:\n{problem}","problemPrompt":"Find the dot product __DOT_FORMULA__ of the vectors below.","problemPromptMatrix":"Find the value that goes in the blank (?) in the matrix product __MATRIX_AB__ below.","problemPromptLinearLayer":"Find the value that goes in the blank (?) in the linear layer __LINEAR_FORMULA__ below.","problemPromptActivation":"Given the activation function (Sigmoid, ReLU, Tanh₃), find Y for each X and fill in the blank (?).","problemPromptArtificialNeuron":"Artificial neuron: apply the given activation (ReLU, Sigmoid, or Tanh) to get Y, and fill in the blank (?).","problemPromptBatch":"Fill in the blank (?) 
in the batch operation (weight times input plus bias, add, subtract, multiply, subtract mean, sum, or mean).","prev":"Previous","next":"Next","prevChapter":"Previous chapter","nextChapter":"Next chapter","inputSectionTitle":"Your solution","askSectionTitle":"Ask a question","practicePadTitle":"Practice pad","tabletInkFabAria":"Open handwriting mode","tabletInkFabLabel":"Write","learnToolsFabAria":"Open learning tools menu","learnToolsFabLabel":"Tools","pageInkModeTitle":"Ink mode — draw directly on the page","pageInkClear":"Clear ink","pageInkModeExit":"Exit ink mode","pageInkCanvasAria":"Page ink canvas","pageInkPaletteAria":"Ink color palette","pageInkPaletteToggleAria":"Open or close color palette","pageInkScrollMode":"Scroll mode","pageInkDrawMode":"Ink mode","pageInkColorSwatchAria":"Color {color}","fabMenuLabel":"Ask menu","practicePadSeeMain":"Solve the problem on the main screen.","drawMode":"Handwrite","keyboardMode":"Type","drawHint":"Draw your solution in the area below. After drawing, click \"Grade with AI\" to get feedback.","keyboardHint":"Enter your solution or answer below. After entering, click \"Grade with AI\" to get feedback.","askDrawHint":"Write your question here by hand. After writing, click \"Ask\" to get an answer.","askKeyboardHint":"Type your question here. Click \"Ask\" to get an answer.","askPlaceholder":"e.g. Why does this formula work like this?","askSubmit":"Ask","asking":"Sending...","askResponseTitle":"Answer","drawQuestionLabel":"(Question with drawing)","askEmptyAlert":"Please draw or type your question, then click Ask.","errorAsk":"An error occurred while sending your question. Please try again.","errorAskRequest":"Ask request failed","askErrorEmptyQuestion":"Please draw or type your question.","solutionErrorNoContent":"Could not generate the solution.","solutionErrorServer":"An error occurred while generating the solution.","ariaAskInput":"Type your question","placeholder":"Enter your steps or final answer. e.g. 
a·b = 3·5 = 15","ariaKeyboardInput":"Type your solution","clear":"Clear","grade":"Grade with AI","gradeShort":"Grade","grading":"Grading...","correctAnswer":"Correct!","wrongAnswer":"Incorrect. Please try again.","wrongAnswerPanelHint":"After a wrong answer, a “why was I wrong?” hint is requested automatically. It nudges you without revealing the exact answer.","tryAgain":"Try again.","checkAnswer":"Check answer","chapterCompleteTitle":"Chapter complete!","chapterCompleteBadge":"{chapterName} achievement earned","chapterCompleteLoginHint":"If you sign in now, this chapter will be marked complete and you won't need to solve it again.","chapterCompleteSignInCta":"Sign in and save completion","chapterCompleteTryAgain":"Try again","chapterCompleteNextChapter":"Next chapter","badgeSaved":"Achievement saved.","certificateTitle":"Certificate of Completion","certificateSubtitlePrefix":"This is to certify that the person named below has completed the following courses of Everyone's AI (https://mdooai.com) ","certificateSubtitleEnd":"Learn.","certificateHolder":"Holder","certificateHolderEditHint":"You can type the name directly.","certificateHolderModalTitle":"Enter the recipient name","certificateHolderModalConfirm":"Confirm","certificateHolderModalPrint":"Print","certificateHolderEdit":"Edit","certificateCompleted":"Completed courses","certificateIssuer":"Issuer","certificateIssuerName":"Everyone's AI","certificateIssuerUrl":"https://mdooai.com","certificateDate":"Date issued","certificatePrint":"Print certificate","certificateNoBadges":"No completed chapters yet. 
Complete chapters to receive a certificate.","certificateSignInRequired":"Please sign in to issue a certificate.","certificateIssue":"Issue certificate","profileTitle":"My learning","profileBadgesSection":"Earned achievements","profileNoBadges":"No completed chapters yet.","profileCertificateLink":"Issue certificate","profileMyBadges":"My achievements","profileBadgesCta":"View my achievements / Issue certificate","badgesPageTitle":"My Achievements & Certificate","badgesPageDesc":"View your earned achievements and certificate of completion.","badgesAdminMode":"(Admin Preview)","badgesAdminModeDesc":"All achievements are shown and the full certificate is printed.","mathFunctionsProblemPrompt":"Use the function expression and input to find the missing value.","mathFunctionsProblemPromptInput":"Given f(?) = value, find x","mathFunctionsProblemPromptCompare":"Choose the larger value.","mlKnnProblemPrompt":"Use KNN distance calculations and majority vote rules to find the answer.","mlLinearRegressionProblemPrompt":"Use the linear regression equation to compute predictions and slope/intercept values.","mlLinearRegressionProblemPromptPredict":"For the linear regression model $\\hat y = w x + b$ with $w={w}$, $b={b}$, find the predicted value $\\hat y$ when $x={x}$. Enter an integer.","mlLinearRegressionProblemPromptSlope":"Find the slope $w = \\frac{y_2-y_1}{x_2-x_1}$ of the line through ({x1}, {y1}) and ({x2}, {y2}). Enter an integer.","mlLinearRegressionProblemPromptIntercept":"A line with slope $w={w}$ passes through ({x}, {y}). Find the intercept $b = y - w x$. Enter an integer.","mlLinearRegressionProblemPromptTwoPointPredict":"The line through ({x1}, {y1}) and ({x2}, {y2}) is given. Find the $y$ value on the line when $x={x}$. Enter an integer.","mlLinearRegressionProblemPromptResidual":"The line $\\hat y={w}x+{b}$ predicts values. The actual observation is at ({x}, {y}). Find the residual $y - \\hat y$. 
Enter an integer.","mlLinearRegressionProblemPromptResidualSum":"Points {points} and line $\\hat y={w}x+{b}$. Find the sum of residuals $\\sum_i (y_i - \\hat y_i)$. Enter an integer.","mlMseProblemPrompt":"Compute squared error, SSE, MSE, and RMSE to find the answer.","mlMseProblemPromptSquaredError":"When actual $y={y}$ and prediction $\\hat y={yHat}$, find the squared error $(y - \\hat y)^2$. Enter an integer.","mlMseProblemPromptSse":"For the following (actual, prediction) pairs, find the sum of squared errors $\\sum_i (y_i - \\hat y_i)^2$. {pairs} Enter an integer.","mlMseProblemPromptMse":"For the following (actual, prediction) pairs, find the mean squared error MSE $= \\frac{1}{n}\\sum_i (y_i - \\hat y_i)^2$. {pairs} Enter an integer.","mlMseProblemPromptMseFromLine":"Points {points} and line $\\hat y={w}x+{b}$. Find the MSE. Enter an integer.","mlMseProblemPromptMissingSquaredError":"MSE $= {mse}$, $n = {n}$, and $n-1$ squared errors are {squaredErrors}. Find the remaining squared error. Enter an integer.","mlMseProblemPromptRmse":"When MSE $= {mse}$, find RMSE $= \\sqrt{\\text{MSE}}$. Enter an integer.","mlMseProblemSolvingTable":"$1a","mlLogisticProblemPrompt":"Use logistic regression linear scores and decision boundaries to find the prediction.","mlLogisticProblemPromptLinearScore":"In the linear score $z = wx + b$ of logistic regression, when $w={w}$, $x={x}$, $b={b}$, find $z$ as an integer.","mlLogisticProblemPromptMultiScore":"In the linear score $z = w_1 x_1 + w_2 x_2 + b$, when weights are {weights}, features are {features}, and $b={b}$, find $z$ as an integer.","mlLogisticProblemPromptClassifyFromZ":"When the linear score $z = {z}$, according to the decision boundary ($z>0 \\Rightarrow \\hat y=1$, $z \\le 0 \\Rightarrow \\hat y=0$), find the predicted class $\\hat y$.","mlLogisticProblemPromptClassifyFromProb":"When probability $p = {p}$ and threshold $= {threshold}$, if $p \\ge$ threshold then $\\hat y=1$, otherwise $\\hat y=0$. 
Find the predicted class $\\hat y$.","mlLogisticProblemPromptCountClassOne":"For the following linear scores, we classify as class 1 when $z>0$. Find the number of samples classified as class 1 as an integer. $z$ list: {zList}","mlLogisticProblemPromptCountMisclassified":"When the true labels are {labels} and the linear score $z$ for each sample is {zList}, we predict $\\hat y_i = 1$ if $z_i>0$ else $0$. Find the number of misclassified samples.","mlLogisticProblemSolvingTable":"$1b","mlDecisionTreeProblemPrompt":"Apply decision-tree split rules and impurity metrics to find the answer.","mlDecisionTreeProblemPromptCountNodes":"In a decision tree there are {internal} internal nodes and {leaves} leaf nodes. Find the total number of nodes.","mlDecisionTreeProblemPromptCountLeaves":"In a decision tree there are {leaves} leaf nodes. Find the number of leaf nodes.","mlDecisionTreeProblemPromptTreeDepth":"The maximum depth of the tree (root = 0) is {depth}. Find the depth.","mlDecisionTreeProblemPromptFollowPath":"In the decision tree, the path is {path} (0 = no/left, 1 = yes/right). Find the predicted class at the leaf you reach.","mlDecisionTreeProblemPromptLeafMajority":"At one leaf, class 0 has {c0} samples and class 1 has {c1}. Find the predicted class by majority vote.","mlDecisionTreeProblemPromptGini":"When class counts are {counts}, compute Gini impurity $G = 1 - \\sum_i p_i^2$ and find $100 \\times G$ (integer).","mlDecisionTreeProblemPromptEntropy":"When class counts are {counts}, compute entropy $H = -\\sum_i p_i \\log_2 p_i$ and find $100 \\times H$ (integer).","mlDecisionTreeProblemPromptInformationGain":"Parent node class counts {parentCounts}, left child {leftCounts}, right child {rightCounts}. Find $100 \\times \\text{IG}$ (integer).","mlDecisionTreeProblemPromptWeightedGini":"After split: left child class counts {leftCounts}, right child {rightCounts}. 
Find $100 \\times$ weighted Gini $(n_L/n)G_L + (n_R/n)G_R$ (integer).","mlDecisionTreeProblemSolvingLabel":"Explanation for solving the problems","mlDecisionTreeProblemSolvingTable":"**Decision tree — solving guide**\n\n- **Node count** — Add the number of internal nodes and leaf nodes.\n- **Leaf count** — Read the number given as the leaf count.\n- **Depth** — Read the maximum depth value with root = 0.\n- **Follow path** — Start at the root, move left for 0 and right for 1, then read the prediction at the leaf you reach.\n- **Gini** — Get $p_i$ from class counts, compute $G = 1 - \\sum_i p_i^2$, then compute $100 \\times G$.\n- **Entropy** — Compute $H = -\\sum_i p_i \\log_2 p_i$, then compute $100 \\times H$.\n- **Weighted Gini** — Compute $(n_L/n)G_L + (n_R/n)G_R$, then compute $100 \\times$ that value.\n- **Leaf majority** — Compare the counts of class 0 and class 1, and use the larger side as the prediction.","mlEnsembleProblemPrompt":"Apply ensemble voting/averaging rules to find the final prediction.","mlEnsembleProblemSolvingLabel":"Explanation for solving the problems","mlEnsembleProblemPromptMajorityVote":"In a random forest, class 0 received {votes0} votes and class 1 received {votes1} votes. Find the final predicted class by majority vote.","mlEnsembleProblemPromptCountVotes":"There are {totalTrees} trees; class 0 has {votes0} votes and class 1 has {votes1} votes. Find the number of votes for the winning class.","mlEnsembleProblemPromptRegressionMean":"In a regression ensemble, {B} trees predicted {predictions}. Find the mean $\\hat y = \\frac{1}{B}\\sum_{b=1}^B \\hat y_b$ (integer).","mlEnsembleProblemPromptNumTrees":"In a random forest there are {B} trees. Find the number of trees $B$.","mlEnsembleProblemPromptOobCount":"There are {nTrees} trees, and a sample was in the bootstrap of only {nInBag} of them. 
Find the number of trees that did not use this sample (OOB count).","mlEnsembleProblemPromptFormulaMean":"In an ensemble, {B} trees have predictions summing to {sum}. Find the mean $\\hat y = \\frac{1}{B}\\sum_{b=1}^B \\hat y_b$ (integer).","mlEnsembleProblemPromptDefinition":"If the following statement is true enter 1, otherwise 0. {statement}","mlEnsembleProblemPromptFeatureImportance":"Feature importances are {importances}. Find the index (starting from 1) of the feature with the highest importance.","mlEnsembleProblemPromptWeightedVote":"There are 2 trees: the first gives class {c1} weight {w1}, the second gives class {c2} weight {w2}. Find the final prediction by the class with the larger weight.","mlEnsembleStatement_0":"In bagging, each base model is trained independently.","mlEnsembleStatement_1":"Random forest is an ensemble that combines bagging and decision trees.","mlEnsembleStatement_2":"In classification ensembles, the final prediction is usually by majority vote.","mlEnsembleStatement_3":"In boosting, later models focus on samples that previous models got wrong.","mlEnsembleStatement_4":"OOB (Out-of-Bag) means predicting a sample using only trees that did not have it in their bootstrap sample.","mlEnsembleStatement_5":"In stacking, a meta-model uses the predictions of base models as input.","mlEnsembleStatement_6":"In regression ensembles, the final prediction is usually the average of tree predictions.","mlEnsembleStatement_7":"In random forest, at each split only a random subset of features is considered.","mlEnsembleStatement_8":"An ensemble combines predictions from multiple models into one prediction.","mlEnsembleStatement_9":"Random forest tends to reduce variance compared to a single decision tree.","mlEnsembleStatement_10":"In boosting, each base model is trained independently.","mlEnsembleStatement_11":"In regression ensembles, the final prediction is by majority vote.","mlEnsembleStatement_12":"OOB evaluation requires a separate validation 
set.","mlEnsembleStatement_13":"In random forest, each tree is trained on the full training data.","mlEnsembleStatement_14":"In stacking, the meta-model uses only the original input features of the base models.","mlEnsembleProblemSolvingTable":"$1c","mlKmeansProblemPrompt":"Compute K-Means distances, center updates, and SSE to find the answer.","mlKmeansProblemPromptDistanceSquared":"Two points ({x1}, {y1}) and ({x2}, {y2}). Find the squared Euclidean distance $(x_2-x_1)^2+(y_2-y_1)^2$ (integer).","mlKmeansProblemPromptAssignCluster":"Point ({px}, {py}) with centers {centers}. Find the cluster number (from 1) of the nearest center.","mlKmeansProblemPromptCenterMeanX":"When the points in a cluster are {points}, find the $x$ coordinate of the new center (mean, integer).","mlKmeansProblemPromptCenterMeanY":"When the points in a cluster are {points}, find the $y$ coordinate of the new center (mean, integer).","mlKmeansProblemPromptSse":"Cluster points {points}, center ({cx}, {cy}). Find SSE $\\sum_i \\|\\mathbf{x}_i - \\boldsymbol{\\mu}\\|^2$ (integer).","mlKmeansProblemPromptNumClusters":"In K-Means, $K = {K}$. Find the value of $K$.","mlKmeansProblemPromptDefinition":"If the following statement is true enter 1, otherwise 0. 
{statement}","mlKmeansStatement_0":"K-Means is unsupervised learning.","mlKmeansStatement_1":"In K-Means, the user sets the number of clusters K.","mlKmeansStatement_2":"K-Means minimizes the within-cluster sum of squared distances (SSE).","mlKmeansStatement_3":"In the assignment step, each point is assigned to the nearest center.","mlKmeansStatement_4":"In the update step, the new center is the mean of the points in the cluster.","mlKmeansStatement_5":"K-Means forms clusters from data without labels.","mlKmeansStatement_6":"K-Means uses Euclidean distance (or squared distance) for comparison.","mlKmeansStatement_7":"K-Means repeats assignment and update until convergence.","mlKmeansStatement_10":"K-Means is supervised learning.","mlKmeansStatement_11":"K-Means automatically chooses K.","mlKmeansStatement_12":"K-Means maximizes the number of clusters.","mlKmeansStatement_13":"In the assignment step, points are assigned to clusters at random.","mlKmeansStatement_14":"In the update step, the new center is the median of the cluster.","mathExponentialProblemPrompt":"Find the value of the exponential","mathExponentialProblemPromptExponent":"Find the exponent","mathExponentialProblemPromptCompare":"Choose the larger one.","mathExponentialProblemPromptProduct":"Same base product: find the exponent sum","mathExponentialProblemPromptQuotient":"Same base quotient: find the exponent difference","mathExponentialProblemPromptPowerOfPower":"Find the value of the power of a power.","mathLogProblemPrompt":"Find the value of the logarithm","mathLogProblemPromptInput":"Find the argument","mathLogProblemPromptCompare":"Choose the larger value.","mathLogProblemPromptSum":"Log sum: $\\log_a(b) + \\log_a(c) = \\log_a(b \\cdot c)$.","mathLogProblemPromptDiff":"Log difference: $\\log_a(b) - \\log_a(c) = \\log_a(b/c)$.","mathLimitProblemPrompt":"Find the limit (Types: polynomial, constant, x→∞, ε-δ concept)","mathLimitProblemPromptDirect":"Find the 
limit","mathLimitProblemPromptConstant":"Find the limit of the constant.","mathLimitProblemPromptLinear":"Find the limit of the linear expression.","mathLimitProblemPromptAtInfinity":"Find the limit as x → ∞.","mathLimitProblemPromptEpsilon":"Choose the number that matches the ε-δ definition.","mathLimitProblemEpsilonQuestion":"In ε-δ, what does δ represent?","mathLimitProblemEpsilonHint":"(1=distance, 2=error)","mathContinuityProblemPrompt":"Continuity: find the limit or continuity","mathContinuityProblemPromptLimitPoly":"Polynomial is continuous, so limit = function value.","mathContinuityProblemPromptLimitLinear":"Find the limit of the linear expression (equals the function value).","mathContinuityProblemPromptYesNo":"Choose 1 if continuous at that point, 0 if discontinuous.","mathContinuityProblemPromptLimitAtHole":"Find the limit at the point where there is a hole.","mathContinuityProblemAtPoint":" at ","mathContinuityProblemContinuousQ":" continuous?","mathContinuityProblemLimitAtHoleIntro":"A function with a hole at","mathContinuityProblemLimitAtHoleQ":"has limit = ?","mathDerivativeProblemPrompt":"Derivative: find the derivative (tangent slope) at the given point","mathDerivativeProblemPromptPower":"Power rule: $(x^n)' = n x^{n-1}$. Find $f'(x)$ at the given point.","mathDerivativeProblemPromptLinear":"Linear: $(mx+b)' = m$. Find $f'(x)$ at the given point.","mathDerivativeProblemPromptPoly2":"Quadratic derivative. Find $f'(x)$ at the given point.","mathDerivativeProblemPromptConstMul":"Constant multiple: $(c \\cdot x^n)' = c \\cdot n \\cdot x^{n-1}$. 
Find $f'(x)$ at the given point.","mathDerivativeProblemAtPoint":" at","mathChainRuleProblemPrompt":"Chain rule: find $f'(x)$ at the given point (Types: power, exponential, trig, sqrt, ln, quadratic)","mathPartialGradientProblemPrompt":"Partial derivative & gradient: find the required component","mlKnnProblemSolvingTable":"**Steps**\n\n- **Input** — New feature vector $\\mathbf{x}$\n- **Stored** — Labeled examples $(\\mathbf{x}_i, y_i)$\n- **1** — Compute distance $d(\\mathbf{x}, \\mathbf{x}_i)$ to each $\\mathbf{x}_i$\n- **2** — Select K smallest distances\n- **3 (classification)** — Predict by **majority vote** of the K labels\n- **3 (regression)** — Predict **average** of the K values\n\n---\n\n**Example (distance squared)**\n\nTwo points A(0, 0) and B(3, 4) lie in the plane. Find the distance squared $(x_2-x_1)^2 + (y_2-y_1)^2$.\n\n**Solution**\n\n$(3-0)^2 + (4-0)^2 = 9 + 16 = 25$, so the answer is **25**.","mlLinearRegressionProblemSolvingTable":"$1d","mathIntegralProblemPrompt":"Integral: find the definite integral or antiderivative value","mathIntegralProblemPromptDefiniteConst":"Find the definite integral of the constant function.","mathIntegralProblemPromptDefiniteLinear":"Find the definite integral of the linear function.","mathIntegralProblemPromptAntiderivative":"Find the value of the antiderivative at the given point.","mathRandomVariableProblemPrompt":"Follow the instruction below","mathRandomVariableProblemPromptProbSumSix":"Find the blank c so the three probabilities sum to 1.","mathRandomVariableProblemPromptExpectedValueScale6":"Find 6×E[X] = Σ(value × numerator).","mathRandomVariableProblemPromptVarianceShort":"Find 36 times the variance for the distribution below.","mathRandomVariableProblemVarianceHowToCalc":"Variance = how spread out values are from the average. 
Variance = E[X²]−(E[X])², 36×variance = 6×Σ(nᵢ·xᵢ²) − (Σ nᵢ·xᵢ)²","mathRandomVariableProblemVarianceLabel":"36×variance","mathRandomVariableProblemPromptVarianceScale36":"For the same distribution, Var(X)=E[X²]-E[X]². Find 36×Var(X). (6×Σ(nᵢ·xᵢ²) − (Σ nᵢ·xᵢ)²)","mathRandomVariableProblemPromptVarianceIntro":"For the same distribution, ","mathRandomVariableProblemPromptVarianceMid":". Find ","mathRandomVariableProblemPromptVarianceEnd":". (6×Σ(nᵢ·xᵢ²) − (Σ nᵢ·xᵢ)²)","mathRandomVariableProblemPromptVarianceAsk":". ","mathRandomVariableProblemPromptVarianceFormula":"(6×Σ(nᵢ·xᵢ²) − (Σ nᵢ·xᵢ)²)","mathRandomVariableProblemProbSumHint":"c","mathRandomVariableProblemExpectationHint":"Sum of (value × numerator)","mathRandomVariableProblemVarianceHint":"36×Var(X)","mathRandomVariableProblemPromptMode":"Which value of X has the highest probability? (mode)","mathRandomVariableProblemPromptExpectedValueInt":"Find the expected value E[X] (average value).","mathRandomVariableProblemPromptCumulativeNumerator":"When the probability that X is at most the given value is written as numerator/6, choose the numerator.","mathRandomVariableProblemModeLabel":"Most likely X","mathRandomVariableProblemExpectedValueIntLabel":"E[X]","mathRandomVariableProblemCumulativeLabel1":"P(X≤1) = ?/6 → ?","mathRandomVariableProblemCumulativeLabel2":"P(X≤2) = ?/6 → ?","mathMeanVarianceProblemPrompt":"Follow the instructions below","mathMeanVarianceProblemPromptProbSumSix":"Find the blank c so that the three probabilities sum to 1.","mathMeanVarianceProblemPromptMeanScale6":"Find 6×E[X] = Σ(value×numerator).","mathMeanVarianceProblemPromptVarianceShort":"Find 36×variance for the following distribution.","mathMeanVarianceProblemVarianceHowToCalc":"Variance = spread around the mean. 
36×variance = 6×Σ(nᵢ·xᵢ²) − (Σ nᵢ·xᵢ)²","mathMeanVarianceProblemVarianceLabel":"36×variance","mathMeanVarianceProblemPromptVarianceScale36":"Find 36×Var(X) for the same distribution.","mathMeanVarianceProblemProbSumHint":"c","mathMeanVarianceProblemMeanScale6Label":"6×mean","mathMeanVarianceProblemMeanIntegerLabel":"Mean E[X]","mathMeanVarianceProblemPromptMeanInteger":"Find the mean (expected value) E[X].","mathMeanVarianceProblemPromptMode":"Find the X value with the highest probability (mode).","mathMeanVarianceProblemPromptCumulativeNumerator":"When P(X≤given) is written as numerator/6, choose the numerator.","mathMeanVarianceProblemModeLabel":"Most likely X","mathMeanVarianceProblemCumulativeLabel1":"P(X≤1) = ?/6 → ?","mathMeanVarianceProblemCumulativeLabel2":"P(X≤2) = ?/6 → ?","mathUniformNormalProblemPrompt":"Follow the instructions below","mathUniformNormalProblemPromptUniformMean":"For the uniform distribution on [a,b], find the mean (a+b)/2.","mathUniformNormalProblemPromptUniformVar12":"For uniform U[a,b], find 12×variance = (b−a)².","mathUniformNormalProblemPromptUniformLength":"Find the length of the interval [a,b], i.e. 
b−a.","mathUniformNormalProblemPromptNormalPct68":"In a normal distribution, about what percent of values lie within μ±σ?","mathUniformNormalProblemPromptNormalPct95":"In a normal distribution, about what percent of values lie within μ±2σ?","mathIntegralProblemAntiderivativeIntro":"Given that","mathIntegralProblemAntiderivativeAt":" what is the value at ","mathIntegralProblemAntiderivativeQ":"?","mathPartialGradientProblemAtPoint":"at","mathPartialGradientProblemGradientFirst":"First component","mathPartialGradientProblemGradientSecond":"Second component","wrongAnswerGuideButton":"Why was it wrong?","wrongAnswerGuideTitle":"Wrong answer guide","wrongAnswerGuideSubmittedAnswer":"Your answer:","wrongAnswerGuideHint":"The AI will infer why you solved it that way and guide you in the right direction without giving the answer.","wrongAnswerGuideApiQuestion":"The user got the problem wrong with the answer \"{answer}\". Infer why they might have solved it that way and guide them in the right direction without revealing the correct answer.","wrongAnswerGuideAsking":"Getting guide...","wrongAnswerQuestionPrompt":"I answered {answer}. Why was it wrong?","getSolution":"Get solution","loadingSolution":"Loading...","feedbackTitle":"AI grading feedback","solutionTitle":"Solution","alertDrawFirst":"Please draw your solution before grading.","alertInputFirst":"Please enter your solution before grading.","errorGrade":"An error occurred while grading. Please try again.","errorSolution":"An error occurred while loading the solution. 
Please try again.","errorGradeRequest":"Grading request failed","errorSolutionRequest":"Solution request failed","errorStream":"Could not read stream.","errorDefault":"Could not generate feedback.","placeholderChapter":"This chapter is coming soon.","conceptVisualPlaceholder":"A visualization for this concept is coming soon.","conceptComingSoon":"Learning content for this concept will be added in a future update.","conceptMatrixMulIntro":"One row of A · one column of B (dot product) → one entry of the result matrix","conceptMatrixMulCell":"This entry","conceptLinearLayerIntro":"Multiply input X by weight matrix W and add bias b to get output Y. __LINEAR_FORMULA__","conceptLinearLayerLegendRow0":"W row 1·X + b[0] → Y[0]","conceptLinearLayerLegendRow1":"W row 2·X + b[1] → Y[1]","conceptArtificialNeuronIntro":"An artificial neuron computes the weighted sum __WEIGHTED_SUM_FORMULA__ , then applies an activation (e.g. ReLU, Sigmoid, or Tanh) to produce output Y.","conceptArtificialNeuronCalcCaption":"Step by step: (W·X) multiplied + b added = Z → ReLU(Z) = Y","conceptBatchIntro":"A batch stacks multiple samples as columns of a matrix. The same W and b are applied at once: __LINEAR_FORMULA__ .","conceptBatchCaption":"One column = one sample. Same W and b applied to all columns at once.","conceptBatchExampleTitle":"Example: calculation for one column (sample)","conceptBatchFormulaRow":"Z{n} = (W row {row}·this column)+b[{bi}] = ({calc})+({b}) = {result}","conceptConnectionIntro":"Connections describe how neurons in one layer link to the next. 
Only non-zero weights are actual links; the graph below shows those partial connections.","conceptConnectionGraphCaption":"Connection structure (zero-weight links omitted)","conceptConnectionCalcCaption":"Each output: (W row·X) multiplied + b added = Y","conceptConnectionFormulaRow1":"Y₁ = (W row 1·X) + b₁ = ({calc}) + {b} = {wx} + {b} = {y}","conceptConnectionFormulaRow2":"Y₂ = (W row 2·X) + b₂ = ({calc}) + {b} = {wx} + {b} = {y}","conceptActivationTitleSigmoid":"Y = Sigmoid(X)","conceptActivationTitleRelu":"Y = ReLU(X)","conceptActivationTitleTanh":"Y = Tanh₃(X)","conceptActivationTableHeader":"X ~ Y","conceptDotProductIntro":"a = [{a1}, {a2}], b = [{b1}, {b2}] → a·b = {samePositionSum}","conceptDotProductSamePositionSum":"sum of element-wise products","problemPromptConnection":"In the connection __LINEAR_FORMULA__ , find the value for the blank (?). Inputs with W=0 are not connected to that output.","conceptHiddenIntro":"A hidden layer takes input, applies a linear transform (__LINEAR_CORE__) and ReLU to produce an intermediate representation H, then applies another linear transform and ReLU to produce the final output Y.","conceptHiddenGraphCaption":"Input → Hidden (H) → Output (Y)","problemPromptHidden":"In the forward pass with a hidden layer (X → (W·X+b) → ReLU → H → (W·H+b) → ReLU → Y), fill in the blank (?).","conceptDeepIntro":"A deep network stacks many hidden layers. Each layer applies Linear (W·input + b) and ReLU to produce an intermediate representation before the next layer.","conceptDeepFormulaCaption":"Each layer: Linear & ReLU","conceptDeepFormulaWithSymbols":"Linear = W·(prev layer) + b → ReLU","conceptDeepGraphCaption":"Input (X) → Hidden (A,B,C,D) → Output (Y)","problemPromptDeep":"In the multi-layer forward pass (each layer Linear & ReLU), fill in the blank (?).","conceptWideIntro":"Width means having many neurons in one layer. 
Wider layers can express more features at once; each layer is computed with Linear & ReLU.","conceptWideFormulaCaption":"Each layer: Linear & ReLU (layer gets wider)","conceptWideGraphCaption":"Input (X) → Hidden (A,B) → Output (Y) → 1→2→4→8 neurons","problemPromptWide":"In the forward pass where layers get wider (each layer Linear & ReLU), fill in the blank (?).","conceptSoftmaxIntro":"Softmax turns numbers into values between 0 and 1 that sum to 1. Compute __WEIGHTED_SUM_FORMULA__, then __SOFTMAX_EXP__, then divide each by the sum (__SOFTMAX_SUM__) to get probabilities.","conceptSoftmaxFormulaCaption":"Z = W·X + b → e^Z (e^x) → Y = e^Z / Σ","conceptSoftmaxGraphCaption":"Often used in the final layer for multi-class classification.","problemPromptSoftmax":"Compute __SOFTMAX_FLOW__ , then fill in the blank (?).","conceptSoftmaxEHint":"In this problem we use e = 3 for easy calculation. So __E_Z_3Z__. (e.g. Z=1 → 3, Z=2 → 9)","conceptGradientIntro":"The gradient is a vector that shows the direction and rate of change of a function. To reduce loss, we update parameters in the opposite direction. Forward: __GRADIENT_FORWARD__; backward: __GRADIENT_BACKWARD__.","conceptGradientForwardLabel":"Forward","conceptGradientBackwardLabel":"Backward","conceptGradientFormulaCaption":"Forward Z = W·X → Backward dZ = dW·X","conceptGradientGraphCaption":"The same idea applies to linear layers, hidden layers, and so on.","conceptGradientBlankHint":"In the problems, the blank (?) 
is **one entry of X** or **one entry of Z** (forward) / **dZ** (backward).","conceptGradientForwardDesc":"Forward: Z = W·X (each row of W dotted with X)","conceptGradientBackwardDesc":"Backward: dZ = dW·X (same structure, gradient values)","conceptInputX":"Input X","conceptLinear":"Linear","conceptLinearReLULayer1":"Linear & ReLU (layer 1)","conceptLinearReLULayer2":"Linear & ReLU (layer 2)","conceptSoftmaxFlowCaption":"Score (__Z__) → __3Z__ → divide by sum → probability (__Y__)","conceptSoftmaxZLabel":"Z (score)","conceptSoftmaxExpLabel":"3^Z","conceptSoftmaxSumLabel":"Sum","conceptSoftmaxProblemFlow":"Score (__Z__) → __3Z__ → divide by sum (__SIGMA__) → probability (__Y__)","conceptSoftmaxProbability":"Prob.","conceptSoftmaxExampleTitle":"Example: step-by-step calculation","conceptSoftmaxStepZ":"Z{n} = (W row {row}·X)+b[{bi}] = {calc}+{b} = {result}","conceptSoftmaxStepExp":"3^Z{n} = 3^{z} = {result}","conceptSoftmaxStepSum":"Σ = {items} = {result}","conceptSoftmaxStepY":"Y{n} = 3^Z{n}/Σ = {num}/{den} = {result}","conceptWideLinearReLU1":"Linear & ReLU (layer 1, width 2)","conceptWideLinearReLU2":"Linear & ReLU (layer 2, width 4)","conceptWideLayer1Formula":"Layer 1 (width 2): H = ReLU(W₁·X + b₁)","conceptWideLayer2Formula":"Layer 2 (width 4): Y = ReLU(W₂·H + b₂)","conceptMatrixMulCellDot":"Row {row} of A · column {col} of B (one dot product)","conceptMatrixMulARow":"Row {row} of A","conceptMatrixMulBCol":"Column {col} of B","conceptBatchLinear":"Multiply the table by weights and add bias to fill the blank.","conceptBatchLinearRelu":"Multiply by weights, add bias, then set negatives to 0 and fill the blank.","conceptBatchAdd":"Add the right-hand value to each row and fill the blank.","conceptBatchSubtract":"Subtract the right-hand value from each row and fill the blank.","conceptBatchMultiply":"Multiply each row by the right-hand value and fill the blank.","conceptBatchCenter":"Subtract each row's mean from that row and fill the blank.","conceptBatchSum":"Sum 
all numbers in each row and fill the blank.","conceptBatchMean":"Find the mean (integer) of each row and fill the blank.","conceptBatchRowMeanHint":"(row mean → 0)","conceptBatchRowSumHint":"(row sum)","conceptBatchRowMeanIntHint":"(row mean, integer)","conceptRowN":"row {n}","conceptDeepLayer1Title":"Layer 1: A₁, A₂, A₃ (W₁ each row·X + b₁)","conceptDeepLayer2Title":"Layer 2: B₁, B₂, B₃ (W₂ each row·A + b₂)","conceptDeepFormulaA":"A{n} = (W₁ {row}·X)+b₁[{bi}] = ({calc})+({b}) = {linear} → ReLU = {result}","conceptDeepFormulaAZero":"A{n} = (W₁ {row}·X)+b₁[{bi}] = ({calc})+({b}) = {linear} → ReLU(-1)=0 → {result}","conceptDeepFormulaB":"B{n} = (W₂ {row}·A)+b₂[{bi}] = ({calc})+({b}) = {linear} → ReLU = {result}","conceptHiddenLayer1Title":"Layer 1: H = ReLU(W₁·X + b₁)","conceptHiddenLayer2Title":"Layer 2: Y = ReLU(W₂·H + b₂)","conceptHiddenLinear1":"Linear₁","conceptHiddenLinear2":"Linear₂","conceptHiddenFormulaL1":"{linearLabel} = (W₁ {row}·X)+b₁[{bi}] = ({calc}) + ({b}) = {linear} → ReLU = {result}","conceptHiddenFormulaL2":"{linearLabel} = (W₂ {row}·H)+b₂[{bi}] = ({calc}) + ({b}) = {linear} → ReLU = {result}","conceptWideFormulaH1":"H₁ = (W₁ {row}·X)+b₁[0] = {calc} = {linear} → ReLU = {result}","conceptWideFormulaH2":"H₂ = (W₁ {row}·X)+b₁[1] = {calc} = {linear} → ReLU = {result}","conceptWideFormulaY1":"Y₁ = (W₂ {row}·H)+b₂[0] = {calc} = {linear} → ReLU = {result}","conceptWideFormulaY2":"Y₂ = (W₂ {row}·H)+b₂[1] = {calc} = {linear} → ReLU = {result}","conceptWideFormulaY3":"Y₃ = (W₂ {row}·H)+b₂[2] = {calc} = {linear} → ReLU = {result}","conceptWideFormulaY4":"Y₄ = (W₂ {row}·H)+b₂[3] = {calc} = {linear} → ReLU = {result}","conceptGradientZLine":"Z{n} = (W {row})·X = {calc} = {z}","conceptGradientDZLine":"dZ{n} = (dW {row})·X = {calc} = {dz}","problemPromptGradient":"Fill in the blank (?) in __GRADIENT_FORWARD__ or __GRADIENT_BACKWARD__ .","tinyNNTitle":"Deep learning diagram by chapter","tinyNNDescription":"As you complete each chapter, the diagram below fills in. 
This is the structure so far.","tinyNNComplete":"By the last chapter you'll see the full picture: forward → loss → backward → update.","tinyNNAriaLabel":"Deep learning diagram progress by chapter","mathDiagramTitle":"Math diagram by chapter","mathDiagramDescription":"Select a chapter to see its diagram below. View the flow of basic math at a glance.","midMathDiagramTitle":"Math diagram by chapter","midMathDiagramDescription":"Select a chapter to see its diagram below. View the flow of intermediate math at a glance.","mathDiagramComplete":"Through Ch01 Functions you see the full input → function → output structure.","mathDiagramAriaLabel":"Math diagram by chapter","mlDiagramTitle":"ML diagram by chapter","mlDiagramDescription":"Select a chapter to see its diagram below. View the machine learning flow at a glance.","mlDiagramAriaLabel":"ML diagram by chapter","linkToPlayground":"How this computation is used in neural networks","introRoadmapHeading":"What you learn in Ch01–Ch12","mathIntroRoadmapIntro":"Understanding deep learning and machine learning requires basic math such as **functions**, **exponential and log**, **limits, derivatives, integrals**, and **probability and distributions**. Ch01–Ch12 cover exactly that. **Functions** are the basis of input→output; **derivatives and gradients** are what the model uses to decide **where and how much** to change parameters when learning; **probability and distributions** are needed for prediction and uncertainty.","midMathIntroRoadmapHeading":"What you learn in Ch01–Ch20","midMathIntroRoadmapIntro":"Intermediate math deepens the language you use to understand AI. You learn how data is represented and transformed using **vectors**, **matrices**, and **linear transformations**, then quantify similarity and direction with **dot products** and **projections**. Next, you interpret change and curvature using **Jacobians** and **Hessians**, which lets you understand the shape of the loss landscape. 
Finally, you design learning more robustly with **Taylor series** and **convex optimization**, and learn uncertainty with **Bayes**, **covariance**, and the **multivariate normal distribution**.","premiumBadge":"Premium","premiumTitle":"This is a Premium Chapter","premiumDescription":"This chapter is exclusive to Learn paid subscribers. After subscribing, you get unlimited access to all Learn chapters: concept explanations, problem sets, and AI coaching.","premiumFeature1":"Unlock all Chapters 04–12","premiumFeature2":"Unlimited AI learning coach questions","premiumFeature3":"Early access to new chapters","premiumMonthly":"month","premiumCTA":"Subscribe to Premium","premiumComingSoon":"Coming soon","premiumLogin":"Already subscribed?","premiumLoginLink":"Log in","premiumLoginFirst":"Sign in to subscribe to Premium.","freeChaptersNote":"Chapters 01–03 are free to use.","mlMseProblemPromptBinaryCrossEntropyLog2Y1":"Binary cross-entropy for one sample: $\\ell = -\\big( y \\log_2 \\hat p + (1-y) \\log_2(1-\\hat p) \\big)$. Given $y=1$ and $\\hat p = {pFrac}$, find $\\ell$ as an integer. Hint: {logHint}","mlMseProblemPromptBinaryCrossEntropyLog2Y0":"Binary cross-entropy for one sample: $\\ell = -\\big( y \\log_2 \\hat p + (1-y) \\log_2(1-\\hat p) \\big)$. Given $y=0$ and $1-\\hat p = {pFrac}$, find $\\ell$ as an integer. Hint: {logHint}"},"playground":{"title":"Mini neural network playground","configTitle":"Model settings","inputNodes":"Input nodes","hiddenNeurons":"Hidden layer neurons","activation":"Activation","createModel":"Create model","inputTarget":"Input and target","runForward":"Run forward","forwardSteps":"Forward steps","training":"Training","oneStep":"One step","epochs50":"50 epochs","weightsAndGradients":"Weights and gradients","linkFromProblem":"How this computation is used in the network","fromDotBanner":"Linked to the dot product exercise. The first neuron in the model below computes the dot product of input and weights. 
Run Forward to see.","inputXLabel":"Input X (comma-separated)","targetLabel":"Target (comma-separated)","trainingInProgress":"Training…","weightsW1":"W₁ (hidden layer weights)","weightsW2":"W₂ (output layer weights)","gradientsDW1":"dW₁ (gradient)","gradientsDW2":"dW₂ (gradient)","createModelHint":"Select settings above and click \"Create model\".","lossGraphEmpty":"Run training to see the loss per epoch.","lossGraphTitle":"Loss per epoch","epochLabel":"Epoch","lastLossLabel":"Last loss: {value} ({count} epochs total)"},"tinyNN":{"batchPhase0":"Samples 1, 2, 3 are separate.","batchPhase1":"When we merge them into one table → we compute with the same W, b at once.","batchPhase2":"The same W, b applies at once to every column (sample).","batchPhase3":"So output Y also comes out as one table at once.","batchInputSeparate":"Input (samples separate)","batchInputTable":"Input table X","batchSample1":"Sample 1","batchSample2":"Sample 2","batchSample3":"Sample 3","batchOneColOneSample":"One column = one sample","batchMergeHint":"Merging makes one table","batchSameWb":"Same W, b","batchComputeOnce":"Compute at once","batchResultY":"Output Y","batchResultCaption":"← Result from same W, b at once","batchFooter1":"Stacking samples into one matrix lets us compute with the same W, b at once.","batchFooter2":"So when we merge inputs into one table, output Y also comes out as one table at once.","batchFooter3":"One table goes through the same W, b. Only the input differs per column; the rule (W, b) is the same.","connDescription":"Each line between layers is a weight (w). Multiply input by weights, add them, then add bias (b) to get the next layer Y.","connWeightLabel":"weight(w)","connBiasLabel":"+bias(b)","connFooter":"Circles are values, lines are weights (w). Add bias (b) to the weighted sum to get the next layer Y.","hiddenDescription":"We only see input (X) and output (Y). 
The layer in between is used only inside the network, so it’s the hidden layer.","hiddenVisibleInput":"Visible: input","hiddenHiddenH":"Hidden: H","hiddenVisibleOutput":"Visible: output","hiddenBoxLabel":"Hidden layer (not visible from outside)","hiddenFooter":"Values flow input → hidden → output. The hidden layer is an internal representation we don’t see.","deepDescription":"Deep = many hidden layers (middle steps). The “deep” in deep learning is this depth.","deepLayerN":"Layer {n}","deepFooter":"More steps mean a deeper network. Deeper networks can learn more refined patterns.","wideWidthN":"Width {count}","wideNeuronsN":"{count} neurons","wideFooter":"The number of neurons in one layer is the width. Wider layers can handle more features at once.","softmaxScoreToProb":"Score → probability","softmaxExample":"(example: e ≈ 3)","softmaxScore":"Score","softmaxMid":"Mid","softmaxPowerOf3":"3 to the power","softmaxProb":"Probability","softmaxDivideBySum":"Divide by sum","softmaxRaise":"raised to","softmaxPowerLabel":"(3^{n})","activationDescription":"Representative activation functions where output Y changes nonlinearly with input X. (3-level quantized version)","activationSigmoid":"Sigmoid(X)","activationRelu":"ReLU(X)","activationTanh":"Tanh₃(X)","hiddenLayer1Formula":"W₁·X+b₁ → ReLU","hiddenLayer2Formula":"W₂·H+b₂ → ReLU","captionDotProduct":"Left X1,X2,X3 and right Y1,Y2,Y3 are connected by lines. Each right node is the dot product of the left with weights.","captionMatrixMul":"Left is one row of matrix A; right Y1–Y3 are dot products with columns of B. Together they form the matrix product A·B.","captionLinearLayer":"This block is a linear layer. Input is computed to the next layer at once as Y = W·X + b.","captionActivation":"Node values change in a nonlinear way through ReLU or σ. The last layer Y1, Y2, Y3 come from that.","captionArtificialNeuron":"Inside the dashed circle is one artificial neuron. 
Input (X) times weights (w·x+b), then ReLU, gives output (Y).","captionBatch":"One column = one sample. The same W, b is applied to all columns at once to compute Y = W·X + b.","captionConnection":"Lines between layers are weights (w). Values flow along these lines to the next layer.","captionHidden":"We only see input (X) and output (Y). The layer H in between is used only inside the network, so it’s the hidden layer. Data flows input → hidden → output.","captionDeep":"Deep means many hidden (middle) layers. More steps like X→A→B→C→…→Y mean deeper; deeper networks learn more refined patterns.","captionWide":"The number of neurons in one layer is the width. 1 neuron = 1 feature, 256 = 256 at once. Width can differ per layer (e.g. 1→2→4→8 or 256→128→64).","captionSoftmax":"Softmax divides so the last layer Y1,Y2,Y3 sum to 1. You can treat them as probabilities.","captionGradient":"Gradient (∇) flows from right to left, updating each layer a bit to reduce loss.","captionSummary":"Ch01–Ch12 in one network: forward, backward, weights, activation, gradient all in one picture.","labelWeightedSum":"Weighted sum","labelWeightBias":"Weight·input+bias","labelWeight":"Weight","labelProbSum":"(probability, sum=1)","labelResult":"Result","labelMatrixResult":"Matrix product result","labelNeuron":"Neuron"},"categories":{"math":{"title":"Foundations","navTitle":"Math"},"midMath":{"title":"Intermediate Math"},"advMath":{"title":"Advanced Math"},"dl":{"title":"Basic Deep Learning","navTitle":"Deep learning"},"midDl":{"title":"Intermediate Deep Learning"},"advDl":{"title":"Advanced Deep Learning"},"ml":{"title":"Basic Machine Learning","navTitle":"Machine learning"},"midMl":{"title":"Intermediate Machine Learning"},"advMl":{"title":"Advanced Machine Learning"},"comingSoon":"Coming soon","preparing":"(Coming soon)","completed":"completed"},"concepts":{"sectionLabels":{"whatIs":"What it is","whyImportant":"Why it matters in deep learning","howUsed":"How it is 
used","problemSolving":"Tips for solving the problems"},"intro":{"sectionTitle":"What is Deep Learning?","whatIs":["**Deep learning is like a smart calculator that learns by itself** — Instead of humans defining every rule one by one, it's a way for computers to find rules on their own by looking at huge amounts of data. Inspired by **neurons** in the brain exchanging signals, small computing units are stacked in **many layers (Layer)**, which is why we call it **deep** learning.","**Deep learning is everywhere in our lives** — From conversational AI you use every day like **ChatGPT** and **Gemini**, to **self-driving cars** that read the road with cameras, to **Netflix and YouTube recommendation systems** that know your taste better than you do—they're all products of deep learning. The core idea is turning complex images and sounds into **numbers**, then adding and multiplying those numbers to find the right answer.","**You need the basics to build more powerful AI** — Beyond just using ready-made models, knowing the **basic math** that happens inside is important if you want to adapt and use models for your own goals. When you understand how numbers are grouped and computed, you can clearly see why an AI made a certain decision and tune it for better performance.","**What one layer in deep learning does** — Each layer multiplies the incoming numbers by **weights** (importance) and adds them, then passes the result to the next layer. As layers get deeper, the AI goes from dots and lines in the data to eyes, nose, mouth, and finally **high-level features** like dog vs. cat. The guide for adjusting those weights precisely toward the right answer is **gradient**.","**This course's learning roadmap** — Deep learning is ultimately an efficient repetition of multiplication and addition. 
You'll learn the basics of how data moves through **Ch01 dot product** and **Ch02 matrix multiplication**, go through **Ch03–05 artificial neurons and activation functions**, and grasp **Ch06–10 the structure of deep and wide neural networks**. Finally, in **Ch11–12**, you'll work step by step through the core idea behind how AI learns by itself: the gradient.","Follow the **roadmap** below to see what each chapter aims for. If you follow along step by step, you'll be able to read the mathematical language that state-of-the-art AI systems use internally."],"whyImportant":[],"howUsed":[],"problemSolving":[]},"dotProduct":{"sectionTitle":"Dot product in deep learning","whatIs":["The **dot product** multiplies **same-position elements** of two vectors and sums the results into a single number. For example, [2, 3] · [4, 1] = 2×4 + 3×1 = 11.","It also measures **how aligned** two vectors are: a large positive dot product means **similar direction**, zero means **perpendicular (unrelated)**, and negative means **opposite direction**. That's why it's great for measuring similarity.","In formula form: **a · b = a₁b₁ + a₂b₂ + … + aₙbₙ**. Both vectors must have the **same number of elements** for the dot product to work.","In real AI systems, dot products are computed between vectors with **hundreds or thousands of dimensions**. Computers do this instantly, so we can compare “how similar two texts are” or “how well an image matches a caption” with **one number**."],"whyImportant":["In deep learning, **one neuron's output is computed as a dot product** between its weights and the input. Multiply same-position values and sum them up—that gives the neuron's \"response score\" for that input.","The dot product is the **most fundamental operation** in deep learning because **matrix multiplication is just many dot products bundled together**. 
Linear layers, attention, embedding comparison—all rely on repeated dot products.","It also serves as a **similarity measure**: for example, a recommender like Netflix's can take the dot product of a user vector and a movie vector to get a \"match score.\" Dividing the dot product by the lengths of the two vectors gives the closely related **cosine similarity**."],"howUsed":["**Recommendation systems (Netflix, YouTube)**: Compute the dot product of a user vector and a content vector to get a \"how much this user would like this content\" score. Higher score = higher recommendation rank.","**Search engines & chatbots**: Convert queries and documents to vectors, then rank by dot product (similarity). ChatGPT uses the same principle when finding the most relevant information for your question.","**Attention mechanism**: In translators and chatbots, word vectors are dotted to compute \"relevance scores\"—the model focuses more on words with high scores.","**Translation & summarization**: The model compares the current token to others with dot products to get relevance scores—this is how it decides **which words to attend to** in context."],"problemSolving":["**How to compute**: Multiply **same-position elements**, then add all the products. Example: [1, 2, 3] · [4, 5, 6] = 1×4 + 2×5 + 3×6 = 4 + 10 + 18 = 32.","**Finding a blank**: If the total dot product and the other products are given, sum the known products first, then subtract from the total to get the missing product. Divide by the known element to find the blank.","**Watch out**: Both vectors must have the **same number of elements**. Also, make sure to include **every** pair of elements—checking off each pair one by one helps avoid mistakes.","**Double-check**: Missing one pair changes the sum. 
After forming all products, **add them twice** or add in a fixed order to catch slips."],"paragraphs":["The **dot product** is the sum of **element-wise products** of two vectors: a·b = a₁b₁ + a₂b₂ + ….","In deep learning, one step of a linear transform is a **weight vector** dotted with an **input vector**, giving one **neuron**'s output. With many neurons, a **weight matrix** times the input (**matrix multiplication**) computes them at once; each entry is one dot product.","A larger dot product also means the two vectors are more **aligned**, so it is used for **attention**, **similarity**, and **embedding comparison**—measuring how similar two things are with a single number."]},"matrixMul":{"sectionTitle":"Matrix multiplication in deep learning","whatIs":["**Matrix multiplication** combines two number tables (matrices) into a new one. Take **one row** of the first matrix and **one column** of the second, compute their **dot product**, and that fills **one entry** in the result.","Repeat this for **every row-column combination** and the result matrix is complete. For example, a 2×3 matrix times a 3×2 matrix gives a 2×2 result.","The rule for it to work: the **number of columns** of the first matrix must equal the **number of rows** of the second. Remember this, and you can always tell whether two matrices can be multiplied.","**Why matrices?** Packing many samples into one matrix lets the **GPU** run one matrix multiply for many inputs at once—essential for batches of images, sentences, or users."],"whyImportant":["A **linear layer** in deep learning multiplies the input by a weight matrix—that's matrix multiplication. If you have 10 neurons, you'd need 10 dot products; matrix multiplication does **all 10 at once**.","**GPUs** are specifically designed to do **thousands of matrix multiplications in parallel**. 
This is why millions of multiplications finish instantly, enabling real-time image recognition and chatbots.","**Nearly every operation** in deep learning boils down to matrix multiplication—attention, convolution, recurrent networks. Understanding matrix multiplication means understanding the backbone of deep learning."],"howUsed":["**Image recognition**: Pixel values are arranged in a matrix, multiplied by weight matrices to extract features like 'is there a dog or a cat?' This repeats across many layers.","**Chatbots & translators**: ChatGPT and Google Translate convert sentences into number matrices, then multiply by huge weight matrices dozens to hundreds of times to generate answers. Matrix multiplication accounts for most of the computation.","**Recommendations & self-driving**: Netflix computing recommendation scores for thousands of users at once, and a self-driving car recognizing obstacles from camera frames—both rely on large-scale matrix multiplication inside.","**Batch scoring**: User–item or query–document vectors are stacked; **one matrix multiply** scores many pairs at once."],"problemSolving":["**Finding one entry**: Entry **(i, j)** of the result = dot product of **row i of A** and **column j of B**. Multiply same-position elements and sum.","**Blank strategy**: If the blank is in the result, just compute the dot product for that row and column. If the blank is in A or B, use the known result and other values to work backwards.","**Check dimensions**: Before multiplying, verify that A's **column count** equals B's **row count**. The result size is (A's rows) × (B's columns).","**Double-check**: One wrong entry poisons the whole row/column. 
Compute **one full row or column** first, then match the rest."],"paragraphs":["**Matrix multiplication** fills each entry of the result by taking the **dot product** of **each row** of the first matrix and **each column** of the second.","A **linear layer** in deep learning multiplies the input by a **weight matrix** and adds a **bias**; that multiplication is **matrix multiplication**. (m neurons, n inputs → m×n matrix times n-dimensional input → m outputs.)","**GPUs** are optimized for massive **parallel** matrix multiplication, so most of deep learning is **matrix multiplication**."]},"linearLayer":{"sectionTitle":"Linear layer in deep learning","whatIs":["A **linear layer** multiplies the input by **weights (W)** and adds a **bias (b)** to produce output: **Y = W·X + b**. The W·X part is matrix multiplication, and b shifts the baseline up or down.","Think of it like a grading formula: 'math×0.3 + science×0.5 + English×0.2 + 10'. Here 0.3, 0.5, 0.2 are **weights (W)**, 10 is **bias (b)**, and the subject scores are **input (X)**.","A single linear layer decides **'how much to scale each input and how much to add.'** With multiple outputs, each output uses different weights and bias, computing many scores at once.","**Why 'linear'?** Doubling the input roughly doubles the output (before activation)—a **linear** map. Linear stacks alone cannot draw arbitrary curves, so we **always pair linear layers with nonlinear activations**."],"whyImportant":["**Almost every deep learning model** uses linear layers as basic building blocks. ChatGPT, translators, and image classifiers all repeat 'W·X + b' hundreds to thousands of times. It's the **brick** of deep learning.","**Model size (parameter count)** is determined by 'how many inputs × how many outputs' for each linear layer. This size controls how complex things the model can learn (**capacity**) vs. 
the risk of **overfitting** (just memorizing training data).","However, stacking linear layers alone is equivalent to **one linear operation** (only straight lines). That's why an **activation function** (a bending function) is always added after each linear layer to enable **curves and complex patterns**."],"howUsed":["**ChatGPT & translators**: Sentences are converted to number vectors, then passed through dozens to hundreds of linear layers, each computing W·X + b followed by an activation, to understand context and generate answers.","**Image recognition**: Feature vectors from photos are fed into linear layers to compute 'dog score,' 'cat score,' 'bird score' simultaneously. The final linear layer's outputs become per-class scores.","**Recommendation systems**: User info and product info are combined into a vector, fed through linear layers to get a 'how much this user would like this product' score. More layers allow finer recommendations.","**Small devices**: Mobile models may use **narrower** linear layers to cut parameters while keeping the same W·X + b pattern."],"problemSolving":["**One formula**: Multiply input **X** by **weight matrix W** and add **bias b** to get output **Y**. So **Y = W·X + b**. Linear layer problems give you **X, W, b** and ask for **Y**, as in the purple box below.","**Numeric example**: With X = [2, 1], W = [[1,0],[1,1]], b = [1, -1], we get W·X = (2, 3). Adding bias b gives **Y = (2+1, 3-1) = [3, 2]**. The bias shifts each output up or down. Each entry of **Y** is the **dot product** of the corresponding **row of W** with **X**, plus the corresponding entry of **b**.","**Blank strategy**: If the blank is in **Y**, compute that row's W·X + b. If the blank is in **W** or **b**, use the known Y and X and rearrange the equation. 
Then **verify** by plugging back into Y = W·X + b."],"paragraphs":["A **linear layer** computes y = Wx + b: multiply input x by **weight matrix** W and add **bias** b.","Each output **neuron** is one **dot product** of its weight row with the full input. So **dot product** and **matrix multiplication** are the building blocks of linear layers.","Linear maps alone cannot express **nonlinear** functions well, so linear layers are usually followed by an **activation function**."]},"activation":{"sectionTitle":"Activation in deep learning","whatIs":["An **activation function** transforms a neuron's raw output (weighted sum) into a **specific range or shape**. The most common ones are **ReLU** (negative → 0, positive → unchanged), **Sigmoid** (compresses to 0–1), and **Tanh** (compresses to −1 to 1).","Think of it like a **faucet**: when water (signal) comes in, it either 'only lets through above a threshold (ReLU)' or 'reduces the flow if it's too strong (Sigmoid, Tanh).' This transformation makes the output suitable for the next layer.","**ReLU** is the most popular because it's simple to compute (keep if positive, zero if negative) and trains fast. **Sigmoid** is used when you need probability-like outputs, and **Tanh** when you want values centered around zero.","**GELU / SiLU** are smoother ReLU variants common in modern transformers and generators; choice of activation affects **training dynamics** and sometimes accuracy."],"whyImportant":["**No matter how many multiply-and-add (linear) operations you stack, the result is the same as one multiply-and-add.** Just as connecting straight lines only gives you a straight line, linear operations alone can **never represent curves or complex patterns**.","Activation functions add **bends (nonlinearity)**. 
These bends allow stacked layers to create **curves and complex boundaries**, enabling the model to learn patterns in images, speech, and text.","Without activation functions, no matter how deep the network, it can only do **what a single line could do**. Activations are the **essential ingredient** that makes deep learning 'deep.'"],"howUsed":["**Image recognition**: After computing W·X + b at each layer, **ReLU** clips irrelevant features (negatives to zero) and passes relevant ones (positives) to the next layer, progressively extracting 'eyes,' 'ears,' 'wheels,' etc.","**Chatbots & translators**: Hidden layers use **ReLU** or **GELU** (a smoother version) for nonlinearity; the final layer uses **Sigmoid** (yes/no decisions) or **Softmax** (choosing among multiple candidates) to produce the answer.","**Speech recognition & self-driving**: Sound waves or camera images are converted to numbers, then passed through many linear + activation layers to determine 'what word is this' or 'what object is that.' Without activation, such complex decisions would be impossible.","**Image generation**: Denoising networks apply linear layers plus **ReLU/SiLU** at each step to predict pixel updates."],"problemSolving":["Find X's interval in the table; that gives Y.","Function | Rule","ReLU | 0 or less → 0; positive → same as X","Sigmoid | Small → 0, middle → 0.5, large → 1","Tanh₃ | Small → -1, middle → 0, large → 1","Note | Check the problem's table for boundaries."],"paragraphs":["An **activation function** makes a neuron's linear output (**weighted sum**) **nonlinear**. **ReLU**, **sigmoid**, and **tanh** are common.","Stacking only **linear layers** is equivalent to one big linear map. **Nonlinear** activations between layers are needed for **deep networks** to learn complex patterns.","Choosing where and which **activation** to use is a key **design decision** in deep learning."],"problemDiagramCaption":"Node values change in a squiggly way through ReLU or σ. 
The final layer Y1, Y2, Y3 come out that way.","solutionIntro":"For activation problems, Y is determined by which interval X falls into. Below is how to solve ReLU, Sigmoid, and Tanh₃ problems.","solutionRelu":"**ReLU**: X ≤ 0 → Y = 0, X > 0 → Y = X. If Y is blank, just check the sign of X.","solutionSigmoid":"**Sigmoid**: X < -1.5 → 0, -1.5~1.5 → 0.5, X > 1.5 → 1. Find X's interval from the table/graph and use the corresponding Y. Check the problem's table for boundaries.","solutionTanh":"**Tanh₃**: X ≤ -1 → -1, -1 < X < 1 → 0, X ≥ 1 → 1. Find X's interval from the table and fill Y (-1, 0, or 1). For boundary values, check which side the problem uses.","solutionCaption":"Interval boundaries may differ per problem; always check the table (or graph) given in the problem first."},"artificialNeuron":{"sectionTitle":"Artificial neuron in deep learning","whatIs":["An **artificial neuron** is the **smallest computational unit** of deep learning. It does exactly two steps: ① compute the **weighted sum** Z = W·X + b, ② apply an **activation function** Y = ReLU(Z) or Sigmoid(Z).","It's inspired by biological neurons: real neurons receive multiple signals, weight each one differently, sum them up, and fire if the total exceeds a threshold. The artificial neuron is a **mathematical simplification** of this process.","Summary: **Input (X)** → **Weight and bias (Z = W·X + b)** → **Activation (Y = f(Z))** → **Output (Y)**. That's everything an artificial neuron does.","**One neuron’s output** becomes input to many neurons in the next layer; **billions** of such units power modern vision and language models."],"whyImportant":["AI models like ChatGPT, image classifiers, and recommendation systems are built by **connecting thousands to billions of these neurons**. Understand one neuron, and you can **read the entire model's behavior**.","**Training** means gradually adjusting each neuron's **weights (W) and bias (b)** so the output gets closer to the correct answer. 
Knowing how W and b affect the output is key to understanding learning.","A single neuron combines **dot product + bias + activation**, so it unifies everything from the previous chapters: **dot product, matrix multiplication, linear layer, and activation function** all come together here."],"howUsed":["**Real-life analogy—exam pass prediction**: Compute 'Math×0.4 + Science×0.4 + English×0.2 + 5 = 75' (weighted sum), then 'if ≥60 → pass (1), else fail (0)' (activation). That's exactly one neuron's operation.","**One neuron in image recognition**: It takes a specific region of pixels, computes weighted sum + bias, passes through ReLU to get a 'is there a horizontal line here?' score. Thousands of such neurons together can determine 'dog or cat.'","**Chatbots, translators, speech recognition**: Each part of a sentence or sound is converted to numbers, neurons score 'what patterns are present,' and those scores flow to the next layer's neurons to grasp increasingly complex meaning."],"problemSolving":["**Step 1—Weighted sum (Z)**: Compute Z = W·X + b. Dot product W's row with X, then add b. If the blank is in Z, fill it at this step.","**Step 2—Activation (Y)**: Apply the given activation to Z. **ReLU**: Y = Z if Z > 0, Y = 0 if Z ≤ 0. **Sigmoid**: check the table to see which interval Z falls in.","**Blank in W or b**: If Y and X are given, reverse the activation to find Z first, then solve Z = W·X + b for the blank. 
The key is to **work backwards one step at a time**."],"paragraphs":["An **artificial neuron** takes **weighted** inputs (**weighted sum**), then applies an **activation function** to produce one output.","The weighted sum is a **dot product** of the input and weight vectors; then a **nonlinear** activation is applied.","**Deep learning models** chain many such **neurons** to transform input to output in multiple stages."]},"batch":{"sectionTitle":"Batch in deep learning","whatIs":["A **batch** means **grouping multiple inputs (samples) into one table (matrix) and computing them all at once with the same weights**. Each **column = one sample** in the table.","Imagine a teacher grading tests **one by one** vs. feeding **30 tests into a grading machine** at once—the machine is much faster. Batching works the same way: the GPU processes many inputs **simultaneously**.","Key idea: the **same W (weights) and b (bias)** are applied to all samples. The only thing that differs per sample is the **input X**. That's why one matrix multiplication can compute results for many samples at once.","**Mini-batches**: Training splits data into chunks (e.g. 32–128 samples), runs forward/backward on each chunk, and updates weights—balancing **memory**, **speed**, and **gradient noise**."],"whyImportant":["**Speed**: GPUs are optimized for processing **thousands of numbers simultaneously** rather than one at a time. Batching lets you use the GPU's full power, computing **tens to hundreds of times faster** than one-by-one.","**Training stability**: Updating weights based on just 1 sample is **noisy**. Using a **mini-batch** (e.g., 32 or 64 samples) averages the gradients for much more **stable** learning. Batch size is a critical training setting.","**Memory management**: With 1 million data points, you can't fit them all at once (GPU memory!). 
So you split into **mini-batches** (e.g., 64 at a time), process each batch, update weights, and repeat."],"howUsed":["**Netflix & YouTube recommendations**: Instead of computing for one user at a time, **thousands of users' data are batched** for simultaneous scoring. This enables real-time service.","**ChatGPT & translators**: When many users ask questions at the same time, their queries are **batched together** for one GPU pass. That's how millions of users get fast responses simultaneously.","**Image training**: When training on 100,000 images, they're split into mini-batches of 32, running 3,125 iterations. Each mini-batch computes Z = W·X + b, measures error (loss), and slightly adjusts weights.","**Parallel inference**: Many inputs (images, tokens, users) are evaluated together in one batch for throughput."],"problemSolving":["**X has multiple columns**: Each column is one sample. Use the **same W and b** for each column. Find which row and column the blank is in, and use **only that column's numbers** to compute.","**Add/subtract/multiply/mean operations**: These apply to **same positions (same row, same column)**. For mean (e.g., zero-centering), compute the average **per column**. Use only that column's values for the blank.","**Verification tip**: Each column is independent—one column's result doesn't affect another. **Check each column separately** to catch mistakes easily."],"paragraphs":["**Batching** means grouping several **samples** into one **matrix** and computing with the same **weights** at once.","One **matrix operation** over many samples uses the **GPU** much better than processing one sample at a time.","Training usually computes **gradients** and **updates** weights per **mini-batch**."]},"connection":{"sectionTitle":"Connection in deep learning","whatIs":["A **connection** describes **how neurons in one layer link to neurons in the next layer**. 
Each connection has a **weight (number)** that determines 'how much this input affects this output.'","**Fully connected**: **Every** neuron in the previous layer connects to **every** neuron in the next. The linear layer (Y = W·X + b) we've learned is exactly a fully connected layer—every entry in W has a number.","**Partially connected**: Some entries in W are **zero**, meaning 'no connection.' That input has **no effect** on that output. CNNs, which connect only nearby pixels, are a classic example of partial connections.","**More connections** mean more capacity but also **more compute and memory**; mobile models **prune or quantize** connections to stay efficient."],"whyImportant":["**Connection structure defines the model's character.** Fully connected considers all inputs (more information but more parameters), while partial connections only look at what's needed (efficient and fast but may miss some information).","**AI training is the process of adjusting connection strengths (weights).** 'Make this connection stronger, that one weaker'—gradually adjusting to produce outputs closer to the correct answer. Large models have billions of such connections.","**Looking at where W is zero** reveals what the model ignores. After training, connections with near-zero weights indicate 'unimportant information.' This is used in **pruning** to make models lighter."],"howUsed":["**Image recognition (CNN)**: Uses **partial connections** where only nearby pixels connect. Distant pixels are less relevant, so this reduces parameters and is faster and more efficient.","**Chatbots & translators (Transformer)**: **Attention** determines 'which words relate to which other words'—it learns which connections to strengthen **dynamically** from the data.","**Recommendation & speech recognition**: The weights connecting user features to product features directly become recommendation scores. 
In speech recognition, the model learns how each sound frequency connects to the next layer's features."],"problemSolving":["**W = 0 means no connection**: For example, if W(2,1) = 0, the 1st input has **zero effect** on the 2nd output. You can **skip it** entirely in the calculation.","**Finding one output**: Find which inputs **are connected** (W ≠ 0) to that output, multiply W · X for those positions only, sum them, and add b. Zero entries multiply to zero, so skipping them gives the same result.","**Blank strategy**: First, **identify the zero entries in W**. Then set up equations using only the non-zero connections. If the blank is in W, use Y and X to reverse-calculate; if it's in Y, compute forward from W and X."],"paragraphs":["**Connection** is the structure that shows **how neurons** in one **layer** link to neurons in the next layer.","Networks are often described as **fully connected**, **partially connected**, or **recurrent**. In **fully connected** layers, every neuron in one layer links to every neuron in the next (e.g. a **Linear layer**). **Partially connected** means only some neurons link to the next layer (e.g. in CNNs, filters connect only some inputs). **Recurrent** connections feed output back into the same or earlier step.","Each link has a **weight** that scales the signal. Entry (i,j) of matrix W is the strength from input j to output neuron i; these **weights** are learned.","In deep learning there can be millions to billions of connection weights. In Y = W·X + b, a zero in W means **no connection** from that input to that output (partial connection)."]},"hidden":{"sectionTitle":"Hidden layers in deep learning","whatIs":["A **hidden layer** is an **intermediate stage between input and output**. Users only see the input (e.g., a photo) and output (e.g., 'dog'), but in between, hidden layers create **'hidden features.'**","The flow is: **X → Linear(W·X+b) → ReLU → H (hidden representation) → Linear(W·H+b) → ReLU → Y (output)**. 
H is the hidden layer's result, containing compressed 'key features' of the input.","**Analogy**: When you see a photo and say 'dog,' your brain goes through 'colors → edges → eyes/nose/ears → dog!' These **intermediate thinking steps** are the hidden layers. The number of neurons (width) in the hidden layer determines how many different features it can capture.","A **wider** hidden layer can hold **more diverse features** at once; **deeper** stacks of layers can learn **more abstract** concepts."],"whyImportant":["Hidden layers **progressively summarize and transform** input data. **Early layers** capture simple features (brightness, edges), **later layers** capture complex features (eyes, wheels, letters).","**Without hidden layers**, the model maps input directly to output, only expressing very simple (linear) relationships. **With hidden layers**, it can learn complex relationships (curves, multi-condition combinations).","The **number of neurons (width)** and **number of layers (depth)** determine the model's **representational power**. Too small = information bottleneck and poor performance; too large = overfitting (memorizing instead of learning)."],"howUsed":["**Image recognition**: The stages 'pixels → edges → textures → object parts (eyes, wheels) → whole objects (dog, car)' are all hidden layers. Deeper layers extract more abstract features.","**Chatbots & translators**: After converting text to numbers, multiple hidden layers progressively refine 'word meaning → sentence context → answer direction.' ChatGPT passes through dozens of hidden layers (Transformer blocks) to generate responses.","**Speech recognition**: The transformation 'sound wave → frequency features → phonemes → words → sentences' goes through hidden layers at each stage."],"problemSolving":["**Compute in order**: X → (W·X+b) → ReLU → H → (W·H+b) → ReLU → Y. Compute each step **sequentially**. If the blank is in H, compute only through the first linear+ReLU. 
If in Y, compute H first then the second stage.","**ReLU caution**: When the linear result (W·input+b) is **negative, ReLU turns it to 0**. In the next layer, that value is 0, so that term **contributes nothing**—you can ignore it entirely. This is a frequent key point in hidden layer problems.","**Blank in W or b**: Hidden layer problems have **two stages** (two linear+activation). First identify which stage the blank is in. If you know the input and output of that stage, solve for the blank using that stage's equation alone."],"paragraphs":["**Hidden layers** sit between **input** and **output** layers and learn internal **representations** not directly observed.","They gradually transform input into **higher-level features**; **lower layers** capture simple patterns, **higher layers** more abstract ones.","The number of **neurons** and **layers** in hidden layers is a key factor in model **capacity**."]},"deep":{"sectionTitle":"Depth in deep learning","whatIs":["**Deep** means having **many hidden layers (intermediate stages)**. The **'deep'** in **deep learning** refers exactly to this depth! Each layer does linear (W·input+b) + activation (ReLU), then passes the result to the next layer.","**X → A → B → C → … → Y**—the more stages, the deeper. Analogy: with **1 stage** you can only 'draw a line,' with **10 stages** you can 'draw simple shapes,' and with **100 stages** you can 'draw a human face.' More depth = **more precise, complex patterns**.","But deeper isn't always better. 
Too many layers can cause **vanishing gradients** (learning signals don't reach early layers) or **overfitting** (memorizing training data instead of learning general patterns).","**Image generation** models run a **deep** denoising network repeatedly over many steps; **translation and chat** models stack many blocks into **deep** architectures."],"whyImportant":["**More layers enable more complex functions.** Each layer's activation adds 'bends,' and stacking layers **combines many bends** into very complex curves and decision boundaries.","In image recognition, roughly: **layers 1–2** learn 'lines, edges,' **layers 3–5** learn 'eyes, noses, wheels,' **layers 6+** learn 'dogs, cars.' This is possible because of **depth**.","Famous architectures like **ResNet** and **Transformer** can be **dozens to hundreds of layers** deep and still train well. The secret is **skip connections (residual connections)**: gradients can skip layers and flow directly to earlier layers. These techniques overcome the 'limits of depth.'"],"howUsed":["**ChatGPT**: GPT-style models stack **dozens to roughly a hundred** Transformer blocks (GPT-3, for example, has 96). Each block refines the representation of context, and the final layer generates the answer.","**Self-driving cars**: Camera images go through **deep networks** (e.g., ResNet-152, 152 layers!) to accurately distinguish obstacles, lane markings, and signs through many stages. Depth enables handling complex road situations.","**Speech recognition & translation**: Converting speech to text, or Korean to English, also goes through **deep networks** where each layer progressively captures 'phonemes → words → context → meaning.'","**Speech & translation (detail)**: Deep nets map sound or text through layers from low-level units up to **meaning**, the same depth pattern as vision."],"problemSolving":["**Example**: Input X = [3, 1, 2]. Layer 1: W₁·X+b₁ = [4, -1, 2] (linear), then ReLU gives A = [4, 0, 2]. Layer 2: W₂·A+b₂ = [2, 1, 5], ReLU gives B = [2, 1, 5]. 
If **A₂ is blank**?","**Solution**: The second entry of layer 1 linear output is -1, so ReLU(-1) = 0. So **A₂ = 0**. For a blank in a middle layer, compute that layer's **linear (W·input+b)** first, then apply **ReLU (negative → 0)**.","**In general**: Wherever the blank is, compute **all previous layers** in order to get that layer's input, then take the **dot product of the corresponding row of W with the input**, add the **bias entry**, and apply ReLU to get the answer."],"paragraphs":["**Deep** means having many **hidden layers**—many **layers** in the **network**. That is the 'deep' in **deep learning**.","More depth allows more stages of **nonlinear transformation** and more **complex functions**, but also **harder training**, **overfitting**, and **cost**.","Architectures like **ResNet** and **Transformer** help **train** very deep networks **stably** with **structural techniques**."]},"wide":{"sectionTitle":"Width in deep learning","whatIs":["**Width** refers to **how many neurons are in a single layer**. More neurons (wider) = the layer can **represent more features simultaneously**. For example, 1 neuron = 1 feature; 256 neurons = 256 features at once.","Analogy: if an **exam has 1 question**, you can only evaluate one skill; with **100 questions**, you can assess many abilities at once. Similarly, a wider layer **processes more diverse information** in one step.","Layers can have different widths. For example, '1 → 2 → 4 → 8' (widening) or '256 → 128 → 64' (narrowing) are both common designs, depending on the purpose.","**Large server models** may use **thousands** of hidden units per layer; **mobile** models shrink width to save compute and memory."],"whyImportant":["**Depth (number of layers)** and **width (neurons per layer)** together determine the model's **total size (parameter count)**. 
With the same number of parameters, you can choose '**deep and narrow**' or '**shallow and wide**', and this choice significantly affects performance.","Greater width means **more features processed simultaneously** per layer, but it also increases **computation and memory**. Too wide risks **overfitting** (memorizing training data).","In practice, designs that **change width inside a block** are popular. A ResNet **bottleneck** block makes the middle narrow (wide → narrow → wide) to save computation, while a Transformer's feed-forward block makes the middle wide (narrow → wide → narrow), so the **wide layer extracts key features** while the rest stays compressed."],"howUsed":["**Image recognition (CNN)**: The **channel count** (number of feature maps) at each layer is its width. Starting from 3 channels (RGB), deeper layers grow to 64 → 128 → 256 → 512 channels, extracting **increasingly diverse features**.","**Chatbots & translators (Transformer)**: The **hidden dimension** (e.g., 768, 1024, 4096) is how many values each layer processes at once (its width). Large language models use hidden dimensions in the thousands or more (GPT-3's largest version uses 12,288), which is very wide.","**Recommendation systems**: A 'user vector of 256 dimensions' means width 256. It holds 256 features (age, preferences, watch history, etc. transformed into numbers), enabling more detailed recommendations."],"problemSolving":["**Same formula per layer even when widening**: Linear (W·input+b) → ReLU. Find which layer and neuron the blank belongs to, then use **that layer's input** and **the corresponding row of W and entry of b** to compute.","**Watch W dimensions**: When width changes between layers, **W's size changes too**. W is (current width × previous width), so find the right **row** for the blank's neuron and dot it with the previous layer's output, plus b.","**Layer by layer**: Just like with depth problems, **compute previous layers' outputs first** before moving to the next. Don't forget ReLU (negative → 0) at each layer."],"paragraphs":["**Width** is the number of **neurons** (or **channels**) in one layer. 
A **wider layer** can represent more **features** at once.","How you balance **depth** (number of layers) and **width** (neurons per layer) affects **capacity** and **efficiency**. With the same **parameter count**, you can go deeper or wider.","In practice, **width** often varies per layer to add **capacity** where needed."]},"softmax":{"sectionTitle":"Softmax in deep learning","whatIs":["**Softmax** is a function that **converts multiple scores (numbers) into probabilities**. All values become **between 0 and 1**, and they **sum to exactly 1**. So you can read them as probabilities.","The formula is __SOFTMAX_FORMULA__. Because it uses **powers of e (≈2.718)**, the largest score gets **amplified significantly** while others shrink relatively. The gap between 1st and 2nd place becomes more pronounced.","Example: scores [3, 1, 0] → e³≈20.1, e¹≈2.7, e⁰=1 → sum ≈ 23.8 → probabilities → [0.84, 0.11, 0.04]. The score of 3 was only 3× larger than 1, but the probability is about 8× larger!","**Why exponentiate, then divide?** Exponentiating makes every score positive and **sharpens** differences so the most likely choice stands out clearly; dividing by the sum makes the values add up to 1."],"whyImportant":["Softmax is used at the **final layer of almost every classification model**. 'This photo is 70% dog, 25% cat, 5% bird' lets you see **per-class probabilities** and **how confident** the model is.","When combined with **cross-entropy loss** during training, the gradients work out **cleanly and stably**. The model naturally learns to 'increase the correct class probability and decrease the rest.'","Softmax's property of 'all positive values that sum to 1' exactly matches the definition of a **probability distribution**. This makes it the **most natural way** to convert scores to probabilities, both statistically and theoretically."],"howUsed":["**Image classification**: The model's final layer outputs scores (logits) like [5.2, 2.1, 0.8] for three classes. Softmax converts them to roughly [0.95, 0.04, 0.01], the **probabilities for each class**. 
The highest probability class is the final answer.","**Chatbots & translators**: When ChatGPT picks the next word, it scores every word in its vocabulary (tens of thousands!), converts to probabilities via softmax, and samples a word based on those probabilities. High-probability words appear often, but occasionally low-probability words are picked for variety.","**Attention mechanism**: In translators, relevance scores for 'which input words to focus on' are passed through softmax to become probabilities (weights). These weights create a **weighted average** of inputs that emphasizes the most relevant parts.","**Spam filters**: Compute spam vs. not-spam probabilities with softmax, then classify by the larger one."],"problemSolving":["**Computation order**: ① Compute __WEIGHTED_SUM_FORMULA__ (logits) ② Compute __SOFTMAX_EXP__ (problem uses __E_APPROX_3__) ③ Compute __SOFTMAX_SUM__ (sum) = add all __SOFTMAX_EXP__ values ④ __SOFTMAX_Y_DIV__ (divide each by the sum). Follow this order.","**Finding blanks**: If Y is blank, compute 'that __SOFTMAX_EXP_DIV_SUM__.' If __SOFTMAX_EXP__ is blank, compute '__Y_TIMES_SUM__.' If Z is blank, reverse from __SOFTMAX_EXP__. If __SOFTMAX_SUM__ is blank, just add all __SOFTMAX_EXP__ values.","**Verification**: After computing, check that all Y values are **between 0 and 1** and **sum to 1**. If not, there's a calculation error. 
Also confirm whether the problem uses __E_APPROX_3__ or __E_APPROX_2718__."],"paragraphs":{"0":"**Softmax** maps a vector to values in **(0,1)** that **sum to 1**, so it can be interpreted as a **probability distribution**.","1":"In **classification**, applying softmax to the last layer gives **class** **probabilities** and is typically used with **cross-entropy loss**.","2":"The formula is __SOFTMAX_FORMULA__; the **exponent** **amplifies** the largest value."}},"gradient":{"sectionTitle":"Gradient in deep learning","whatIs":["The **gradient** tells you **'if you change a weight (parameter) slightly, how much and in which direction does the loss (error) change.'** Think of it as a **compass** pointing toward 'which way to go to reduce error.'","**Analogy**: Imagine walking down a mountain blindfolded. You feel the **slope (gradient) under your feet** and step toward the downhill direction. Walking **opposite to the gradient** leads you to the valley (minimum loss). This is **gradient descent**.","**Backpropagation** passes gradients **from the output back toward the input, one layer at a time**. Using the **chain rule** from calculus, it efficiently computes the gradient for every weight in every layer **in one pass**.","**Forward pass** computes outputs from inputs; **backpropagation** sends gradients from the loss back toward inputs. Training alternates these two."],"whyImportant":["**AI training = looking at gradients and updating weights.** Without gradients, there's no way to know 'which direction to adjust,' making **learning impossible**. The gradient is the **heart** of deep learning training.","**Learning rate** controls 'how far to step each time.' Too large → overshoot the valley (diverge); too small → takes forever to arrive. 
Optimizers like **Adam** automatically **adjust the step size** based on gradient magnitude.","If gradients get **too large (gradient explosion)**, training becomes unstable; if they get **too small (gradient vanishing)**, early layers barely learn. Techniques like **gradient clipping**, **batch normalization**, and **skip connections** are used to prevent this."],"howUsed":["**Every trained AI model**: ChatGPT, image recognition, recommendation systems—**every model** computes gradients to update weights. Forward pass → compute loss → backward pass for gradients → update weights. Repeating these 4 steps millions of times is training.","**Forward and backward**: Forward computes Z = W·X going **forward**; backward propagates gradients dW, dX going **backward**. They always work as a pair.","**Fine-tuning**: When adapting ChatGPT for a specific use case, new data is used to compute gradients and slightly adjust weights. Thanks to gradients, a **pre-trained model** can quickly adapt to new purposes."],"problemSolving":["**Problem format**: The equation is either **forward Z = W·X** or **backward dZ = dW·X**. The blank (?) is **one entry of X** or **one entry of Z** (or **dZ**). W and dW are always fully given.","**Forward (Z = W·X)**: Each entry of Z = dot product of **one row of W** with **X**. If the blank is in **Z**, multiply that row of W by X and sum. If the blank is in **X**, use the other Z entries and rows of W to set up an equation and solve for that X entry.","**Backward (dZ = dW·X)**: **Same computation** as forward. Each entry of dZ = dot product of **one row of dW** with **X**. If the blank is in **dZ**, dot that row of dW with X. 
If the blank is in **X**, solve from the equation."],"paragraphs":["The **gradient** is the vector of **partial derivatives** of the **loss** with respect to each **parameter**—how much and in which **direction** the loss changes.","**Training** usually moves parameters in the **opposite direction** of the gradient (**gradient descent**). Gradients are computed efficiently by **backpropagation**.","**Learning rate**, **optimizer**, and **gradient clipping** are **key settings** for how gradients are used."]},"summary":{"sectionTitle":"Summary","whatIs":["The diagram below **collects everything from Ch01–Ch12** into **one network**: input X → hidden layers (A, B, C, D) → output Y, with **weights (W)**, **activation (ReLU, etc.)**, **batch**, and **gradient (∇)** shown.","Real training repeats **forward pass** (compute output) → **loss** → **backward pass** (gradients) → **update weights**. After this course you can follow that flow in the math."],"whyImportant":[],"howUsed":[],"problemSolving":[]}},"locale":{"ko":"Korean","ja":"Japanese","en":"English","zh":"Chinese (Simplified)"},"chapters":{"intro":{"chapter":"Chapter 00","title":"First steps in deep learning: How does AI think?","description":"Find out at a glance what deep learning is and what you'll learn in Ch01–Ch12."},"dotProduct":{"chapter":"Chapter 01","title":"Vector dot product: Finding similarity between data","description":"The most basic operation: multiplying two vectors' direction and magnitude into a single value."},"matrixMul":{"chapter":"Chapter 02","title":"Matrix multiplication: The magic of computing at once","description":"The product of two matrices is a new matrix filled with dot products of rows of the first and columns of the second."},"linearLayer":{"chapter":"Chapter 03","title":"Linear layer: Weights that decide importance","description":"Linear layer (or linear transformation layer). 
A layer that multiplies the input by a weight matrix and adds bias."},"activation":{"chapter":"Chapter 04","title":"Activation function: Adding judgment to AI","description":"Activation function. A function that makes a neuron's output nonlinear."},"artificialNeuron":{"chapter":"Chapter 05","title":"Artificial neuron: A unit that gathers information and sends signals","description":"Artificial neuron. A unit that takes input, computes a weighted sum, and applies an activation function."},"batch":{"chapter":"Chapter 06","title":"Batch processing: Learning together in one go","description":"Batch. A unit that processes multiple samples in one computation."},"connection":{"chapter":"Chapter 07","title":"Weight connections: The countless chains that build intelligence","description":"Connections. The weighted links between layers and between neurons."},"hidden":{"chapter":"Chapter 08","title":"Hidden layer: The invisible depth of thought","description":"Hidden. Layers between the input and output layers."},"deep":{"chapter":"Chapter 09","title":"Deep network: The power to solve more complex problems","description":"Depth. A network with many hidden layers is called a deep network."},"wide":{"chapter":"Chapter 10","title":"Width and neurons: Finding more features at once","description":"Width. A layer with many neurons is called a wide layer."},"softmax":{"chapter":"Chapter 11","title":"Softmax: Turning results into confidence","description":"Softmax (probability distribution). Transforms output so values are between 0 and 1 and sum to 1."},"gradient":{"chapter":"Chapter 12","title":"Gradient and backpropagation: Learning from mistakes","description":"Gradient. 
Tells which direction to move parameters to reduce loss."},"summary":{"chapter":"Chapter 13","title":"Summary: A map of AI at a glance","description":"You can see what you learned in Ch01–Ch12 in one neural network diagram."}},"midMathChapters":{"midMath00":{"chapter":"Chapter 00","title":"Intermediate Math and AI: Multivariable Space and Uncertainty"},"midMath01":{"chapter":"Chapter 01","title":"Vectors and Vector Space: Magnitude and Direction Beyond Scalars"},"midMath02":{"chapter":"Chapter 02","title":"Dot Product and Projection: Angle and Similarity Between Data"},"midMath03":{"chapter":"Chapter 03","title":"Matrices and Data: Structural Representation of Many Vectors"},"midMath04":{"chapter":"Chapter 04","title":"Matrix Multiplication and Linear Transformation: Math That Manipulates Space"},"midMath05":{"chapter":"Chapter 05","title":"Inverse and Determinant: Inverse of Transformation and Change in Volume"},"midMath06":{"chapter":"Chapter 06","title":"Linear Independence and Rank: Redundancy and Effective Dimension"},"midMath07":{"chapter":"Chapter 07","title":"Eigenvalues and Eigenvectors: Principal Axes Unchanged by Transformation"},"midMath08":{"chapter":"Chapter 08","title":"Directional Derivative and Gradient: Steepest Ascent in Multidimensional Space"},"midMath09":{"chapter":"Chapter 09","title":"Jacobian Matrix: First Derivatives of Multivariable Vector Functions"},"midMath10":{"chapter":"Chapter 10","title":"Hessian Matrix: Second Derivatives and Curvature of Surfaces"},"midMath11":{"chapter":"Chapter 11","title":"Taylor Series: Approximating Complex Functions with Polynomials"},"midMath12":{"chapter":"Chapter 12","title":"Convex Optimization: Conditions for Finding the Minimum"},"midMath13":{"chapter":"Chapter 13","title":"Conditional Probability and Dependence: Probabilistic Relations Between Variables"},"midMath14":{"chapter":"Chapter 14","title":"Bayes' Theorem: Updating Probability with Observed Data"},"midMath15":{"chapter":"Chapter 
15","title":"Covariance and Correlation: Measuring Linear Association Between Two Variables"},"midMath16":{"chapter":"Chapter 16","title":"Multivariate Normal Distribution: Joint Probability Model for Many Variables"},"midMath17":{"chapter":"Chapter 17","title":"Maximum Likelihood Estimation (MLE): Inferring Parameters from Observations"},"midMath18":{"chapter":"Chapter 18","title":"Entropy: Quantifying Uncertainty via Information Theory"},"midMath19":{"chapter":"Chapter 19","title":"Cross-Entropy and KL Divergence: Measuring Difference Between Two Distributions"},"midMath20":{"chapter":"Chapter 20","title":"Intermediate Math Summary: Linear Algebra and Probability Combined"}},"midMathCh00":{"chapter":"Chapter 00","title":"Intermediate Math and AI: Going One Step Deeper","description":"Intermediate math is where the language of AI becomes more precise. Instead of treating data as just numbers, this course views it as **vectors** and **matrices**, and studies the rules that move between them as **linear transformations**. You’ll also interpret how learning behaves by using **Jacobians** (how outputs change with many inputs) and **Hessians** (curvature information), so you can understand why training can be fast, slow, or unstable.","sectionTitle":"Vectors, matrices, and sensitivity: how intermediate math explains AI","sectionLabels":{"whatIs":"What it is","whyImportant":"Why it matters","howUsed":"How it is used","problemSolving":"Problem-solving guide"},"whatIs":{"0":"**Vector spaces** give a framework for describing data by both **direction** and **magnitude**. For example, an image can be represented as coordinates of learned features.","1":"A **matrix** represents transformations of vectors. In particular, **linear transformations** provide consistent rules for how coordinates change—this is exactly how each layer in a neural network can be expressed mathematically.","2":"**Jacobians** and **Hessians** are maps of sensitivity. 
Jacobians answer “how much the output changes when the inputs change,” while Hessians describe the curvature of the loss landscape. With these maps, you can design learning updates more intelligently."},"whyImportant":{"0":"Training is essentially repeated computation that reduces error. To understand why error decreases, you need multivariable change (gradients and sensitivity), which is the core of intermediate math.","1":"Linear algebra helps interpret representation. Many ideas (like embeddings and component analysis) reduce to “how vectors are rearranged.” Once you know the math, the results become explainable.","2":"Understanding **Hessians** helps you see why learning is slow near some regions and faster near others. Second-order information also underpins methods such as Newton’s method and trust-region approaches."},"howUsed":{"0":"In **forward pass**, input vectors are transformed by matrix multiplications and linear rules. This determines which features are emphasized and which are suppressed.","1":"In **backward pass**, you need how changes propagate—Jacobians play that role. The chain rule becomes a language for tracking how small changes reach the output, enabling accurate gradient computation.","2":"During optimization, curvature information (Hessians) can improve stability. 
Hessians tell you whether the loss surface is gently or sharply curved, which shapes the size of the update step."},"problemSolving":{"0":"| Topic | Role in AI | Intermediate concept |\n| --- | --- | --- |\n| **Similarity & direction** | Measure how aligned two feature vectors are | Dot product, projection |\n| **How a layer operates** | How one layer transforms vectors | Matrices, linear transformations |\n| **Sensitivity (change)** | How output changes when inputs change | Jacobians, gradients |\n| **Learning curvature** | How fast optimization proceeds | Hessians, eigenvalues |\n| **Uncertainty language** | Describe joint behavior of multiple variables | Covariance, multivariate normal |"}},"midMathCh01":{"chapter":"Chapter 01","title":"Vectors and Vector Space: Magnitude and Direction Together","description":"A vector is both a bundle of numbers and an object that encodes **magnitude and direction** at once. In machine learning each sample becomes a feature vector $\\mathbf x$; in deep learning embeddings and weights are vectors. This chapter builds the shared language of vectors in $\\mathbb R^n$ and prepares you for **Ch.02 Dot Product**.","sectionTitle":"Vectors and Vector Space: Magnitude and Direction Together","sectionLabels":{"whatIs":"What it is","whyImportant":"Why it matters","howUsed":"How it's used","problemSolving":"Problem-solving guide"},"visualShort":"Vectors: components · magnitude · direction · $\\mathbb R^n$","visualIntro":"Inputs are components $(v_x,v_y)$; scalar multiple $k\\mathbf v$ and sum $\\mathbf u+\\mathbf v$ are computed **componentwise**. 
$\\mathbb R^n$ is the space of all real vectors with $n$ components; its dimension is $n$.","visualStep1":"Data · parameters → vector $\\mathbf v\\in\\mathbb R^n$","visualStep2":"Scalar multiple $k\\mathbf v$, sum $\\mathbf u+\\mathbf v$ (componentwise)","visualStep3":"Space $\\mathbb R^n$: dimension $n$, $n$ components","visualStepsLabel":"View order","whatIs":{"intro":"**What is a vector?** An ordered list $\\mathbf v=(v_1,\\ldots,v_n)$ and, geometrically, an arrow with magnitude and direction. When a function has several real inputs, packing them into one vector keeps notation clean.","plain":"Navigation apps say “3 km east, 4 km north”: direction and distance together. On the plane that is one arrow, a 2D vector. In components it is $(3,4)$, and its length is $\\sqrt{3^2+4^2}=5$.","definition":"More formally, **$\\mathbb R^n$** consists of real vectors with $n$ components. **Addition** is componentwise; **scalar multiplication** multiplies each component by a real number. The **zero vector** $\\mathbf 0$ has all components equal to zero. The **Euclidean norm** is $\\|\\mathbf v\\|=\\sqrt{\\sum_i v_i^2}$; exercises often ask for $\\|\\mathbf v\\|^2=\\sum_i v_i^2$ instead, because it stays an integer when the components are integers.","inAI":"In supervised learning, the features form a vector $\\mathbf x\\in\\mathbb R^d$ and a linear model's weights form a vector $\\mathbf w\\in\\mathbb R^d$. Deep networks stack dot products and matrices; this chapter is the first step. In **Ch.10 Hessian** you will read **second derivatives (curvature)** on the same vector space."},"whyImportant":{"bridge":"The calculus view of “functions and continuity” carries over once you adopt the habit of packing **many inputs into one vector**. ML features, distances, and classification, as well as DL dot products and matrix multiplication, all rest on **vector language**.","language":"Two rules, “add only in the same dimension” and “a scalar multiple scales every component the same way,” are the core of **vector space structure**. 
Mastering it reduces confusion later for independence, basis, rank, and eigenvalues."},"howUsed":{"features":"**Feature vector**: one table row (height, weight, …) as $\\mathbf x$; preprocessing, normalization, and distances are vector operations. **kNN / clustering** often use norms of differences.","dlWeights":"**Deep learning**: a neuron computes dot product of input and weight vectors (next chapter) plus bias and activation. Embeddings are vectors in a “meaning space.” **Vector = minimal bundle of numbers AI reads.**"},"summary":"**In sum**, vectors unify geometric (direction, magnitude) and algebraic (components) views; $\\mathbb R^n$ is the space of all $n$-dimensional real vectors. Addition and scalar multiplication are componentwise; inner product, matrices, and derivatives build on this. **Ch.02** turns “how similar” into a number.","problemSolving":{"focus":"The table summarizes **formulas and symbols**; the **item-by-item notes** below explain each definition. **Worked examples** walk through representative types step by step.","examplesHeading":"Worked examples","examplesTable":"$1e"},"problemSolvingLabel":"Problem-solving guide","problemSolvingTable":"$1f","visualFlowTitle":"Learning flow","visualFlowStep0":"Concept: vector · components · $\\mathbb R^n$","visualFlowStep1":"Intuition: arrow (direction · length)","visualFlowStep2":"Math: sum · scalar · norm · dot","visualFlowStep3":"Use: features · embeddings · weights","visualArrowTitle":"Vector = direction + length","visualComponentTitle":"Same direction · length × k","visualAriaLabel":"Vector sum and scalar multiple diagram. Left: u, v, and u plus v. 
Right: baseline u and k times u on the same line.","visualLegendGray":"Baseline u","visualLegendBlue":"k·u","visualRnLabel":"Closed in $\\mathbb R^2$","problemPromptIntro":"Read each problem and enter the vector-operation result as an integer.","promptDefinition":"If the statement is **true**, choose **1**; if **false**, choose **0**.","promptDefinitionChoice":"Three options (a), (b), and (c) are listed below. Choose the correct one.","promptMagnitudeSquared2D":"For $\\mathbf v=({vx},{vy})$, what is $\\|\\mathbf v\\|^2$ (integer)?","promptDotProduct2D":"For $\\mathbf u=({ux},{uy})$ and $\\mathbf v=({vx},{vy})$, what is $\\mathbf u\\cdot\\mathbf v$ (integer)?","promptSumComponent2D":"For $\\mathbf u=({ux},{uy})$ and $\\mathbf v=({vx},{vy})$, what is the value of $(\\mathbf u+\\mathbf v)_{axis}$ (integer)? (component: {axis})","promptScalarMultComponent2D":"For $\\mathbf u=({ux},{uy})$, what is the value of $({k}\\mathbf u)_{axis}$ (integer)? (component: {axis})","promptDimensionRn":"What is the dimension of $\\mathbb R^{n}$ (integer)? ($n={n}$)","promptNumComponentsRn":"How many components does a vector in $\\mathbb R^{n}$ have (integer)? 
($n={n}$)","promptCrossZ2D":"For $\\mathbf u=({ux},{uy})$ and $\\mathbf v=({vx},{vy})$, what is $u_x v_y - u_y v_x$ (integer)?","promptNormMinusSquared2D":"For $\\mathbf u=({ux},{uy})$ and $\\mathbf v=({vx},{vy})$, what is $\\|\\mathbf u\\|^2-\\|\\mathbf v\\|^2$ (integer)?","promptDefault":"Choose the correct option below.","mcDefChoice1":"(a)","mcDefChoice2":"(b)","mcDefChoice3":"(c)","mcDefChoice4":"(d) None of (a)–(c)","definitionStatements":{"0":"A vector has magnitude and direction and can be written as components.","1":"A vector in $\\mathbb R^n$ has $n$ real components.","2":"The sum of two vectors of the same dimension is defined componentwise.","3":"Scalar multiplication $k\\mathbf v$ multiplies each component of $\\mathbf v$ by $k$.","4":"The zero vector has all components equal to zero.","5":"A vector space must be closed under addition and scalar multiplication.","6":"$$\\mathbb R^2$ is a 2-dimensional vector space over the reals.","7":"If one vector is a scalar multiple of another, they lie on the same line through the origin.","10":"The Euclidean norm $\\|\\mathbf v\\|$ can be negative.","11":"The dimension of $\\mathbb R^3$ is 2.","12":"You can define the sum $\\mathbf u+\\mathbf v$ for vectors $\\mathbf u$ and $\\mathbf v$ of different dimensions.","13":"Vector addition does **not** satisfy associativity $(\\mathbf u+\\mathbf v)+\\mathbf w=\\mathbf u+(\\mathbf v+\\mathbf w)$.","14":"The dot product $\\mathbf u\\cdot\\mathbf v$ of real vectors is always a vector."},"definitionChoiceQuestions":{"0":"(a) $4$\n(b) $5$\n(c) $6$\n\nQuestion: What is the dimension of $\\mathbb R^5$?","1":"(a) $2$\n(b) $3$\n(c) $1$\n\nQuestion: What is the dimension of $\\mathbb R^2$?","2":"(a) $16$\n(b) $25$\n(c) $9$\n\nQuestion: For $\\mathbf v=(3,4)$, what is $\\|\\mathbf v\\|^2$?","3":"(a) $3$\n(b) $2$\n(c) $5$\n\nQuestion: What is the $y$-component of $2\\mathbf e_1+3\\mathbf e_2$? 
($\\mathbf e_1=(1,0),\\mathbf e_2=(0,1)$)","4":"(a) Always $\\mathbf v$\n(b) Always $\\mathbf 0$\n(c) Undefined\n\nQuestion: When $k=0$, what is $k\\mathbf v$?","5":"(a) Parallel\n(b) Orthogonal\n(c) Equal\n\nQuestion: If $\\mathbf u\\cdot\\mathbf v=0$, the vectors are often said to be:","6":"(a) $n-1$\n(b) $n$\n(c) $2n$\n\nQuestion: How many components does a vector in $\\mathbb R^n$ have?","7":"(a) $5$\n(b) $4$\n(c) $3$\n\nQuestion: What is the $x$-component of $(1,2)+(3,4)$?"}},"midMathCh02":{"chapter":"Chapter 02","title":"Dot Product and Orthogonal Projection: Measuring Similarity with Numbers","description":"The **dot product** compresses “how aligned two vectors are” into **a single number**. An **orthogonal projection** moves one vector onto the line (or subspace) spanned by another—like a **shadow**. On $\\mathbb{R}^n$ from Ch.01, this chapter trains you to read **similarity, angles, and distance** in the language of dot products, and connects naturally to **similarity, attention, and linear layers** in ML and deep learning.","sectionTitle":"Dot Product and Orthogonal Projection: Measuring Similarity with Numbers","sectionLabels":{"whatIs":"What the idea is","whyImportant":"Why it matters","howUsed":"How it is used","problemSolving":"How to solve problems"},"visualShort":"Dot product · angle · projection · cosine similarity","visualIntro":"For arrows $\\mathbf{u},\\mathbf{v}$, the dot product $\\mathbf{u}\\cdot\\mathbf{v}$ couples lengths and angle. 
The vector you get by “dropping $\\mathbf{v}$ onto $\\mathbf{u}$” is the projection $\\mathrm{proj}_{\\mathbf{u}}\\mathbf{v}$, and the remainder $\\mathbf{v}-\\mathrm{proj}_{\\mathbf{u}}\\mathbf{v}$ is **orthogonal** to $\\mathbf{u}$.","visualStep1":"Concept: $\\mathbf{u}\\cdot\\mathbf{v}=\\sum_i u_i v_i=\\|\\mathbf{u}\\|\\|\\mathbf{v}\\|\\cos\\theta$","visualStep2":"Intuition: positive if similar direction, 0 if orthogonal, negative if opposite","visualStep3":"Projection: $\\mathrm{proj}_{\\mathbf{u}}\\mathbf{v}=\\frac{\\mathbf{u}\\cdot\\mathbf{v}}{\\mathbf{u}\\cdot\\mathbf{u}}\\mathbf{u}$","visualStep4":"Applications: embedding similarity, linear layers, least-squares as projection","visualStepsLabel":"Suggested viewing order","visualFlowTitle":"Learning flow","visualFlowStep0":"Concept: dot product · angle · orthogonality","visualFlowStep1":"Intuition: shadow (projection) · residual","visualFlowStep2":"Math: projection · cosine · Pythagoras","visualFlowStep3":"Applications: recommendation · deep layers · dimensionality reduction","dotVisualAriaLabel":"Dot product, projection, and cosine similarity: rotating vector with live readouts","dotVisualMainTitle":"Similarity as v spins","dotVisualPlotTitle":"Plane: u, v, projection","dotVisualMetricsTitle":"Direction, cosine & values","dotVisualHudDot":"u·v","dotVisualHudCos":"cos θ (direction)","dotVisualHudProj":"|proj| / |v|","dotVisualLegendU":"base u","dotVisualLegendV":"rotating v","dotVisualLegendProj":"shadow","dotVisualLegendRes":"residual ⊥ u","dotVisualInsetLabel":"direction","dotVisualCaption":"As **green vector** $v$ rotates, **$\\theta$** changes, and the **amber shadow (projection)** length, **dot product**, and $\\cos\\theta$ move together. Nearer **same direction** → larger **dot product**; **orthogonal** → $0$; **opposite** → **negative**. 
The small circle shows **only** $v$'s **direction**.","whatIs":{"intro":"The **dot product** is the “multiply matching entries and add” rule from Ch.01 folded into **one number**. Geometrically it is $\\|\\mathbf{u}\\|\\|\\mathbf{v}\\|\\cos\\theta$. A **projection** onto a direction is the **shadow vector** you get after rescaling by that dot-product coefficient.","plain":"In plain words, the dot product scores **how much two arrows point the same way**. Same direction → large positive; perpendicular → $0$; opposite → negative. Think of projection as the **shadow** of a flashlight on a wall.","definition":"$20","inAI":"In **deep learning**, each linear layer is built from dot products between rows of weights and the input. **Attention** uses query–key dot products (or scores) to decide where to look. **Recommendation** uses dot products / cosines between user and item embeddings."},"whyImportant":{"bridge":"After Ch.01’s “boxes of numbers,” this chapter is the rule that **pairs boxes to make one score**. That score becomes the common language for **distance, angle, and similarity** before matrices, eigenvalues, and optimization.","similarity":"To make “similar” precise you need a **measure**. Dot products and cosines separate **direction vs magnitude** in high dimensions and tie directly to preprocessing (e.g. normalization)."},"howUsed":{"ml":"**Machine learning:** similarity for kNN, kernels, linear/logistic terms $\\mathbf{w}\\cdot\\mathbf{x}$; outliers may show up as small dot products or large angles.","geometry":"**Geometry:** least-squares fits as **projection** onto the column space; PCA / orthogonal bases; Gram–Schmidt subtracts projections to orthogonalize."},"summary":"**Summary:** The dot product is “sum of products of components” and couples length and angle; projection is the **shadow** along a direction; cosine focuses on direction; projections pair with orthogonal residuals. 
**Ch.03 matrices** bundle many dot products at once.","problemSolving":{"focus":"The table summarizes **formulas and symbol meanings** for solving problems, followed by **item-by-item notes** on why those definitions are set up that way. **Worked examples** walk through representative types step by step.","examplesHeading":"Worked examples","examplesTable":"$21"},"problemSolvingLabel":"Problem-solving notes","problemSolvingTable":"$22","practiceProblemsTitle":"Practice problems","practiceProblemsIntro":"Below are **10 problems** sampled from a bank of **60** (easy 4 · medium 3 · hard 3; order easy→medium→hard). Each item is **multiple choice**; pick the option number.","practiceProblemsInstruction":"Read the question and choose the best option.","problems":{"definition_0":"In $\\mathbb{R}^n$, which expression best matches writing the **dot product** $\\mathbf{u}\\cdot\\mathbf{v}$ in components?\n\n① $\\sum_i u_i v_i$ (multiply matching entries and add)\n② $\\sum_i (u_i + v_i)$\n③ $\\max_i u_i v_i$\n④ $\\prod_i u_i v_i$","definition_1":"When two vectors are **orthogonal**, what is $\\mathbf{u}\\cdot\\mathbf{v}$?\n\n① Always $0$\n② Always $1$\n③ Always positive\n④ Always a vector","definition_2":"In $\\|\\mathbf{u}\\|\\|\\mathbf{v}\\|\\cos\\theta$, what does $\\theta$ represent?\n\n① The **angle** between the vectors (the smaller of the two, in $[0,\\pi]$)\n② The dimension\n③ Only the norms\n④ The rank of a matrix","definition_3":"For $\\mathbf{u}\\neq\\mathbf{0}$, the **orthogonal projection** of $\\mathbf{v}$ onto $\\mathbf{u}$ is which vector?\n\n① $\\dfrac{\\mathbf{u}\\cdot\\mathbf{v}}{\\mathbf{u}\\cdot\\mathbf{u}}\\,\\mathbf{u}$\n② $\\mathbf{v}-\\mathbf{u}$\n③ $\\dfrac{\\mathbf{v}}{\\|\\mathbf{u}\\|}$\n④ $\\mathbf{u}\\times\\mathbf{v}$","definition_4":"What is the range of **cosine similarity** $\\dfrac{\\mathbf{u}\\cdot\\mathbf{v}}{\\|\\mathbf{u}\\|\\|\\mathbf{v}\\|}$ for nonzero real vectors?\n\n① $[-1,1]$\n② $[0,\\infty)$\n③ $(-\\infty,\\infty)$ only\n④ Only $0$ or 
$1$","definition_5":"The result of $\\mathbf{u}\\cdot\\mathbf{v}$ is best described as:\n\n① A **scalar** (one real number)\n② Always a vector\n③ Always a matrix\n④ Always a Boolean","definition_6":"Which relation always holds between $\\|\\mathrm{proj}_{\\mathbf{u}}\\mathbf{v}\\|$ and $\\|\\mathbf{v}\\|$?\n\n① $\\|\\mathrm{proj}_{\\mathbf{u}}\\mathbf{v}\\|\\le \\|\\mathbf{v}\\|$\n② $\\|\\mathrm{proj}_{\\mathbf{u}}\\mathbf{v}\\|> \\|\\mathbf{v}\\|$ always\n③ They are always equal\n④ Cannot compare","definition_7":"In logistic regression with $z=\\mathbf{w}\\cdot\\mathbf{x}+b$, what does $\\mathbf{w}\\cdot\\mathbf{x}$ mainly encode?\n\n① A score of **alignment** between weight and feature vectors\n② A cross product\n③ A determinant\n④ The probability itself","definition_8":"Which is a correct **property of the dot product**? ($\\mathbf{a},\\mathbf{b},\\mathbf{c}$ same dimension; $c$ is a scalar)\n\n① $(c\\mathbf{a})\\cdot\\mathbf{b}=c(\\mathbf{a}\\cdot\\mathbf{b})$\n② $(\\mathbf{a}\\cdot\\mathbf{b})\\cdot\\mathbf{c}$ is always defined\n③ $\\mathbf{a}\\cdot\\mathbf{b}=\\mathbf{a}+\\mathbf{b}$\n④ The dot product never commutes","definition_9":"To connect with $\\mathbb{R}^n$ from Ch.01, for $\\mathbf{u}\\cdot\\mathbf{v}$ to be defined we need:\n\n① The same dimension (**same $n$**)\n② Different dimensions are fine\n③ Both must be unit vectors\n④ One must be the zero vector","trueFalse_0":"If the statement is **true**, choose ①; if **false**, choose ②.\n\n$\\mathbf{u}\\cdot\\mathbf{v}=0$ always implies both vectors are the zero vector.\n\n① True\n② False\n③ Neither\n④ Empty statement","trueFalse_1":"If the statement is **true**, choose ①; if **false**, choose ②.\n\nFor every $\\mathbf{v}$, $\\mathbf{0}\\cdot\\mathbf{v}=0$.\n\n① True\n② False\n③ Neither\n④ Empty statement","trueFalse_2":"If the statement is **true**, choose ①; if **false**, choose ②.\n\n$\\mathbf{u}\\cdot\\mathbf{v}=\\mathbf{v}\\cdot\\mathbf{u}$ always holds (when defined).\n\n① True\n② False\n③ 
Neither\n④ Empty statement","trueFalse_3":"If the statement is **true**, choose ①; if **false**, choose ②.\n\nThe projection $\\mathrm{proj}_{\\mathbf{u}}\\mathbf{v}$ is always parallel to $\\mathbf{u}$ ($\\mathbf{u}\\neq\\mathbf{0}$).\n\n① True\n② False\n③ Neither\n④ Empty statement","trueFalse_4":"If the statement is **true**, choose ①; if **false**, choose ②.\n\nCosine similarity is always nonnegative.\n\n① True\n② False\n③ Neither\n④ Empty statement","trueFalse_5":"If the statement is **true**, choose ①; if **false**, choose ②.\n\n$\\|\\mathbf{u}+\\mathbf{v}\\|^2=\\|\\mathbf{u}\\|^2+\\|\\mathbf{v}\\|^2$ always holds.\n\n① True\n② False\n③ Neither\n④ Empty statement","trueFalse_6":"If the statement is **true**, choose ①; if **false**, choose ②.\n\nDot products are linear: $\\mathbf{u}\\cdot(\\mathbf{v}+\\mathbf{w})=\\mathbf{u}\\cdot\\mathbf{v}+\\mathbf{u}\\cdot\\mathbf{w}$.\n\n① True\n② False\n③ Neither\n④ Empty statement","trueFalse_7":"If the statement is **true**, choose ①; if **false**, choose ②.\n\n$\\mathbf{u}\\cdot\\mathbf{u}=\\|\\mathbf{u}\\|^2$.\n\n① True\n② False\n③ Neither\n④ Empty statement","trueFalse_8":"If the statement is **true**, choose ①; if **false**, choose ②.\n\nIn recommender systems, dot products / cosines can score similarity between user and item embeddings.\n\n① True\n② False\n③ Neither\n④ Empty statement","trueFalse_9":"If the statement is **true**, choose ①; if **false**, choose ②.\n\nThe residual $\\mathbf{v}-\\mathrm{proj}_{\\mathbf{u}}\\mathbf{v}$ is orthogonal to $\\mathbf{u}$ ($\\mathbf{u}\\neq\\mathbf{0}$).\n\n① True\n② False\n③ Neither\n④ Empty statement","calc_0":"$$\\mathbf{u}=(2,3)$, $\\mathbf{v}=(4,-1)$. What is $\\mathbf{u}\\cdot\\mathbf{v}$?\n\n① $5$\n② $11$\n③ $-5$\n④ $14$","calc_1":"$$\\mathbf{a}=(1,1,1)$, $\\mathbf{b}=(2,-3,1)$. What is $\\mathbf{a}\\cdot\\mathbf{b}$?\n\n① $0$\n② $3$\n③ $6$\n④ $-1$","calc_2":"$$\\|\\mathbf{u}\\|=5$, $\\|\\mathbf{v}\\|=4$, and the vectors point the same direction. 
What is $\\mathbf{u}\\cdot\\mathbf{v}$?\n\n① $20$\n② $9$\n③ $1$\n④ $0$","calc_3":"$$\\mathbf{u}=(3,4)$. What is $\\mathbf{u}\\cdot\\mathbf{u}$?\n\n① $25$\n② $5$\n③ $12$\n④ $7$","calc_4":"$$\\mathbf{u}=(2,0)$, $\\mathbf{v}=(1,\\sqrt{3})$. What is cosine similarity $\\dfrac{\\mathbf{u}\\cdot\\mathbf{v}}{\\|\\mathbf{u}\\|\\|\\mathbf{v}\\|}$?\n\n① $\\dfrac{1}{2}$\n② $1$\n③ $0$\n④ $\\dfrac{\\sqrt{3}}{2}$","calc_5":"$$\\mathbf{u}=(1,2)$, $\\mathbf{v}=(2,4)$, and $\\mathrm{proj}_{\\mathbf{u}}\\mathbf{v}=\\alpha\\mathbf{u}$. What is $\\alpha$?\n\n① $2$\n② $1$\n③ $0$\n④ $4$","calc_6":"$$\\mathbf{e}_1=(1,0,0)$, $\\mathbf{v}=(3,-2,6)$. What is the first component of $\\mathrm{proj}_{\\mathbf{e}_1}\\mathbf{v}$ (the $x$-coordinate)?\n\n① $3$\n② $6$\n③ $-2$\n④ $0$","calc_7":"$$\\mathbf{u}=(1,0)$, $\\mathbf{v}=(0,5)$. What is $\\|\\mathrm{proj}_{\\mathbf{u}}\\mathbf{v}\\|$?\n\n① $0$\n② $1$\n③ $5$\n④ $25$","calc_8":"What is the norm $\\|\\mathbf{a}\\|$ for $\\mathbf{a}=(1,2,2)$?\n\n① $3$\n② $9$\n③ $\\sqrt{5}$\n④ $5$","calc_9":"$$\\mathbf{u}=(-1,2)$, $\\mathbf{v}=(4,2)$. 
What is $\\mathbf{u}\\cdot\\mathbf{v}$?\n\n① $0$\n② $10$\n③ $-4$\n④ $6$","concept_0":"In deep learning, which best matches treating an attention score as a dot product?\n\n① It scores **alignment** between query and key vectors\n② It only uses norms\n③ It disables backprop\n④ It only looks at activations","concept_1":"In least squares with a design matrix whose columns are orthonormal, what becomes easier?\n\n① **Independently** interpreting each coefficient\n② Always diverging\n③ Learning rate becomes 0\n④ Dot products are always 0","concept_2":"When feature scales differ a lot, cosine similarity is often preferable to Euclidean distance because:\n\n① You care about **direction** more than **magnitude**\n② You want larger lengths\n③ Derivatives vanish\n④ It is always slower","concept_3":"Which operation is central to Gram–Schmidt?\n\n① **Subtract** projections onto previous directions to orthogonalize\n② Computing determinants\n③ Only eigenvalues\n④ Probability integrals","concept_4":"What basic idea connects PCA (covariance eigenvectors) to projections and orthogonality?\n\n① Maximizing variance along **orthogonal** axes (quadratic forms)\n② Dot products are always 0\n③ Only cross products\n④ Probability only","concept_5":"For loss $L(\\mathbf{w})=\\|\\mathbf{y}-X\\mathbf{w}\\|^2$ with minimizer $\\hat{\\mathbf{w}}$, what role does $X\\hat{\\mathbf{w}}$ play?\n\n① It is the **orthogonal projection** of $\\mathbf{y}$ onto the column space of $X$\n② Random noise\n③ Always zero\n④ An activation function","concept_6":"In a linear layer $\\mathbf{z}=W\\mathbf{x}$ (before ReLU), what is the entry $z_i=\\mathbf{w}_i^{\\mathsf T}\\mathbf{x}$?\n\n① One **dot product** between a weight row and the input\n② A cross product\n③ Softmax\n④ Only batch norm","concept_7":"Why can cosine similarity become unstable when $\\|\\mathbf{u}\\|$ is tiny?\n\n① The denominator $\\|\\mathbf{u}\\|\\|\\mathbf{v}\\|$ is **near 0**, so small numerical errors are greatly amplified\n② Dot product is always 0\n③ Cosine is always 1\n④ Vectors are 
orthogonal","concept_8":"If word embeddings are **unit-normalized**, what happens when you compare them with cosine similarity?\n\n① Cosine $\\approx$ plain **dot product** (direction only)\n② Always wrong\n③ Dot product undefined\n④ Dimension changes","concept_9":"Which description highlights that an **orthogonal projection** is a **linear map**?\n\n① It preserves sums and scalar multiples (can be written as a matrix $P$)\n② Always nonlinear\n③ Only rotations\n④ Only changes probability","projection_0":"$$\\mathbf{u}=(1,1)$, $\\mathbf{v}=(3,0)$. If $\\mathrm{proj}_{\\mathbf{u}}\\mathbf{v}=(a,a)$, what is $a$?\n\n① $\\dfrac{3}{2}$\n② $3$\n③ $\\dfrac{1}{2}$\n④ $0$","projection_1":"$$\\mathbf{u}=(2,1)$, $\\mathbf{v}=(1,2)$. What is the $x$-component of $\\mathrm{proj}_{\\mathbf{u}}\\mathbf{v}$?\n\n① $\\dfrac{8}{5}$\n② $2$\n③ $1$\n④ $0$","projection_2":"Project $\\mathbf{v}=(6,8)$ onto $\\mathbf{e}_1=(1,0)$. What is the norm of the result?\n\n① $6$\n② $8$\n③ $10$\n④ $0$","projection_3":"If $\\mathbf{\\hat{u}}$ is a unit vector, what simplified form does the projection onto $\\mathbf{\\hat{u}}$ take?\n\n① $(\\mathbf{v}\\cdot\\mathbf{\\hat{u}})\\,\\mathbf{\\hat{u}}$\n② $\\mathbf{v}-\\mathbf{\\hat{u}}$\n③ Only $\\|\\mathbf{v}\\|\\mathbf{\\hat{u}}$\n④ $\\mathbf{\\hat{u}}/\\|\\mathbf{v}\\|$","projection_4":"$$\\mathbf{a}=(1,1,1)$, $\\mathbf{b}=(1,0,0)$. What is the sum of the three components of $\\mathrm{proj}_{\\mathbf{a}}\\mathbf{b}$?\n\n① $1$\n② $3$\n③ $0$\n④ $\\dfrac{1}{3}$","projection_5":"Let $\\mathbf{r}=\\mathbf{v}-\\mathrm{proj}_{\\mathbf{u}}\\mathbf{v}$. What is $\\mathbf{r}\\cdot\\mathbf{u}$? ($\\mathbf{u}\\neq\\mathbf{0}$)\n\n① $0$\n② $\\|\\mathbf{u}\\|^2$\n③ $\\|\\mathbf{v}\\|^2$\n④ $1$","projection_6":"Let $\\mathbf{\\hat{u}}$ be the unit vector in the direction of $\\mathbf{u}=(4,3)$. What is $\\|\\mathrm{proj}_{\\mathbf{\\hat{u}}}\\mathbf{v}\\|$ for $\\mathbf{v}=(1,0)$? 
(Use dot products only.)\n\n① $\\dfrac{4}{5}$\n② $1$\n③ $\\dfrac{3}{5}$\n④ $5$","projection_7":"The area of the parallelogram spanned by two plane vectors is $\\|\\mathbf{u}\\|\\|\\mathbf{v}\\||\\sin\\theta|$; in 3D this equals $\\|\\mathbf{u}\\times\\mathbf{v}\\|$. How does this connect to dot products?\n\n① $\\sin^2\\theta=1-\\cos^2\\theta$ links to the **orthogonal component**\n② Unrelated to dot products\n③ Always 0\n④ Norms are always 1","projection_8":"Let $\\mathbf{v}=\\mathbf{p}+\\mathbf{r}$ be the orthogonal decomposition where $\\mathbf{p}$ is the projection onto $\\mathbf{u}$ and $\\mathbf{r}$ is the residual. Which Pythagorean relation holds between $\\|\\mathbf{v}\\|^2$ and $\\|\\mathbf{p}\\|^2+\\|\\mathbf{r}\\|^2$?\n\n① Always $\\|\\mathbf{v}\\|^2=\\|\\mathbf{p}\\|^2+\\|\\mathbf{r}\\|^2$\n② Always $\\|\\mathbf{v}\\|^2=\\|\\mathbf{p}\\|^2-\\|\\mathbf{r}\\|^2$\n③ Never holds\n④ $\\|\\mathbf{p}\\|=\\|\\mathbf{r}\\|$","projection_9":"For matrix $A$, if $y_i=\\mathbf{a}_i\\cdot\\mathbf{x}$ for each row $\\mathbf{a}_i^{\\mathsf T}$, what viewpoint is this?\n\n① The coordinates of $A\\mathbf{x}$ are **row–vector dot products** with $\\mathbf{x}$\n② Cross product magnitude\n③ Determinant\n④ Variance","scenario_0":"Two document embeddings have cosine similarity $0.92$. In a recommender, a simple reading is:\n\n① Topics are **fairly aligned** (after scale normalization)\n② The probability is 92%\n③ The documents have equal length\n④ They must use the same words","scenario_1":"Image feature dim $\\neq$ text feature dim. To use cosine similarity directly you should:\n\n① First map both into the **same dimension** (shared embedding space)\n② It always works if dims differ\n③ Dot products ignore dimension\n④ Match probabilities only","scenario_2":"Mini-batch SGD loss is noisy. 
Which intuition about gradient $\\mathbf{g}$ matches an update step?\n\n① A step moves mainly opposite to $\\mathbf{g}$ (**steepest descent** direction)\n② Always same direction as $\\mathbf{g}$\n③ Unrelated to $\\mathbf{g}$\n④ Dot product always 0","scenario_3":"Collaborative filtering predicts $\\hat{r}=\\mathbf{u}\\cdot\\mathbf{v}$. If the dot product is large, the model usually means:\n\n① User and item factors **line up** (under the model)\n② Always dislike\n③ Cannot learn\n④ Probability equals 1","scenario_4":"Why divide by $\\sqrt{d_k}$ in scaled dot-product attention (Transformer)?\n\n① Reduce variance so softmax is less **saturated**\n② Remove dot products\n③ Disable backprop\n④ Force orthogonality","scenario_5":"After standardizing features, linear SVM margins connect naturally to:\n\n① Separating via distances/angles in an **inner-product** space (kernels)\n② Probability only\n③ Clustering only\n④ Unsupervised only","scenario_6":"When is cosine better than Euclidean distance between two autoencoder latent vectors?\n\n① When **direction** (pattern) matters more than **length**\n② Only when distance is always better\n③ Only without images\n④ Never","scenario_7":"Which is closest to how **projection** appears in ML pipelines?\n\n① Visualizing high-dim data in a **low-dimensional subspace** (e.g. 
PCA)\n② Only estimating probabilities\n③ Always deleting data\n④ Only changing batch size","scenario_8":"Why can a larger dot product after normalization **not** guarantee semantic similarity?\n\n① Embeddings depend on **training data and objectives**\n② Dot products are always wrong\n③ Cosine is always 0\n④ Vectors are orthogonal","scenario_9":"Viewing matrix–vector product $A\\mathbf{x}$ (Ch.03) through dot products:\n\n① It is the vector of **row dot products** with $\\mathbf{x}$\n② Determinant only\n③ Always a scalar\n④ Cross products only"},"problemAnswers":{"definition_0":1,"definition_1":1,"definition_2":1,"definition_3":1,"definition_4":1,"definition_5":1,"definition_6":1,"definition_7":1,"definition_8":1,"definition_9":1,"trueFalse_0":2,"trueFalse_1":1,"trueFalse_2":1,"trueFalse_3":1,"trueFalse_4":2,"trueFalse_5":2,"trueFalse_6":1,"trueFalse_7":1,"trueFalse_8":1,"trueFalse_9":1,"calc_0":1,"calc_1":1,"calc_2":1,"calc_3":1,"calc_4":1,"calc_5":1,"calc_6":1,"calc_7":1,"calc_8":1,"calc_9":1,"concept_0":1,"concept_1":1,"concept_2":1,"concept_3":1,"concept_4":1,"concept_5":1,"concept_6":1,"concept_7":1,"concept_8":1,"concept_9":1,"projection_0":1,"projection_1":1,"projection_2":1,"projection_3":1,"projection_4":1,"projection_5":1,"projection_6":1,"projection_7":1,"projection_8":1,"projection_9":1,"scenario_0":1,"scenario_1":1,"scenario_2":1,"scenario_3":1,"scenario_4":1,"scenario_5":1,"scenario_6":1,"scenario_7":1,"scenario_8":1,"scenario_9":1},"problemSolutions":{"definition_0":"**(1) Concept:** Multiply matching components and add. **(2) Example:** $\\mathbf{u}=(1,2)$, $\\mathbf{v}=(3,-1)$ gives $1\\cdot3+2\\cdot(-1)=1$. **(3) Answer: ①**","definition_1":"**(1) Concept:** Orthogonal means $90^\\circ$, so $\\cos\\theta=0$ and the dot product is $0$. **(2) Example:** $(1,0)\\cdot(0,1)=0$. **(3) Answer: ①**","definition_2":"**(1) Concept:** In $\\mathbf{u}\\cdot\\mathbf{v}=\\|\\mathbf{u}\\|\\|\\mathbf{v}\\|\\cos\\theta$, $\\theta$ is the angle between arrows. 
**(2) Example:** Same direction $\\Rightarrow\\theta=0$. **(3) Answer: ①**","definition_3":"**(1) Concept:** Keep only the $\\mathbf{u}$ component by projecting; coefficient $\\dfrac{\\mathbf{u}\\cdot\\mathbf{v}}{\\|\\mathbf{u}\\|^2}$. **(2) Example:** $\\mathbf{u}=(1,0)$, $\\mathbf{v}=(3,4)$ $\\Rightarrow$ projection $(3,0)$. **(3) Answer: ①**","definition_4":"**(1) Concept:** Cosine lies in $[-1,1]$. **(2) Example:** Same direction $\\approx1$, opposite $\\approx-1$, orthogonal $0$. **(3) Answer: ①**","definition_5":"**(1) Concept:** The dot product is a single real **scalar**. **(2) Example:** $(1,2)\\cdot(3,1)=5$. **(3) Answer: ①**","definition_6":"**(1) Concept:** A projection cannot be longer than the original vector (right triangle). **(2) Example:** Equality if $\\mathbf{v}$ is already parallel to $\\mathbf{u}$. **(3) Answer: ①**","definition_7":"**(1) Concept:** Linear models use dot products to score feature alignment. **(2) Example:** Text similarity often uses dot/cosine-style scores. **(3) Answer: ①**","definition_8":"**(1) Concept:** Scalar multiples factor out of one side. **(2) Example:** $(2\\mathbf{a})\\cdot\\mathbf{b}=2(\\mathbf{a}\\cdot\\mathbf{b})$. **(3) Answer: ①**","definition_9":"**(1) Concept:** Componentwise multiply/add needs equal length. **(2) Example:** $(1,2)\\in\\mathbb{R}^2$ vs $(1,2,0)\\in\\mathbb{R}^3$ cannot pair. **(3) Answer: ①**","trueFalse_0":"**(1) Counterexample:** $(1,0)\\cdot(0,1)=0$ but neither vector is zero (orthogonal). **(2) Answer: ②**","trueFalse_1":"**(1) Concept:** $\\mathbf{0}$ has all zero components. **(2) Answer: ①**","trueFalse_2":"**(1) Concept:** Commutativity of the dot product. **(2) Answer: ①**","trueFalse_3":"**(1) Concept:** The projection lies on the line spanned by $\\mathbf{u}$. **(2) Answer: ①**","trueFalse_4":"**(1) Counterexample:** Opposite directions can make cosine negative. 
**(2) Answer: ②**","trueFalse_5":"**(1) Concept:** $\\|\\mathbf{u}+\\mathbf{v}\\|^2=\\|\\mathbf{u}\\|^2+\\|\\mathbf{v}\\|^2+2\\mathbf{u}\\cdot\\mathbf{v}$. **(2) Answer: ②**","trueFalse_6":"**(1) Concept:** Distributivity / linearity. **(2) Answer: ①**","trueFalse_7":"**(1) Example:** $(3,4)\\cdot(3,4)=25=\\|\\mathbf{u}\\|^2$. **(2) Answer: ①**","trueFalse_8":"**(1) Practice:** Similarity scoring in recommender systems. **(2) Answer: ①**","trueFalse_9":"**(1) Concept:** The residual is orthogonal to $\\mathbf{u}$. **(2) Answer: ①**","calc_0":"**(1) Compute:** $2\\cdot4+3\\cdot(-1)=5$. **(2) Answer: ①**","calc_1":"**(1) Compute:** $2-3+1=0$. **(2) Answer: ①**","calc_2":"**(1) Concept:** Same direction $\\Rightarrow\\cos\\theta=1$, dot $=5\\cdot4=20$. **(2) Answer: ①**","calc_3":"**(1) Compute:** $9+16=25=\\|\\mathbf{u}\\|^2$. **(2) Answer: ①**","calc_4":"**(1) Compute:** Dot $=2$, norms $2$ and $2$ $\\Rightarrow$ $\\dfrac{2}{4}=\\dfrac{1}{2}$. **(2) Answer: ①**","calc_5":"**(1) Compute:** $\\mathbf{u}\\cdot\\mathbf{v}=10$, $\\mathbf{u}\\cdot\\mathbf{u}=5$ $\\Rightarrow$ $\\alpha=2$. **(2) Answer: ①**","calc_6":"**(1) Concept:** Projection onto $\\mathbf{e}_1$ keeps the $x$ component: $3$. **(2) Answer: ①**","calc_7":"**(1) Compute:** $\\mathbf{v}$ is orthogonal to $\\mathbf{u}$, so projection length $0$. **(2) Answer: ①**","calc_8":"**(1) Compute:** $\\sqrt{1+4+4}=3$. **(2) Answer: ①**","calc_9":"**(1) Compute:** $-4+4=0$ (orthogonal). **(2) Answer: ①**","concept_0":"**(1) Practice:** Larger dot $\\Rightarrow$ more attention mass. **(2) Answer: ①**","concept_1":"**(1) Concept:** Orthogonal columns decouple coefficient effects. **(2) Answer: ①**","concept_2":"**(1) Intuition:** Cosine focuses on topic direction, not document length. **(2) Answer: ①**","concept_3":"**(1) Concept:** Subtract projections to orthogonalize. **(2) Answer: ①**","concept_4":"**(1) Bridge:** Dot products, orthogonality, projections underpin PCA. 
**(2) Answer: ①**","concept_5":"**(1) Practice:** Least squares $\\Leftrightarrow$ projection onto $\\mathrm{Col}(X)$. **(2) Answer: ①**","concept_6":"**(1) Practice:** Deep layers stack dot products. **(2) Answer: ①**","concept_7":"**(1) Practice:** Stabilize with regularization / clipping when norms are tiny. **(2) Answer: ①**","concept_8":"**(1) Concept:** On the unit sphere, $\\mathbf{u}\\cdot\\mathbf{v}=\\cos\\theta$. **(2) Answer: ①**","concept_9":"**(1) Concept:** Projection matrices $P=\\dfrac{\\mathbf{u}\\mathbf{u}^{\\mathsf T}}{\\mathbf{u}^{\\mathsf T}\\mathbf{u}}$. **(2) Answer: ①**","projection_0":"**(1) Compute:** $\\mathbf{u}\\cdot\\mathbf{v}=3$, $\\mathbf{u}\\cdot\\mathbf{u}=2$ $\\Rightarrow$ coeff. $3/2$, projection $(3/2,3/2)$. **(2) Answer: ①**","projection_1":"**(1) Compute:** Dot $4$, $\\|\\mathbf{u}\\|^2=5$ $\\Rightarrow$ projection $\\dfrac{4}{5}(2,1)$, $x=8/5$. **(2) Answer: ①**","projection_2":"**(1) Concept:** Project to $x$-axis: $(6,0)$, norm $6$. **(2) Answer: ①**","projection_3":"**(1) Concept:** If $\\|\\mathbf{\\hat{u}}\\|=1$, the scalar coefficient is $\\mathbf{v}\\cdot\\mathbf{\\hat{u}}$. **(2) Answer: ①**","projection_4":"**(1) Compute:** $\\mathbf{a}\\cdot\\mathbf{b}=1$, $\\mathbf{a}\\cdot\\mathbf{a}=3$ $\\Rightarrow$ $\\dfrac{1}{3}(1,1,1)$, sum $=1$. **(2) Answer: ①**","projection_5":"**(1) Concept:** Residual is orthogonal to $\\mathbf{u}$. **(2) Answer: ①**","projection_6":"**(1) Compute:** $\\mathbf{\\hat{u}}=(4/5,3/5)$, dot $=4/5$, length $=|\\mathbf{v}\\cdot\\mathbf{\\hat{u}}|=4/5$. **(2) Answer: ①**","projection_7":"**(1) Bridge:** $\\cos$ from dot products, $\\sin$ from cross/area. **(2) Answer: ①**","projection_8":"**(1) Concept:** With $\\mathbf{p}\\perp\\mathbf{r}$, Pythagoras gives $\\|\\mathbf{v}\\|^2=\\|\\mathbf{p}\\|^2+\\|\\mathbf{r}\\|^2$. **(2) Answer: ①**","projection_9":"**(1) Practice:** $A\\mathbf{x}$ is stacked row dot products (preview Ch.03). 
**(2) Answer: ①**","scenario_0":"**(1) Practice:** Embeddings give approximate similarity only. **(2) Answer: ①**","scenario_1":"**(1) Practice:** Dot products need a common $\\mathbb{R}^n$. **(2) Answer: ①**","scenario_2":"**(1) Bridge:** Gradient descent is the optimization story (next topics). **(2) Answer: ①**","scenario_3":"**(1) Practice:** Matrix-factorization-style models use dot scores. **(2) Answer: ①**","scenario_4":"**(1) Practice:** Scale dot products to keep softmax stable. **(2) Answer: ①**","scenario_5":"**(1) Bridge:** Vectors $\\rightarrow$ kernels via inner products. **(2) Answer: ①**","scenario_6":"**(1) Practice:** Cosine is common when magnitude is arbitrary. **(2) Answer: ①**","scenario_7":"**(1) Practice:** PCA projects onto a subspace. **(2) Answer: ①**","scenario_8":"**(1) Practice:** Math tools assume training setup. **(2) Answer: ①**","scenario_9":"**(1) Preview:** Rows $\\cdot\\,\\mathbf{x}$ build deep linear layers. **(2) Answer: ①**"},"problemTestCodes":{"definition_0":"answer = 1\nassert answer == 1","definition_1":"answer = 1\nassert answer == 1","definition_2":"answer = 1\nassert answer == 1","definition_3":"answer = 1\nassert answer == 1","definition_4":"answer = 1\nassert answer == 1","definition_5":"answer = 1\nassert answer == 1","definition_6":"answer = 1\nassert answer == 1","definition_7":"answer = 1\nassert answer == 1","definition_8":"answer = 1\nassert answer == 1","definition_9":"answer = 1\nassert answer == 1","trueFalse_0":"answer = 2\nassert answer == 2","trueFalse_1":"answer = 1\nassert answer == 1","trueFalse_2":"answer = 1\nassert answer == 1","trueFalse_3":"answer = 1\nassert answer == 1","trueFalse_4":"answer = 2\nassert answer == 2","trueFalse_5":"answer = 2\nassert answer == 2","trueFalse_6":"answer = 1\nassert answer == 1","trueFalse_7":"answer = 1\nassert answer == 1","trueFalse_8":"answer = 1\nassert answer == 1","trueFalse_9":"answer = 1\nassert answer == 1","calc_0":"answer = 1\nassert answer == 
1","calc_1":"answer = 1\nassert answer == 1","calc_2":"answer = 1\nassert answer == 1","calc_3":"answer = 1\nassert answer == 1","calc_4":"answer = 1\nassert answer == 1","calc_5":"answer = 1\nassert answer == 1","calc_6":"answer = 1\nassert answer == 1","calc_7":"answer = 1\nassert answer == 1","calc_8":"answer = 1\nassert answer == 1","calc_9":"answer = 1\nassert answer == 1","concept_0":"answer = 1\nassert answer == 1","concept_1":"answer = 1\nassert answer == 1","concept_2":"answer = 1\nassert answer == 1","concept_3":"answer = 1\nassert answer == 1","concept_4":"answer = 1\nassert answer == 1","concept_5":"answer = 1\nassert answer == 1","concept_6":"answer = 1\nassert answer == 1","concept_7":"answer = 1\nassert answer == 1","concept_8":"answer = 1\nassert answer == 1","concept_9":"answer = 1\nassert answer == 1","projection_0":"answer = 1\nassert answer == 1","projection_1":"answer = 1\nassert answer == 1","projection_2":"answer = 1\nassert answer == 1","projection_3":"answer = 1\nassert answer == 1","projection_4":"answer = 1\nassert answer == 1","projection_5":"answer = 1\nassert answer == 1","projection_6":"answer = 1\nassert answer == 1","projection_7":"answer = 1\nassert answer == 1","projection_8":"answer = 1\nassert answer == 1","projection_9":"answer = 1\nassert answer == 1","scenario_0":"answer = 1\nassert answer == 1","scenario_1":"answer = 1\nassert answer == 1","scenario_2":"answer = 1\nassert answer == 1","scenario_3":"answer = 1\nassert answer == 1","scenario_4":"answer = 1\nassert answer == 1","scenario_5":"answer = 1\nassert answer == 1","scenario_6":"answer = 1\nassert answer == 1","scenario_7":"answer = 1\nassert answer == 1","scenario_8":"answer = 1\nassert answer == 1","scenario_9":"answer = 1\nassert answer == 1"}},"midMathCh03":{"chapter":"Chapter 03","title":"Matrices and data batches: putting many vectors on one sheet","description":"A **matrix** is a rectangular grid of numbers—**one sheet**. 
In machine learning, one **row** is often one **sample** (one person, one image) and one **column** is one **feature**. This chapter connects vectors (Ch.01) and dot products (Ch.02) to how they appear **many times at once** inside a matrix, and sets up **matrix multiplication and linear layers (Ch.04)**.","sectionTitle":"Matrices and data batches: putting many vectors on one sheet","sectionLabels":{"whatIs":"What the idea is","whyImportant":"Why it matters","howUsed":"How it is used","problemSolving":"Notes for solving problems"},"visualShort":"Matrix · rows/columns · transpose · data matrix","visualIntro":"An $m\\times n$ matrix $A$ has **$m$ rows** and **$n$ columns**. Side‑by‑side **columns** are “many vectors on one sheet”; each **row** is one equation line or one sample record. The **transpose** $A^{\\mathsf T}$ swaps rows and columns.","visualStep1":"Concept: $A\\in\\mathbb{R}^{m\\times n}$, entries $a_{ij}$","visualStep2":"Intuition: columns = stacked vectors / rows = one sample line","visualStep3":"Ops: sum, scalar multiply, transpose (product next chapter)","visualStep4":"Use: design matrix, mini‑batch, weight tables","visualStepsLabel":"Suggested viewing order","visualFlowTitle":"Learning flow","visualFlowStep0":"Concept: matrix as a grid","visualFlowStep1":"Intuition: column vs row reading","visualFlowStep2":"Math: match dimensions · transpose","visualFlowStep3":"Link: row dot products and $A\\mathbf{u}$","visualFlowStep4":"Apply: data matrix · batch tensor","dotVisualAriaLabel":"Matrix with column highlight: animated emphasis and dimension panel","dotVisualMainTitle":"What changes when the column changes","dotVisualPlotTitle":"Grid: highlighting a column in $3\\times 3$","dotVisualMetricsTitle":"Shape · highlight · summary","dotVisualHudDot":"Rows $m$","dotVisualHudCos":"Cols $n$","dotVisualHudProj":"Highlighted column","dotVisualLegendU":"Grid","dotVisualLegendV":"Highlighted 
column","dotVisualLegendProj":"Axes","dotVisualLegendRes":"Labels","dotVisualInsetLabel":"Column index","dotVisualCaption":"**Purple columns** highlight in turn. Each column is a **vector of the same length**; placing three columns next to each other forms **one matrix**. The right panel shows **$m\\times n$** and **which column** is active. Reading by **rows** gives **one line per sample** (a common data convention).","whatIs":{"intro":"If a **vector** is numbers in a line, a **matrix** stacks several such lines into a rectangle. Size $m\\times n$ means **$m$ rows** meet **$n$ columns**. Notation varies (sometimes rows are samples, sometimes columns)—**always check the shape first**.","plain":"Think of a matrix as **one spreadsheet**: each cell is a number; a **whole column** can be one **feature vector**; a **whole row** can be one **record**. The same table changes meaning depending on **which direction you read**.","definition":"Core facts:\n\n1. **Shape**: $m\\times n$ means $m$ rows and $n$ columns of real entries.\n2. **Entries**: the value at row $i$, column $j$ is $a_{ij}$.\n3. **Transpose**: $A^{\\mathsf T}$ is $n\\times m$ with $(A^{\\mathsf T})_{ji}=a_{ij}$.\n4. **Columns as vectors**: columns $\\mathbf{a}_j\\in\\mathbb{R}^m$ can be written $A=[\\mathbf{a}_1\\ \\cdots\\ \\mathbf{a}_n]$.\n5. **Add / scale**: for the same shape, $(A+B)_{ij}=a_{ij}+b_{ij}$ and $(cA)_{ij}=c\\,a_{ij}$.\n\nThis chapter focuses on **reading stacked vectors safely** and **matching dimensions** before full **matrix multiplication**.","inAI":"In **deep learning**, weights are often **matrices** (or 2D slices of tensors). One layer’s linear map is “many dot products at once”; **batching** stacks samples along a row/column. In **machine learning**, the **design matrix** stacks feature vectors into one data table."},"whyImportant":{"bridge":"Ch.01 gave vectors; Ch.02 gave dot products for one interaction. Ch.03 extends that interaction to **whole tables**. 
Matrices are the language of **losses, gradients, and weight updates**.","similarity":"Real data is usually **many samples × many features**. Stating the **shape $m\\times n$** makes the layout explicit; wrong shapes silently break code."},"howUsed":{"ml":"Training data is often a **design matrix**; linear models are written as matrix–vector products. Logistic/softmax, linear SVM, and matrix‑factorization recommendations all use **batched vector operations**.","geometry":"Columns **span** a subspace (column space); fitting data to a lower dimension is **projection** to a subspace (later chapters)."},"summary":"**One‑line summary:** a matrix **bundles many vectors on one sheet**; **whether rows or columns are samples** follows convention. The **transpose** swaps axes to **match dimensions**. Row dot products from Ch.02 become the coordinates of $A\\mathbf{u}$. Next: matrix multiplication and linear maps.","problemSolving":{"focus":"The table below lists **symbols and dimension rules** for problem solving. 
**Worked patterns** illustrate typical steps.","examplesHeading":"Worked examples","examplesTable":"**Example 1 — counting entries**\n\nProblem: If $A$ is $4\\times 7$, how many entries?\n\nSolution: $4\\times 7=28$.\n\n→ pick the option matching **$28$**.\n\n---\n\n**Example 2 — transpose shape**\n\nProblem: If $A$ is $3\\times 5$, what is the shape of $A^{\\mathsf T}$?\n\nSolution: $5\\times 3$.\n\n---\n\n**Example 3 — addition**\n\nProblem: If $A,B$ are both $2\\times 2$, what is $(A+B)_{11}$?\n\nSolution: $a_{11}+b_{11}$.\n\n---\n\n**Example 4 — column vectors**\n\nProblem: If $A=[\\mathbf{a}_1\\ \\mathbf{a}_2]$ and $\\mathbf{a}_1\\in\\mathbb{R}^m$, how many rows does $A$ have?\n\nSolution: each column has length $m$, so **$m$ rows**.\n\n---\n\n**Example 5 — link to Ch.02**\n\nProblem: What is the $i$‑th component of $A\\mathbf{u}$?\n\nSolution: the **dot product** of the **$i$‑th row** of $A$ with $\\mathbf{u}$."},"problemSolvingLabel":"Notes for solving problems","problemSolvingTable":"| Symbol | Meaning |\n| :--- | :--- |\n| $m\\times n$ | $m$ rows and $n$ columns |\n| $a_{ij}$ | entry at row $i$, column $j$ |\n| $A^{\\mathsf T}$ | transpose: $(A^{\\mathsf T})_{ji}=a_{ij}$ |\n| column $\\mathbf{a}_j$ | column $j$ of $A$ as a vector |\n| same shape | $A+B$ only if dimensions match |\n| $A\\mathbf{u}$ (preview) | vector of row–$\\mathbf{u}$ dot products |\n\n**Details**\n\n**① Shape** Always check dimensions before add/multiply.\n\n**② Transpose** Swaps sample/feature axes when needed.\n\n**③ Row/column view** Meaning depends on the problem setup.\n\n**④ Ch.02 link** Each row dotted with $\\mathbf{u}$ gives one coordinate of $A\\mathbf{u}$.","practiceProblemsTitle":"Practice problems","practiceProblemsIntro":"Below are **10 problems** sampled from a bank of **60** (4 easy · 3 medium · 3 hard; order easy→medium→hard). 
Each item is **multiple choice**: pick the option number.","practiceProblemsInstruction":"Read the question and choose the best option.","problems":{"definition_0":"How many **entries** does an $m\\times n$ matrix have?\n\n① $m+n$\n② $m\\times n$\n③ $\\max(m,n)$\n④ $m-n$","definition_1":"The usual notation for the $(i,j)$ entry of matrix $A$ is?\n\n① $a_{ij}$\n② only $a_{ji}$\n③ $A_i$\n④ $\\det(A)$","definition_2":"If $A$ is $m\\times n$, the length (dimension) of each **column vector** is?\n\n① $m$\n② $n$\n③ $m+n$\n④ $mn$","definition_3":"If $A$ is $m\\times n$, the shape of $A^{\\mathsf T}$ is?\n\n① $n\\times m$\n② $m\\times n$\n③ $m\\times m$\n④ $n\\times n$","definition_4":"A **square matrix** means?\n\n① same number of rows and columns\n② all entries are 1\n③ always invertible\n④ always zero","definition_5":"Which property matches a **zero matrix**?\n\n① every entry is 0\n② only diagonal entries are 0\n③ determinant is always 1\n④ transpose is undefined","definition_6":"The size of an identity matrix $I_n$ is?\n\n① $n\\times n$\n② $n\\times 1$\n③ $1\\times n$\n④ $2n\\times 2n$","definition_7":"$\\mathbb{R}^{m\\times n}$ denotes?\n\n① all real $m\\times n$ matrices\n② an $(m+n)$‑dimensional vector space only\n③ the set of determinants\n④ square matrices only","definition_8":"If $A=[\\mathbf{a}_1\\ \\cdots\\ \\mathbf{a}_n]$ with $\\mathbf{a}_j\\in\\mathbb{R}^m$, the shape of $A$ is?\n\n① $m\\times n$\n② $n\\times m$\n③ $m\\times 1$\n④ $1\\times n$","definition_9":"A **row vector** of shape $1\\times n$ has how many entries?\n\n① $n$\n② $1$\n③ $n+1$\n④ $0$","trueFalse_0":"If the statement is **true** choose ①, if **false** choose ②.\n\nMatrix addition $A+B$ is defined only when $A$ and $B$ have the same shape.\n\n① True\n② False","trueFalse_1":"If the statement is **true** choose ①, if **false** choose ②.\n\n$(A^{\\mathsf T})^{\\mathsf T}=A$.\n\n① True\n② False","trueFalse_2":"If the statement is **true** choose ①, if **false** choose ②.\n\nA $2\\times 3$ matrix 
and a $3\\times 2$ matrix can have the same number of entries.\n\n① True\n② False","trueFalse_3":"If the statement is **true** choose ①, if **false** choose ②.\n\nEvery square matrix is invertible.\n\n① True\n② False","trueFalse_4":"If the statement is **true** choose ①, if **false** choose ②.\n\nIf $A$ is $m\\times n$, then $A^{\\mathsf T}$ is $n\\times m$.\n\n① True\n② False","trueFalse_5":"If the statement is **true** choose ①, if **false** choose ②.\n\nA common data convention is “one row = one sample”.\n\n① True\n② False","trueFalse_6":"If the statement is **true** choose ①, if **false** choose ②.\n\n$A+B=B+A$ whenever addition is defined.\n\n① True\n② False","trueFalse_7":"If the statement is **true** choose ①, if **false** choose ②.\n\n$(cA)^{\\mathsf T}=cA^{\\mathsf T}$.\n\n① True\n② False","trueFalse_8":"If the statement is **true** choose ①, if **false** choose ②.\n\nFor $I_n A=A$ to hold, $A$ must be $n\\times n$.\n\n① True\n② False","trueFalse_9":"If the statement is **true** choose ①, if **false** choose ②.\n\nDot products from Ch.02 connect to one row of a matrix–vector product.\n\n① True\n② False","calc_0":"For $A=\\begin{pmatrix}1&2\\\\3&4\\end{pmatrix}$, $\\mathrm{tr}(A)=a_{11}+a_{22}$ equals?\n\n① $5$\n② $4$\n③ $6$\n④ $7$","calc_1":"Let $A=\\begin{pmatrix}1&0\\\\2&-1\\end{pmatrix}$, $B=\\begin{pmatrix}0&1\\\\1&1\\end{pmatrix}$. What is $(A+B)_{12}$?\n\n① $1$\n② $0$\n③ $2$\n④ $-1$","calc_2":"Let $A=\\begin{pmatrix}2&-1\\end{pmatrix}$ and $c=3$. What is $(cA)_{11}$?\n\n① $6$\n② $2$\n③ $-3$\n④ $9$","calc_3":"If $A$ is $2\\times 3$, how many entries does $A^{\\mathsf T}$ have?\n\n① $5$\n② $6$\n③ $8$\n④ $9$","calc_4":"For $A=\\begin{pmatrix}1&2\\\\3&4\\end{pmatrix}$, the $(2,1)$ entry of $A^{\\mathsf T}$ is?\n\n① $2$\n② $3$\n③ $4$\n④ $1$","calc_5":"Let $A=\\begin{pmatrix}0&1\\\\2&3\\end{pmatrix}$, $B=\\begin{pmatrix}1&-1\\\\0&2\\end{pmatrix}$. 
What is $(A+B)_{21}$?\n\n① $2$\n② $3$\n③ $1$\n④ $0$","calc_6":"$A=\\begin{pmatrix}1&2&3\\end{pmatrix}$ is $1\\times 3$. The shape of $A^{\\mathsf T}$ is?\n\n① $3\\times 1$\n② $1\\times 3$\n③ $3\\times 3$\n④ $1\\times 1$","calc_7":"For $A=\\begin{pmatrix}5\\end{pmatrix}$, what is the shape of $A^{\\mathsf T}$ (ignoring determinant)?\n\n① $1\\times 1$\n② $0\\times 0$\n③ $1\\times 0$\n④ undefined","calc_8":"What is the shape of $A=\\begin{pmatrix}1&2\\\\3&4\\\\5&6\\end{pmatrix}$?\n\n① $3\\times 2$\n② $2\\times 3$\n③ $6\\times 1$\n④ $1\\times 6$","calc_9":"Let the first column of $\\begin{pmatrix}1&2\\\\3&4\\end{pmatrix}$ be $\\mathbf{a}_1$. The **second component** of $\\mathbf{a}_1$ is?\n\n① $3$\n② $1$\n③ $2$\n④ $4$","concept_0":"In linear regression, a common convention (e.g. scikit‑learn) with **samples as rows** means?\n\n① each row is one observation (sample)\n② each column is one observation\n③ only $1\\times n$ is allowed\n④ matrices are never used","concept_1":"In deep learning, a common **2D batch** layout is described as?\n\n① often something like (batch size)$\\times$(feature dim)\n② scalars only\n③ batch size is always 0\n④ matrices are never used","concept_2":"Connecting to Ch.02, the $i$‑th coordinate of $A\\mathbf{u}$ is?\n\n① dot product of row $i$ of $A$ with $\\mathbf{u}$\n② always dot with column $i$ only\n③ always 0\n④ the trace","concept_3":"Reading a matrix as **a bundle of column vectors** fits when?\n\n① each column is the same kind of feature vector\n② only when columns are samples\n③ only when rows are features\n④ transpose is impossible","concept_4":"Why **flatten** an image into a vector before a linear layer?\n\n① to match the vector input size the FC layer expects\n② images always have one pixel\n③ matrices are forbidden\n④ because of softmax only","concept_5":"Standardizing **per column** in tabular data usually means?\n\n① scale within the same feature (same column)\n② only within rows\n③ always add a constant\n④ change matrix 
size","concept_6":"In collaborative filtering, a user–item **rating matrix** often implies?\n\n① rows are users, columns are items (or vice versa) by convention\n② always $1\\times 1$\n③ always zero\n④ unrelated to dot products","concept_7":"Intuitively, **rank** relates to (details later)?\n\n① how many independent column/row directions\n② always equals determinant\n③ always 0\n④ always increases under transpose","concept_8":"Why is **broadcasting** easy to misuse with matrices?\n\n① adding without checking shapes can silently be wrong\n② shape checks are never needed\n③ matrices are always $1\\times 1$\n④ transpose is always identity","concept_9":"For matrix multiplication $AB$ (Ch.04 preview), a necessary match is?\n\n① #columns of $A$ equals #rows of $B$\n② $A$ and $B$ must be square\n③ always $AB=BA$\n④ product is always a scalar","projection_0":"Let $A\\in\\mathbb{R}^{m\\times n}$ and $\\mathbf{u}\\in\\mathbb{R}^n$. The vector $A\\mathbf{u}$ lives in $\\mathbb{R}^k$ for which $k$?\n\n① $m$\n② $n$\n③ $m+n$\n④ $mn$","projection_1":"If row $i$ of $A$ is $\\mathbf{r}_i^{\\mathsf T}$, then $(A\\mathbf{u})_i$ equals?\n\n① $\\mathbf{r}_i\\cdot\\mathbf{u}$\n② $\\mathbf{r}_i+\\mathbf{u}$\n③ $\\|\\mathbf{r}_i\\|$\n④ $\\det(A)$","projection_2":"If $A\\mathbf{u}=\\mathbf{0}$ for **some nonzero** $\\mathbf{u}$, the columns of $A$ must be?\n\n① linearly dependent\n② always $A=I$\n③ always invertible\n④ all column norms are 1","projection_3":"The rank intuition of $\\mathbf{u}\\mathbf{v}^{\\mathsf T}$ (outer product form) is?\n\n① at most 1 for nonzero vectors\n② always $n$\n③ always 0\n④ always invertible","projection_4":"The column space $\\mathrm{Col}(A)$ is best described as?\n\n① all linear combinations of the columns of $A$\n② always the whole space\n③ only $\\{\\mathbf{0}\\}$\n④ the set of determinants","projection_5":"If $A\\mathbf{x}=\\mathbf{b}$ has a solution, $\\mathbf{b}$ must lie in?\n\n① $\\mathrm{Col}(A)$\n② the unit ball only\n③ only the zero 
vector\n④ $\\mathbb{R}$","projection_6":"Viewing $A$ by rows, each row vector lies in which space (length viewpoint)?\n\n① $\\mathbb{R}^n$\n② $\\mathbb{R}^m$\n③ $\\mathbb{R}^{mn}$\n④ $\\mathbb{R}$","projection_7":"For $A\\in\\mathbb{R}^{m\\times n}$ and standard basis $\\mathbf{e}_j\\in\\mathbb{R}^n$, $A\\mathbf{e}_j$ equals?\n\n① column $j$ of $A$\n② row $j$ of $A$\n③ always 0\n④ only the $(j,j)$ entry","projection_8":"If data matrix $X$ has **samples as rows**, what does $X^{\\mathsf T}$ swap?\n\n① sample axis and feature axis\n② nothing\n③ always becomes square\n④ always becomes zero","projection_9":"From a linear map viewpoint, $A\\mathbf{u}$ is?\n\n① the image of $\\mathbf{u}$ under a map $\\mathbb{R}^n\\to\\mathbb{R}^m$\n② always length‑preserving\n③ always a rotation only\n④ always a probability vector","scenario_0":"In scikit‑learn, feature matrix **X** with **samples as rows** usually has shape?\n\n① (n_samples)$\\times$(n_features)\n② only (n_features)$\\times$(n_samples)\n③ always $1\\times 1$\n④ (n_classes)$\\times$(batch)","scenario_1":"A 2D tensor with batch 32 and feature dim 128 is often read as a matrix of shape?\n\n① $32\\times 128$\n② only $128\\times 32$\n③ $32\\times 32$\n④ $128\\times 128$","scenario_2":"Why **flatten** after convolutions before a fully connected layer?\n\n① FC expects a vector input\n② only because of softmax\n③ images are always 1D\n④ to disable backprop","scenario_3":"Filling missing values with **column means** averages along?\n\n① the same column (same feature)\n② only rows\n③ only the diagonal\n④ one global scalar","scenario_4":"In collaborative filtering, a very **sparse** ratings matrix $R$ means?\n\n① most entries are unobserved\n② all entries are 1\n③ always invertible\n④ matrices are not used","scenario_5":"Stacking **sentence embeddings as rows** suggests?\n\n① each row is one sentence (or a pooled vector)\n② columns are always sentences\n③ always $1\\times 1$\n④ softmax only","scenario_6":"On GPU, performance 
often relates to?\n\n① memory layout/stride and tensor shape\n② matrices are always scalars\n③ transpose is always free\n④ rank is always 0","scenario_7":"Which claim is **easy to overstate** using only Ch.03?\n\n① “Matrices instantly mean deep learning is always best”\n② data is often tabular\n③ matching shapes matters\n④ transpose swaps axes","scenario_8":"Flattening an $H\\times W$ grayscale image gives a vector of length?\n\n① $H\\times W$\n② $H+W$\n③ $\\max(H,W)$\n④ $1$","scenario_9":"Preview of Ch.04: in $\\mathbf{y}=W\\mathbf{x}+\\mathbf{b}$, $W$ represents?\n\n① a linear map mixing features\n② always one scalar multiply\n③ always softmax\n④ always the loss"},"problemAnswers":{"definition_0":2,"definition_1":1,"definition_2":1,"definition_3":1,"definition_4":1,"definition_5":1,"definition_6":1,"definition_7":1,"definition_8":1,"definition_9":1,"trueFalse_0":1,"trueFalse_1":1,"trueFalse_2":1,"trueFalse_3":2,"trueFalse_4":1,"trueFalse_5":1,"trueFalse_6":1,"trueFalse_7":1,"trueFalse_8":2,"trueFalse_9":1,"calc_0":1,"calc_1":1,"calc_2":1,"calc_3":2,"calc_4":1,"calc_5":1,"calc_6":1,"calc_7":1,"calc_8":1,"calc_9":1,"concept_0":1,"concept_1":1,"concept_2":1,"concept_3":1,"concept_4":1,"concept_5":1,"concept_6":1,"concept_7":1,"concept_8":1,"concept_9":1,"projection_0":1,"projection_1":1,"projection_2":1,"projection_3":1,"projection_4":1,"projection_5":1,"projection_6":1,"projection_7":1,"projection_8":1,"projection_9":1,"scenario_0":1,"scenario_1":1,"scenario_2":1,"scenario_3":1,"scenario_4":1,"scenario_5":1,"scenario_6":1,"scenario_7":1,"scenario_8":1,"scenario_9":1},"problemSolutions":{"definition_0":"**1) Steps:** an $m\\times n$ grid has $m$ rows of $n$ entries each, so the count is the product $mn$. **2) Example:** a $2\\times 3$ matrix has $6$ entries. **3) Answer ②**","definition_1":"**1) Steps:** the entry in row $i$, column $j$ of $A$ is written $a_{ij}$, with the row index first. 
**2) Example:** in $\\begin{pmatrix}1&2\\\\3&4\\end{pmatrix}$, $a_{12}=2$ and $a_{21}=3$. **3) Answer ①**","definition_2":"**1) Steps:** a column runs down through all $m$ rows, so each column vector has $m$ components. **2) Example:** each column of a $3\\times 2$ matrix lies in $\\mathbb{R}^3$. **3) Answer ①**","definition_3":"**1) Steps:** the transpose swaps rows and columns, so $m\\times n$ becomes $n\\times m$. **2) Example:** a $2\\times 3$ matrix transposes to $3\\times 2$. **3) Answer ①**","definition_4":"**1) Steps:** square means the number of rows equals the number of columns; it says nothing about the entries or invertibility. **2) Example:** every $2\\times 2$ matrix is square, including the zero matrix. **3) Answer ①**","definition_5":"**1) Steps:** a zero matrix has every entry equal to $0$. **2) Example:** $\\begin{pmatrix}0&0\\\\0&0\\end{pmatrix}$ is the $2\\times 2$ zero matrix. **3) Answer ①**","definition_6":"**1) Steps:** $I_n$ is square with $n$ rows and $n$ columns, ones on the diagonal and zeros elsewhere. **2) Example:** $I_2=\\begin{pmatrix}1&0\\\\0&1\\end{pmatrix}$ is $2\\times 2$. **3) Answer ①**","definition_7":"**1) Steps:** $\\mathbb{R}^{m\\times n}$ is the set of all $m\\times n$ matrices with real entries. **2) Example:** $\\begin{pmatrix}1&2&3\\end{pmatrix}\\in\\mathbb{R}^{1\\times 3}$. **3) Answer ①**","definition_8":"**1) Steps:** each column has length $m$ (the row count) and there are $n$ columns, so $A$ is $m\\times n$. **2) Example:** two vectors in $\\mathbb{R}^3$ placed side by side form a $3\\times 2$ matrix. **3) Answer ①**","definition_9":"**1) Steps:** a $1\\times n$ matrix has $1\\cdot n=n$ entries. **2) Example:** $\\begin{pmatrix}1&2&3\\end{pmatrix}$ is $1\\times 3$ and has $3$ entries. **3) Answer ①**","trueFalse_0":"**1) Steps:** addition is entrywise, so every entry of $A$ needs a partner in $B$; the shapes must match. **2) Example:** a $2\\times 2$ matrix plus a $2\\times 3$ matrix is undefined. 
**3) Answer ①**","trueFalse_1":"**1) Steps:** transposing twice swaps the indices back: $((A^{\\mathsf T})^{\\mathsf T})_{ij}=(A^{\\mathsf T})_{ji}=a_{ij}$. **2) Example:** a $2\\times 3$ matrix becomes $3\\times 2$, then $2\\times 3$ again with the same entries. **3) Answer ①**","trueFalse_2":"**1) Steps:** the entry count is rows times columns, and $2\\cdot 3=3\\cdot 2=6$. **2) Example:** the shapes differ but both hold $6$ numbers. **3) Answer ①**","trueFalse_3":"**1) Steps:** being square is necessary for an inverse but not sufficient. **2) Example:** the $2\\times 2$ zero matrix is square and has no inverse. **3) Answer ②**","trueFalse_4":"**1) Steps:** this is the definition of the transpose: rows become columns. **2) Example:** a $2\\times 3$ matrix transposes to $3\\times 2$. **3) Answer ①**","trueFalse_5":"**1) Steps:** many tools store a data table with one sample per row and one feature per column. **2) Example:** scikit‑learn expects $X$ of shape (n_samples)$\\times$(n_features). **3) Answer ①**","trueFalse_6":"**1) Steps:** addition is entrywise and real numbers commute: $a_{ij}+b_{ij}=b_{ij}+a_{ij}$. **2) Example:** swapping two $2\\times 2$ summands changes no entry of the sum. **3) Answer ①**","trueFalse_7":"**1) Steps:** compare entries: $((cA)^{\\mathsf T})_{ji}=c\\,a_{ij}=(cA^{\\mathsf T})_{ji}$. **2) Example:** with $c=3$ and $A=\\begin{pmatrix}2&-1\\end{pmatrix}$, both sides equal the column with entries $6$ and $-3$. **3) Answer ①**","trueFalse_8":"**1) Steps:** $I_nA$ only needs $A$ to have $n$ rows; the number of columns is free. **2) Example:** $I_2A=A$ for any $2\\times 3$ matrix $A$, which is not square. **3) Answer ②**","trueFalse_9":"**1) Steps:** the $i$‑th component of $A\\mathbf{u}$ is the dot product of row $i$ of $A$ with $\\mathbf{u}$. **2) Example:** row $(1,2)$ dotted with $\\mathbf{u}=(1,1)$ gives $3$. **3) Answer ①**","calc_0":"**1) Steps:** add the diagonal entries $a_{11}$ and $a_{22}$. 
**2) Example:** $1+4=5$. **3) Answer ①**","calc_1":"**1) Steps:** add the entries in row 1, column 2 of each matrix. **2) Example:** $a_{12}+b_{12}=0+1=1$. **3) Answer ①**","calc_2":"**1) Steps:** scalar multiplication scales every entry, so $(cA)_{11}=c\\,a_{11}$. **2) Example:** $3\\cdot 2=6$. **3) Answer ①**","calc_3":"**1) Steps:** $A^{\\mathsf T}$ is $3\\times 2$; the transpose rearranges entries without adding or removing any. **2) Example:** $3\\cdot 2=6$ entries. **3) Answer ②**","calc_4":"**1) Steps:** use $(A^{\\mathsf T})_{ji}=a_{ij}$, so $(A^{\\mathsf T})_{21}=a_{12}$. **2) Example:** $a_{12}=2$. **3) Answer ①**","calc_5":"**1) Steps:** add the entries in row 2, column 1 of each matrix. **2) Example:** $a_{21}+b_{21}=2+0=2$. **3) Answer ①**","calc_6":"**1) Steps:** the transpose swaps the two dimensions, so $1\\times 3$ becomes $3\\times 1$. **2) Example:** the row $\\begin{pmatrix}1&2&3\\end{pmatrix}$ becomes a column of length $3$. **3) Answer ①**","calc_7":"**1) Steps:** swapping the dimensions of a $1\\times 1$ matrix leaves $1\\times 1$. **2) Example:** $\\begin{pmatrix}5\\end{pmatrix}^{\\mathsf T}=\\begin{pmatrix}5\\end{pmatrix}$. **3) Answer ①**","calc_8":"**1) Steps:** count rows first, then columns. **2) Example:** three rows and two columns give $3\\times 2$. **3) Answer ①**","calc_9":"**1) Steps:** read the first column from top to bottom: $\\mathbf{a}_1=(1,3)$. **2) Example:** its second component is $3$. **3) Answer ①**","concept_0":"**1) Steps:** identify whether the question is definition, computation, or application. 
**2) Example:** check a tiny $2\\times 2$ numeric example to avoid shape mistakes. **3) Answer ①**","concept_1":"**1) Steps:** identify whether the question is definition, computation, or application. **2) Example:** check a tiny $2\\times 2$ numeric example to avoid shape mistakes. **3) Answer ①**","concept_2":"**1) Steps:** identify whether the question is definition, computation, or application. **2) Example:** check a tiny $2\\times 2$ numeric example to avoid shape mistakes. **3) Answer ①**","concept_3":"**1) Steps:** identify whether the question is definition, computation, or application. **2) Example:** check a tiny $2\\times 2$ numeric example to avoid shape mistakes. **3) Answer ①**","concept_4":"**1) Steps:** identify whether the question is definition, computation, or application. **2) Example:** check a tiny $2\\times 2$ numeric example to avoid shape mistakes. **3) Answer ①**","concept_5":"**1) Steps:** identify whether the question is definition, computation, or application. **2) Example:** check a tiny $2\\times 2$ numeric example to avoid shape mistakes. **3) Answer ①**","concept_6":"**1) Steps:** identify whether the question is definition, computation, or application. **2) Example:** check a tiny $2\\times 2$ numeric example to avoid shape mistakes. **3) Answer ①**","concept_7":"**1) Steps:** identify whether the question is definition, computation, or application. **2) Example:** check a tiny $2\\times 2$ numeric example to avoid shape mistakes. **3) Answer ①**","concept_8":"**1) Steps:** identify whether the question is definition, computation, or application. **2) Example:** check a tiny $2\\times 2$ numeric example to avoid shape mistakes. **3) Answer ①**","concept_9":"**1) Steps:** identify whether the question is definition, computation, or application. **2) Example:** check a tiny $2\\times 2$ numeric example to avoid shape mistakes. 
**3) Answer ①**","projection_0":"**1) Steps:** identify whether the question is definition, computation, or application. **2) Example:** check a tiny $2\\times 2$ numeric example to avoid shape mistakes. **3) Answer ①**","projection_1":"**1) Steps:** identify whether the question is definition, computation, or application. **2) Example:** check a tiny $2\\times 2$ numeric example to avoid shape mistakes. **3) Answer ①**","projection_2":"**1) Steps:** identify whether the question is definition, computation, or application. **2) Example:** check a tiny $2\\times 2$ numeric example to avoid shape mistakes. **3) Answer ①**","projection_3":"**1) Steps:** identify whether the question is definition, computation, or application. **2) Example:** check a tiny $2\\times 2$ numeric example to avoid shape mistakes. **3) Answer ①**","projection_4":"**1) Steps:** identify whether the question is definition, computation, or application. **2) Example:** check a tiny $2\\times 2$ numeric example to avoid shape mistakes. **3) Answer ①**","projection_5":"**1) Steps:** identify whether the question is definition, computation, or application. **2) Example:** check a tiny $2\\times 2$ numeric example to avoid shape mistakes. **3) Answer ①**","projection_6":"**1) Steps:** identify whether the question is definition, computation, or application. **2) Example:** check a tiny $2\\times 2$ numeric example to avoid shape mistakes. **3) Answer ①**","projection_7":"**1) Steps:** identify whether the question is definition, computation, or application. **2) Example:** check a tiny $2\\times 2$ numeric example to avoid shape mistakes. **3) Answer ①**","projection_8":"**1) Steps:** identify whether the question is definition, computation, or application. **2) Example:** check a tiny $2\\times 2$ numeric example to avoid shape mistakes. **3) Answer ①**","projection_9":"**1) Steps:** identify whether the question is definition, computation, or application. 
**2) Example:** check a tiny $2\\times 2$ numeric example to avoid shape mistakes. **3) Answer ①**","scenario_0":"**1) Steps:** identify whether the question is definition, computation, or application. **2) Example:** check a tiny $2\\times 2$ numeric example to avoid shape mistakes. **3) Answer ①**","scenario_1":"**1) Steps:** identify whether the question is definition, computation, or application. **2) Example:** check a tiny $2\\times 2$ numeric example to avoid shape mistakes. **3) Answer ①**","scenario_2":"**1) Steps:** identify whether the question is definition, computation, or application. **2) Example:** check a tiny $2\\times 2$ numeric example to avoid shape mistakes. **3) Answer ①**","scenario_3":"**1) Steps:** identify whether the question is definition, computation, or application. **2) Example:** check a tiny $2\\times 2$ numeric example to avoid shape mistakes. **3) Answer ①**","scenario_4":"**1) Steps:** identify whether the question is definition, computation, or application. **2) Example:** check a tiny $2\\times 2$ numeric example to avoid shape mistakes. **3) Answer ①**","scenario_5":"**1) Steps:** identify whether the question is definition, computation, or application. **2) Example:** check a tiny $2\\times 2$ numeric example to avoid shape mistakes. **3) Answer ①**","scenario_6":"**1) Steps:** identify whether the question is definition, computation, or application. **2) Example:** check a tiny $2\\times 2$ numeric example to avoid shape mistakes. **3) Answer ①**","scenario_7":"**1) Steps:** identify whether the question is definition, computation, or application. **2) Example:** check a tiny $2\\times 2$ numeric example to avoid shape mistakes. **3) Answer ①**","scenario_8":"**1) Steps:** identify whether the question is definition, computation, or application. **2) Example:** check a tiny $2\\times 2$ numeric example to avoid shape mistakes. 
**3) Answer ①**","scenario_9":"**1) Steps:** identify whether the question is definition, computation, or application. **2) Example:** check a tiny $2\\times 2$ numeric example to avoid shape mistakes. **3) Answer ①**"},"problemTestCodes":{"definition_0":"answer = 2\nassert answer == 2","definition_1":"answer = 1\nassert answer == 1","definition_2":"answer = 1\nassert answer == 1","definition_3":"answer = 1\nassert answer == 1","definition_4":"answer = 1\nassert answer == 1","definition_5":"answer = 1\nassert answer == 1","definition_6":"answer = 1\nassert answer == 1","definition_7":"answer = 1\nassert answer == 1","definition_8":"answer = 1\nassert answer == 1","definition_9":"answer = 1\nassert answer == 1","trueFalse_0":"answer = 1\nassert answer == 1","trueFalse_1":"answer = 1\nassert answer == 1","trueFalse_2":"answer = 1\nassert answer == 1","trueFalse_3":"answer = 2\nassert answer == 2","trueFalse_4":"answer = 1\nassert answer == 1","trueFalse_5":"answer = 1\nassert answer == 1","trueFalse_6":"answer = 1\nassert answer == 1","trueFalse_7":"answer = 1\nassert answer == 1","trueFalse_8":"answer = 2\nassert answer == 2","trueFalse_9":"answer = 1\nassert answer == 1","calc_0":"answer = 1\nassert answer == 1","calc_1":"answer = 1\nassert answer == 1","calc_2":"answer = 1\nassert answer == 1","calc_3":"answer = 2\nassert answer == 2","calc_4":"answer = 1\nassert answer == 1","calc_5":"answer = 1\nassert answer == 1","calc_6":"answer = 1\nassert answer == 1","calc_7":"answer = 1\nassert answer == 1","calc_8":"answer = 1\nassert answer == 1","calc_9":"answer = 1\nassert answer == 1","concept_0":"answer = 1\nassert answer == 1","concept_1":"answer = 1\nassert answer == 1","concept_2":"answer = 1\nassert answer == 1","concept_3":"answer = 1\nassert answer == 1","concept_4":"answer = 1\nassert answer == 1","concept_5":"answer = 1\nassert answer == 1","concept_6":"answer = 1\nassert answer == 1","concept_7":"answer = 1\nassert answer == 1","concept_8":"answer = 
1\nassert answer == 1","concept_9":"answer = 1\nassert answer == 1","projection_0":"answer = 1\nassert answer == 1","projection_1":"answer = 1\nassert answer == 1","projection_2":"answer = 1\nassert answer == 1","projection_3":"answer = 1\nassert answer == 1","projection_4":"answer = 1\nassert answer == 1","projection_5":"answer = 1\nassert answer == 1","projection_6":"answer = 1\nassert answer == 1","projection_7":"answer = 1\nassert answer == 1","projection_8":"answer = 1\nassert answer == 1","projection_9":"answer = 1\nassert answer == 1","scenario_0":"answer = 1\nassert answer == 1","scenario_1":"answer = 1\nassert answer == 1","scenario_2":"answer = 1\nassert answer == 1","scenario_3":"answer = 1\nassert answer == 1","scenario_4":"answer = 1\nassert answer == 1","scenario_5":"answer = 1\nassert answer == 1","scenario_6":"answer = 1\nassert answer == 1","scenario_7":"answer = 1\nassert answer == 1","scenario_8":"answer = 1\nassert answer == 1","scenario_9":"answer = 1\nassert answer == 1"}},"midMathCh04":{"chapter":"Chapter 04","title":"Matrix multiplication and linear maps: a smart filter that designs your data","description":"Matrix multiplication is not just a tedious pile of additions and multiplications. A matrix plays the same role as a **smart filter in a digital photo editor**—rotating, twisting, and compressing raw data. 
This chapter dives into **linear transformations**: putting one datum (a vector) through an editor (a matrix) and mapping it into a genuinely different “space.” We unpack what it means, mathematically, for the backbone $\\mathbf{y} = W\\mathbf{x} + \\mathbf{b}$ of deep learning models to work the way it does.","sectionTitle":"Matrix multiplication and linear maps: edit space with full control","sectionLabels":{"whatIs":"What the idea is","whyImportant":"Why it matters","howUsed":"How it is used","problemSolving":"Notes for solving problems"},"visualShort":"Matrix×vector = move coordinates in one shot · matrix product = chain two maps","visualIntro":"Multiplying $A$ by a vector **mixes the numbers** into a new vector. **$AB$** means “do $B$, then $A$,” written as **one** product. Picture a grid tilting—that’s enough intuition.","visualStep1":"Concept: $A\\in\\mathbb{R}^{m\\times n}$ maps $\\mathbb{R}^n\\to\\mathbb{R}^m$","visualStep2":"Intuition: grid tilts and stretches (origin fixed)","visualStep3":"Formula: $(AB)_{ij}$ row–column dot; $(AB)\\mathbf{x}=A(B\\mathbf{x})$","visualStep4":"Use: FC layer, batched matmul, attention scores","visualStepsLabel":"Suggested viewing order","visualFlowTitle":"Learning flow","visualFlowStep0":"Concept: linear map = matrix × vector","visualFlowStep1":"Intuition: grid deformation and composition","visualFlowStep2":"Math: product rules · transpose · composition","visualFlowStep3":"Link: Ch.02 dot = one row of matmul","visualFlowStep4":"Apply: FC · batch · score matrix","mapVisualStep1":"① Choose input x","mapVisualStep2":"② Multiply by A","mapVisualStep3":"③ Output Ax stays on blue","mapVisualPanelLeft":"Input","mapVisualPanelRight":"Output","dotVisualAnimateHint":"On one plane, the square grid becomes a parallelogram under A, and x moves to Ax.","dotVisualPhaseHint0":"**1/4** Only $x_1$ moves; $x_2$ is fixed. 
The output slides **along the first column** (orange segment).","dotVisualPhaseHint1":"**2/4** Only $x_2$ moves; $x_1$ is fixed. The output moves **along the second column** (teal segment).","dotVisualPhaseHint2":"**3/4** $x_1=x_2$ together. The output follows the **sum of the two columns**.","dotVisualPhaseHint3":"**4/4** $(x_1,x_2)$ traces a circle; $A\\mathbf{x}$ loops on the **blue patch**.","dotVisualHudCoeffLine":"$$x_1={x1}$, $x_2={x2}$","dotVisualDecompKey":"","dotVisualEasyHook":"**In one line:** $A\\mathbf{x}$ **moves** the input $\\mathbf{x}$ to a new spot in one go. The **big green box** is “where answers could live”; the **blue patch** is “where they **actually** land” (spanned by the columns of $A$).","dotVisualAriaLabel":"Unit square grid is transformed by matrix A; point x moves to Ax","dotVisualMainTitle":"Square grid → $A$ → skewed grid","dotVisualPlotTitle":"The **same coordinates** on the left land **in one step** on the right—the whole grid stretches with them.","dotVisualMetricsTitle":"Remember this picture","dotVisualHudDot":"","dotVisualHudCos":"$$T(\\mathbf{x})=A\\mathbf{x}$. **$T$** is just a **name (function symbol)** for the **linear map** “multiply by $A$”. So **$T(\\mathbf{x})$** means “apply $T$ to $\\mathbf{x}$”, which is exactly **$A\\mathbf{x}$**.","dotVisualHudPlain":"**Blue region** = parallelogram spanned by the **two column vectors** of $A$. 
The output **$T(\\mathbf{x})=A\\mathbf{x}$** always lies **inside** that region (reachable as a combination of the columns).","dotVisualHudProj":"Column space","dotVisualLegendU":"Input / box","dotVisualLegendV":"$$A$","dotVisualLegendProj":"Reachable","dotVisualLegendRes":"$$\\mathbf{x} \\mapsto A\\mathbf{x}$","dotVisualInsetLabel":"t","mapVisualDomainCaption":"Input","mapVisualCodomainCaption":"All possible outputs","mapVisualRangeCaption":"Actually reachable","mapVisualRnLabel":"ℝ²","mapVisualRnSubLabel":"Real plane · input","mapVisualRnSvgTitle":"ℝ²: the 2D real space for inputs (pairs of coordinates). Not the regression R-squared.","mapVisualRnA11y":"ℝ²: the 2D real space for inputs (pairs of coordinates). Not the regression R-squared.","mapVisualRmLabel":"ℝ²","mapVisualRmSubLabel":"Real plane · output","mapVisualRmSvgTitle":"ℝ²: the 2D real space where we plot the transformed point; same dimension as in this figure.","mapVisualRmA11y":"ℝ²: the 2D real space where we plot the transformed point; same dimension as in this figure.","mapVisualLabelX":"x","mapVisualLabelTx":"Ax","mapVisualLabelMap":"A","mapVisualMapHint":"**Orange “A” badge:** the **matrix $A$** (linear map) for this figure. It multiplies the left-hand coordinates $\\mathbf{x}$ to produce the point $A\\mathbf{x}$ on the right.","mapVisualCol1Tag":"a₁","mapVisualCol2Tag":"a₂","mapVisualGhostHint":"Gray dashed: square if A = I","dotVisualCaption":"**$A\\mathbf{x}$** is the rule “mix the entries of $\\mathbf{x}$ using $A$.” On the left, $\\mathbf{x}$ moves on the **input patch**; on the right, the answer stays on the **blue patch**. **$AB$** means **chain** the maps: $B$ first, then $A$.","whatIs":{"0":"**1. Linear transformation (Linear Transformation): the “Free Transform” tool in an image editor**\n\n**Concept:** Imagine an image drawn on a transparent grid, opened in Photoshop. 
Dragging corners to stretch diagonally, rotating 45°, and shearing are all **linear transformations** in geometry.\n\n**Strict rules:** This tool has two rules that must never break. First, **the origin $(0,0)$ at the center of the image stays fixed** after the transform. Second, **lines that were straight stay straight** (no bending), and lines that were parallel stay parallel.","1":"**2. Matrix × vector ($A\\mathbf{x}$): applying a filter to the raw image**\n\n**Concept:** Here **$\\mathbf{x}$** is the “original data (position of a point)” with no effect applied yet, and **$A$** is a **smart filter (transform rule)** that shears and scales in specific directions. Applying the filter is written $A\\mathbf{x}$ (matrix $A$ acts on $\\mathbf{x}$).\n\n**In deep learning:** One neural network layer uses this to build **$\\mathbf{y} = W\\mathbf{x} + \\mathbf{b}$**.\n* $W$ (weight matrix): shears data into angles/scales that are easier for the model to analyze (linear map).\n* $\\mathbf{b}$ (bias vector): drags the sheared result sideways like moving a layer in an editor (translation).\nThe result $\\mathbf{y}$ after “deform + shift” is passed to the next layer.","2":"**3. Matrix × matrix ($AB$): stacking filters in order**\n\n**Concept:** Multiplying $A$ and $B$ means **applying two editing filters one after another**. The product $AB$ is read **right to left**, so you **apply $B$ first, then apply $A$** to the result.\n\n**Key fact ($AB \\neq BA$):** “Stretch horizontally by 2, then rotate 90°” gives a **tall** image; “rotate 90°, then stretch horizontally by 2” gives a **wide** image. Order can change the outcome, so in general **$AB \\neq BA$ (matrix multiplication is not commutative)**.","3":"**4. 
Matching dimensions: plugging compatible cables**\n\n**Concept:** When stacking filters, connectors must match: the **number of columns** of the left matrix must equal the **number of rows** of the right matrix.\n\n**Key formula:** In $(m\\times n)$ times $(n\\times p)$, the shared inner dimension $n$ is summed over and the output is **$(m\\times p)$**. In code, a **transpose** swaps rows and columns so that batch data $X$ and weights $W$ line up cleanly as **$Y = XW^{\\mathsf{T}}$**.","4":"$23"},"whyImportant":{"bridge":"**The magic of parallelism: millions of pixels in one shot**\n\nA single high-res photo can have millions of pixels. Processing each pixel in its own for-loop iteration would be far too slow for practical training. Matrix multiplication packs those numbers into **one huge table (matrix)** and encodes the transform as **another matrix**, so the “apply a filter” picture becomes **one multiply**.\n\nGPUs are built so **thousands of cores** share this work. **Batch GEMM** in TensorFlow/PyTorch stacks many samples as rows of $X$, multiplies by $W^{\\mathsf{T}}$ once, and pushes the whole mini-batch through **$Y = XW^{\\mathsf{T}}$**. Deep learning digests huge data quickly because **matrices** are a common format hardware can parallelize.","similarity":"**One shared language across AI**\n\nWhether Netflix recommendations, Tesla lane detection, or ChatGPT, much of the computation underneath reduces to **$Y = XW^{\\mathsf{T}}$**. Fully connected layers, embeddings, attention scores: different names, same **matrix × matrix** pattern.\n\nWith this mindset, **shape mismatches** are easier to debug: mismatched inner sizes are like **mismatched cable specs**. Once this “shared language” clicks, papers, code, and logs in different domains read off the **same map**."},"howUsed":{"ml":"$24","geometry":"$25"},"summary":"**Practitioner’s recap:** Matrix multiplication lets you read a layer not as a flat list of numbers but as a **spatial transform** of the data, **$\\mathbf{y} = W\\mathbf{x} + \\mathbf{b}$**. 
When stacking layers, **matching shapes ($(m \\times n) \\times (n \\times p)$)** comes first; never forget that **order ($AB \\neq BA$)** generally changes the result.","problemSolving":{"focus":"The table lists **shape rules** and identities. Examples sketch typical steps.","examplesHeading":"Worked examples","examplesTable":"**Example 1: shape**\n\nQ: $A$ is $4\\times 7$, $B$ is $7\\times 3$. Shape of $AB$?\n\nA: $4\\times 3$.\n\n---\n\n**Example 2: order**\n\nQ: Matrix for “$B$ then $A$”?\n\nA: $AB$.\n\n---\n\n**Example 3: transpose**\n\nQ: $(AB)^{\\mathsf T}$?\n\nA: $B^{\\mathsf T}A^{\\mathsf T}$.\n\n---\n\n**Example 4: column**\n\nQ: $A\\mathbf{e}_2$?\n\nA: Second **column** of $A$.\n\n---\n\n**Example 5: batch**\n\nQ: One-shot linear layer with rows as samples?\n\nA: Often $XW^{\\mathsf T}$."},"problemSolvingLabel":"Problem-solving notes","problemSolvingTable":"| Symbol | Meaning |\n| :--- | :--- |\n| $AB$ | Defined when cols of $A$ = rows of $B$ |\n| $(AB)_{ij}$ | Dot of row $i$ of $A$ and col $j$ of $B$ |\n| $A\\mathbf{x}$ | Vector whose entry $i$ is the dot of row $i$ of $A$ with $\\mathbf{x}$ |\n| $(AB)^{\\mathsf T}$ | $B^{\\mathsf T}A^{\\mathsf T}$ |\n| Composition | $\\mathbf{x}\\mapsto A(B\\mathbf{x})=(AB)\\mathbf{x}$ |\n| FC layer | $\\mathbf{y}=W\\mathbf{x}+\\mathbf{b}$ |\n\n**① Shapes** Inner dimensions must match.\n\n**② Batch** Same $W$ on each row → GEMM.","practiceProblemsTitle":"Practice problems","practiceProblemsIntro":"","practiceProblemsInstruction":"Read the question and choose the best option.","problems":{"definition_0":"For $A\\in\\mathbb{R}^{m\\times n}$ and $B\\in\\mathbb{R}^{p\\times q}$, a necessary condition for $AB$ to be **defined** is?\n\n① $m=p$\n② $n=p$\n③ $m=q$\n④ $n=m$","definition_1":"Which defines $(AB)_{ij}$? 
($i$-th **row** of $A$, $j$-th **column** of $B$)\n\n① $a_{ij}b_{ij}$\n② **Dot product** of row $i$ of $A$ and column $j$ of $B$\n③ $a_{ij}+b_{ij}$\n④ $a_{ji}b_{ji}$","definition_2":"If $A\\in\\mathbb{R}^{m\\times n}$ and $B\\in\\mathbb{R}^{n\\times p}$, the shape of $AB$ is?\n\n① $m\\times p$\n② $n\\times n$\n③ $m\\times n$\n④ $p\\times m$","definition_3":"For $A\\in\\mathbb{R}^{m\\times n}$, to have $AI_n=A$, the identity $I_n$ must be?\n\n① $n\\times n$\n② $m\\times m$\n③ $m\\times n$\n④ $n\\times m$","definition_4":"For $A\\in\\mathbb{R}^{m\\times n}$, to have $I_m A=A$, the identity $I_m$ must be?\n\n① $m\\times m$\n② $n\\times n$\n③ $m\\times n$\n④ $n\\times m$","definition_5":"Transpose rule for $(AB)^{\\mathsf T}$:\n\n① $A^{\\mathsf T}B^{\\mathsf T}$\n② $B^{\\mathsf T}A^{\\mathsf T}$\n③ $(A^{\\mathsf T})^{\\mathsf T}B$\n④ $AB^{\\mathsf T}$","definition_6":"For $A\\in\\mathbb{R}^{m\\times n}$ and $\\mathbf{u}\\in\\mathbb{R}^n$, the vector $A\\mathbf{u}$ lies in?\n\n① $\\mathbb{R}^m$\n② $\\mathbb{R}^n$\n③ $\\mathbb{R}^{mn}$\n④ $\\mathbb{R}^{m+n}$","definition_7":"For a linear map $T(\\mathbf{x})=A\\mathbf{x}$, which always holds?\n\n① $T(\\mathbf{0})=\\mathbf{0}$\n② $T(\\mathbf{x})=\\mathbf{x}$\n③ $\\|T(\\mathbf{x})\\|=\\|\\mathbf{x}\\|$\n④ $T(\\mathbf{x}+\\mathbf{y})=T(\\mathbf{x})T(\\mathbf{y})$","definition_8":"Which inequality always holds for $\\mathrm{rank}(AB)$?\n\n① $\\mathrm{rank}(AB)\\ge \\mathrm{rank}(A)$\n② $\\mathrm{rank}(AB)\\le \\min(\\mathrm{rank}(A),\\mathrm{rank}(B))$\n③ $\\mathrm{rank}(AB)=\\mathrm{rank}(A)+\\mathrm{rank}(B)$\n④ $\\mathrm{rank}(AB)=mn$","definition_9":"Column-vector convention: matrix for “apply $B$, then $A$” is?\n\n① $AB$\n② $BA$\n③ $A+B$\n④ $A^{\\mathsf T}B^{\\mathsf T}$","trueFalse_0":"If **true** choose ①, if **false** choose ②.\n\nFor all square $A,B$, $AB=BA$.\n\n① True\n② False","trueFalse_1":"Whenever defined, $(AB)C=A(BC)$.\n\n① True\n② False","trueFalse_2":"Whenever defined, $A(B+C)=AB+AC$.\n\n① True\n② 
False","trueFalse_3":"If $AB=O$, must $A=O$ or $B=O$?\n\n① True\n② False","trueFalse_4":"Always $(A+B)^2=A^2+2AB+B^2$ for square matrices?\n\n① True\n② False","trueFalse_5":"For same-size square $A,B$, $\\det(AB)=\\det(A)\\det(B)$?\n\n① True\n② False","trueFalse_6":"Linear $T(\\mathbf{x})=A\\mathbf{x}$ always has $T(\\mathbf{0})=\\mathbf{0}$?\n\n① True\n② False","trueFalse_7":"Orthogonal $Q$ satisfies $Q^{\\mathsf T}Q=I$?\n\n① True\n② False","trueFalse_8":"Scaling by $c$ is the matrix $cI$?\n\n① True\n② False","trueFalse_9":"If each **row** of batch $X$ is a sample, row-wise $\\mathbf{y}^{\\mathsf T}=\\mathbf{x}^{\\mathsf T}W^{\\mathsf T}$ equals multiplying each sample by the same $W^{\\mathsf T}$?\n\n① True\n② False","calc_0":"Let $A=\\begin{pmatrix}1&2\\\\3&4\\end{pmatrix}$, $B=\\begin{pmatrix}0&1\\\\1&0\\end{pmatrix}$. What is $(AB)_{11}$?\n\n① $2$\n② $1$\n③ $3$\n④ $0$","calc_1":"Let $A=\\begin{pmatrix}1&0\\\\0&2\\end{pmatrix}$, $\\mathbf{x}=\\begin{pmatrix}3\\\\4\\end{pmatrix}$. First component of $A\\mathbf{x}$?\n\n① $3$\n② $4$\n③ $7$\n④ $12$","calc_2":"Let $R=\\begin{pmatrix}0&-1\\\\1&0\\end{pmatrix}$ (CCW $90^\\circ$). Then $R\\begin{pmatrix}1\\\\0\\end{pmatrix}$ is?\n\n① $\\begin{pmatrix}0\\\\1\\end{pmatrix}$\n② $\\begin{pmatrix}1\\\\0\\end{pmatrix}$\n③ $\\begin{pmatrix}-1\\\\0\\end{pmatrix}$\n④ $\\begin{pmatrix}0\\\\-1\\end{pmatrix}$","calc_3":"The $(2,2)$ entry of $\\begin{pmatrix}2&1\\\\0&3\\end{pmatrix}\\begin{pmatrix}1&1\\\\0&1\\end{pmatrix}$ is?\n\n① $3$\n② $4$\n③ $6$\n④ $0$","calc_4":"Value of $\\begin{pmatrix}1&2&3\\end{pmatrix}\\begin{pmatrix}4\\\\5\\\\6\\end{pmatrix}$?\n\n① $32$\n② $21$\n③ $18$\n④ $720$","calc_5":"Let $A=\\begin{pmatrix}1&1\\\\0&1\\end{pmatrix}^2$. 
The $(1,2)$ entry is?\n\n① $2$\n② $1$\n③ $0$\n④ $3$","calc_6":"The $(2,1)$ entry of $\\begin{pmatrix}1&2\\\\3&4\\end{pmatrix}\\begin{pmatrix}1&0\\\\0&0\\end{pmatrix}$ is?\n\n① $3$\n② $1$\n③ $0$\n④ $4$","calc_7":"For $B=\\begin{pmatrix}1&2\\\\3&4\\end{pmatrix}$, $B\\mathbf{e}_1$ is?\n\n① First **column** of $B$\n② First **row** of $B$\n③ Zero\n④ $(1,0)^{\\mathsf T}$","calc_8":"Let $A=\\begin{pmatrix}1&0\\\\0&0\\end{pmatrix}$, $B=\\begin{pmatrix}0&0\\\\0&1\\end{pmatrix}$. Then $AB$ is?\n\n① Zero matrix\n② $I_2$\n③ $\\begin{pmatrix}1&0\\\\0&1\\end{pmatrix}$\n④ $\\begin{pmatrix}0&1\\\\1&0\\end{pmatrix}$","calc_9":"Multiply $\\begin{pmatrix}3\\end{pmatrix}\\begin{pmatrix}2\\end{pmatrix}$ (both $1\\times 1$).\n\n① $\\begin{pmatrix}6\\end{pmatrix}$\n② $5$\n③ Undefined\n④ $13$","concept_0":"In $\\mathbf{y}=W\\mathbf{x}+\\mathbf{b}$, $W$ mainly?\n\n① **Linearly mixes** input features\n② Always outputs probabilities\n③ Always rotates images\n④ Minimizes loss directly","concept_1":"How many entries in $W\\in\\mathbb{R}^{d_{out}\\times d_{in}}$?\n\n① $d_{out}\\times d_{in}$\n② $B\\times d_{in}$\n③ $d_{in}+d_{out}$\n④ $B\\times d_{out}$","concept_2":"Coordinate $i$ of $A\\mathbf{u}$ matches Ch.02?\n\n① Dot of row $i$ of $A$ with $\\mathbf{u}$\n② Outer product only\n③ Norm of $\\mathbf{u}$\n④ Determinant","concept_3":"Deep linear-only nets are?\n\n① Compositions of matrix products (and biases)\n② Always adding the same matrix\n③ Determinants only\n④ Transposes only","concept_4":"One-shot batch formula for $X\\in\\mathbb{R}^{B\\times d_{in}}$, $W\\in\\mathbb{R}^{d_{out}\\times d_{in}}$?\n\n① $XW^{\\mathsf T}$\n② $WX$ (always defined)\n③ $X+X$\n④ Only $W^{\\mathsf T}X^{\\mathsf T}$","concept_5":"Before activation $\\sigma$, one layer is?\n\n① Linear map (matrix)\n② Always nonlinear only\n③ Always softmax\n④ The loss","concept_6":"First check for linear layer on sample-as-row $X$?\n\n① Match shapes of $X$ and $W$\n② Never transpose\n③ Collapse to scalars\n④ Batch size 
1","concept_7":"Why is multiplication usually non-commutative?\n\n① Order of maps can change the result\n② Matrices are always symmetric\n③ No dot products\n④ No inverses","concept_8":"In $\\hat{\\mathbf{y}}=X\\boldsymbol{\\beta}$, $X\\boldsymbol{\\beta}$ is?\n\n① Linear combo of columns of $X$\n② Always a norm\n③ Determinant\n④ Only eigendecomposition","concept_9":"Outputs $\\{A\\mathbf{x}\\}$ form the?\n\n① Column space\n② Unit sphere\n③ One scalar\n④ Always full space","projection_0":"For standard basis $\\mathbf{e}_j$, $A\\mathbf{e}_j$ is?\n\n① Column $j$ of $A$\n② Row $j$ of $A$\n③ Always zero\n④ Diagonal only","projection_1":"If $A\\mathbf{x}=\\mathbf{0}$ for all $\\mathbf{x}$, $\\mathrm{rank}(A)$ is?\n\n① $0$\n② Always $n$\n③ Always $m$\n④ Always $\\min(m,n)$","projection_2":"$$\\{A\\mathbf{x}: \\mathbf{x}\\in\\mathbb{R}^n\\}$ is the?\n\n① **Column space** of $A$\n② Always $\\mathbb{R}^m$\n③ Always $\\{\\mathbf{0}\\}$\n④ Always row space","projection_3":"$$A(B\\mathbf{x})=(AB)\\mathbf{x}$ means?\n\n① Composition ↔ matrix multiply\n② Always $AB=BA$\n③ Commutative\n④ No dots","projection_4":"If $P^2=P$, $P\\mathbf{x}$ intuitively?\n\n① Projects onto a subspace\n② Rotation only\n③ Always invertible globally\n④ Scalar only","projection_5":"For $A\\in\\mathbb{R}^{m\\times n}$, $m0$ for a $2\\times 2$ real $A$, orientation of the linear map is:\n\n① preserved (no reflection)\n② always symmetric\n③ only rotation\n④ always diagonalizable","concept_9":"In $\\mathbb{R}^3$, the volume of the parallelepiped spanned by the three columns of $A$ is:\n\n① $\\lvert\\det([\\mathbf{a}_1\\ \\mathbf{a}_2\\ \\mathbf{a}_3])\\rvert$\n② sum of norms\n③ $\\mathrm{tr}(A)$\n④ always $1$","projection_0":"Laplace (cofactor) expansion of $\\det(A)$ along a row/column is:\n\n① a standard valid method\n② only defined for $3\\times 3$\n③ only for symmetric $A$\n④ always yields $0$ after transpose","projection_1":"The adjugate matrix satisfies:\n\n① $A\\,\\mathrm{adj}(A)=\\det(A)\\,I$\n② 
$A\\,\\mathrm{adj}(A)=I$\n③ $\\mathrm{adj}(A)=A^{-1}$\n④ $\\det(\\mathrm{adj}(A))=0$","projection_2":"For invertible $A$, $\\det(A^{-1}BA)$ equals:\n\n① $\\det(B)$\n② $\\det(A)$\n③ $\\det(A^{-1})$\n④ $\\det(A)+\\det(B)$","projection_3":"If $\\lambda$ is an eigenvalue of $A$, then necessarily:\n\n① $\\det(A-\\lambda I)=0$\n② $\\det(A-\\lambda I)=1$\n③ $\\det(A)=\\lambda$\n④ $A=\\lambda I$","projection_4":"Using Cramer’s rule for all three coordinates of a $3\\times 3$ system typically requires how many determinants?\n\n① $4$\n② $1$\n③ $9$\n④ only $3$","projection_5":"For block-diagonal $\\begin{pmatrix}A&0\\\\0&D\\end{pmatrix}$ with square blocks, $\\det$ equals:\n\n① $\\det(A)\\det(D)$\n② $\\det(A)+\\det(D)$\n③ $\\det(AD)$\n④ $0$","projection_6":"Swapping two rows of a matrix:\n\n① flips the sign of $\\det$\n② leaves $\\det$ unchanged\n③ forces $\\det=0$\n④ doubles $\\det$","projection_7":"Adding a multiple of one row to another row:\n\n① leaves $\\det$ unchanged\n② only flips sign\n③ forces $\\det=0$\n④ doubles $\\det$","projection_8":"If $AB$ is invertible for $n\\times n$ real $A,B$, then:\n\n① both $A$ and $B$ are invertible\n② only $A$ must be invertible\n③ only $B$ must be invertible\n④ one of them must be zero","projection_9":"In $\\mathbb{R}^n$, an invertible linear map $A$ scales any region volume $V$ by:\n\n① $\\lvert\\det(A)\\rvert\\cdot V$\n② $V/\\lvert\\det(A)\\rvert$\n③ always $V$\n④ $\\mathrm{tr}(A)\\cdot V$","scenario_0":"PyTorch torch.linalg.det(A) keeps batch dimensions and returns determinants of the last two axes. 
This means:\n\n① many small matrices can be processed at once\n② always a single scalar\n③ it also returns inverses\n④ only defined on GPU","scenario_1":"Why is torch.linalg.solve(A, b) usually preferred over forming inv(A) and multiplying by b?\n\n① more stable and often faster direct solve\n② determinants cannot be computed\n③ inverses never exist\n④ b must not be a vector","scenario_2":"If the Hessian (or GN approximation) is nearly singular during training, a common symptom is:\n\n① exploding/unstable step directions\n② immediate guaranteed convergence\n③ loss becomes exactly $0$\n④ gradients vanish entirely","scenario_3":"A main reason ridge regression uses $X^{\\mathsf T}X+\\lambda I$ with $\\lambda>0$ is:\n\n① to make the matrix better conditioned / invertible\n② to force determinant $0$\n③ to forbid inverses\n④ to shrink batch size","scenario_4":"The $\\det(\\Sigma)^{-1/2}$ factor in a multivariate normal density is directly tied to:\n\n① volume scaling under a linear change (Jacobian idea)\n② softmax temperature\n③ ReLU slope\n④ dropout rate","scenario_5":"For overdetermined $A\\mathbf{x}=\\mathbf{b}$, the Moore–Penrose pinv is closest to:\n\n① providing a meaningful least-norm least-squares solution when not invertible\n② forcing $\\det(A)=1$\n③ always returning an exact solution\n④ computing softmax","scenario_6":"Many nearly singular Hessian directions on a loss surface often indicate:\n\n① flat valleys / ambiguous curvature\n② a unique global minimum only\n③ gradients are always zero\n④ learning rate is meaningless","scenario_7":"If det(A) is extremely close to $0$, what can we safely conclude?\n\n① inversion may be numerically unstable\n② training is impossible\n③ parameters are optimal\n④ softmax diverges","scenario_8":"If $A=Q\\Lambda Q^{-1}$ is diagonalizable, then $\\det(A)$ equals:\n\n① product of eigenvalues\n② sum of eigenvalues\n③ $\\mathrm{tr}(Q)$\n④ always $0$","scenario_9":"When a mini-batch covariance $S$ is nearly singular, a common way 
to stabilize $\\log\\det S$ in a log-likelihood is:\n\n① Cholesky / small $\\varepsilon I$ regularization\n② force determinant $0$\n③ replace $S$ by zeros\n④ apply softmax to $S$"},"problemAnswers":{"definition_0":2,"definition_1":1,"definition_2":2,"definition_3":3,"definition_4":1,"definition_5":1,"definition_6":2,"definition_7":1,"definition_8":1,"definition_9":2,"trueFalse_0":2,"trueFalse_1":1,"trueFalse_2":2,"trueFalse_3":1,"trueFalse_4":1,"trueFalse_5":1,"trueFalse_6":2,"trueFalse_7":1,"trueFalse_8":2,"trueFalse_9":1,"calc_0":1,"calc_1":1,"calc_2":3,"calc_3":1,"calc_4":1,"calc_5":1,"calc_6":1,"calc_7":1,"calc_8":1,"calc_9":1,"concept_0":1,"concept_1":1,"concept_2":2,"concept_3":2,"concept_4":1,"concept_5":1,"concept_6":1,"concept_7":1,"concept_8":1,"concept_9":1,"projection_0":1,"projection_1":1,"projection_2":1,"projection_3":1,"projection_4":1,"projection_5":1,"projection_6":1,"projection_7":1,"projection_8":1,"projection_9":1,"scenario_0":1,"scenario_1":1,"scenario_2":1,"scenario_3":1,"scenario_4":1,"scenario_5":1,"scenario_6":1,"scenario_7":1,"scenario_8":1,"scenario_9":1},"problemSolutions":{"definition_0":"**1) Definition:** $\\det(A)=ad-bc$. **2) Example:** $\\begin{pmatrix}2&1\\\\0&3\\end{pmatrix}$ gives $6$. **3) Answer ②**","definition_1":"**1) Fact:** Square $A$ is invertible iff $\\det(A)\\neq0$. **2) Example:** $\\det\\begin{pmatrix}1&1\\\\0&1\\end{pmatrix}=1$. **3) Answer ①**","definition_2":"**1) Rule:** Undo in reverse order: $(AB)^{-1}=B^{-1}A^{-1}$. **3) Answer ②**","definition_3":"**1) Geometry:** Volume scale is $1$. **3) Answer ③**","definition_4":"**1) Property:** Transpose does not change $\\det$. **3) Answer ①**","definition_5":"**1) Rule:** Invert diagonal entries. **3) Answer ①**","definition_6":"**1) Rule:** Factor $2$ from each row: $2\\cdot2=4$. **3) Answer ②**","definition_7":"**1) Geometry:** Parallelogram area scales by $\\lvert\\det(A)\\rvert$. **3) Answer ①**","definition_8":"**1) Link:** Invertible $\\Rightarrow$ full rank. 
**3) Answer ①**","definition_9":"**1) Theorem:** $\\det(AB)=\\det(A)\\det(B)$. **3) Answer ②**","trueFalse_0":"**1) Terminology:** Singular means $\\det(A)=0$. **3) Answer ②**","trueFalse_1":"**1) Theorem:** Product rule. **3) Answer ①**","trueFalse_2":"**1) Counterexample:** Zero matrix. **3) Answer ②**","trueFalse_3":"**1) Derive:** $\\det(A)\\det(A^{-1})=1$. **3) Answer ①**","trueFalse_4":"**1) Example:** Two axis projections summing to $I_2$. **3) Answer ①**","trueFalse_5":"**1) Theorem:** $\\det(Q)^2=1$. **3) Answer ①**","trueFalse_6":"**1) Counterexample:** $A=B=I$. **3) Answer ②**","trueFalse_7":"**1) Theorem:** Triangular case. **3) Answer ①**","trueFalse_8":"**1) Link:** $\\det=0$ implies dependence. **3) Answer ②**","trueFalse_9":"**1) Derive:** $\\det(AA)=\\det(A)^2$. **3) Answer ①**","calc_0":"**1) Compute:** $4-6=-2$. **3) Answer ①**","calc_1":"**1) Compute:** $2\\cdot3=6$. **3) Answer ①**","calc_2":"**1) Compute:** duplicate columns $\\Rightarrow$ $0$. **3) Answer ③**","calc_3":"**1) Compute:** $A^{-1}=\\mathrm{diag}(1,1/2)$. **3) Answer ①**","calc_4":"**1) Compute:** $3-2=1$. **3) Answer ①**","calc_5":"**1) Compute:** $0-(-1)=1$. **3) Answer ①**","calc_6":"**1) Compute:** $A^{-1}=\\frac12I$, trace $1$. **3) Answer ①**","calc_7":"**1) Compute:** rows proportional $\\Rightarrow$ $0$. **3) Answer ①**","calc_8":"**1) Compute:** inverse is $\\begin{pmatrix}1&-1\\\\0&1\\end{pmatrix}$. **3) Answer ①**","calc_9":"**1) Compute:** $\\cos^2t+\\sin^2t=1$. **3) Answer ①**","concept_0":"**1) Link:** invertible Hessian $\\Rightarrow$ solvable Newton system. **3) Answer ①**","concept_1":"**1) Practice:** direct solves are usually better conditioned. **3) Answer ①**","concept_2":"**1) Order:** undo $A$ then $B$ $\\Rightarrow$ $B^{-1}A^{-1}$. **3) Answer ②**","concept_3":"**1) Intuition:** rank drop / collapse. **3) Answer ②**","concept_4":"**1) Link:** invertible Gram matrix. **3) Answer ①**","concept_5":"**1) Numerics:** ill-conditioning. 
**3) Answer ①**","concept_6":"**1) SVD view:** product of singular values. **3) Answer ①**","concept_7":"**1) Formula:** denominator $\\det(A)$. **3) Answer ①**","concept_8":"**1) Sign:** positive determinant preserves orientation. **3) Answer ①**","concept_9":"**1) Geometry:** absolute determinant is volume. **3) Answer ①**","projection_0":"**1) Theorem:** cofactor expansion. **3) Answer ①**","projection_1":"**1) Definition:** adjugate identity. **3) Answer ①**","projection_2":"**1) Compute:** similarity preserves determinant. **3) Answer ①**","projection_3":"**1) Characteristic polynomial:** eigenvalues are roots. **3) Answer ①**","projection_4":"**1) Count:** $\\det(A)$ plus three modified determinants. **3) Answer ①**","projection_5":"**1) Theorem:** block diagonal determinant. **3) Answer ①**","projection_6":"**1) Property:** row swap $\\Rightarrow$ sign flip. **3) Answer ①**","projection_7":"**1) Property:** row addition invariant. **3) Answer ①**","projection_8":"**1) Theorem:** $\\det(AB)\\neq0$ implies both nonsingular. **3) Answer ①**","projection_9":"**1) Geometry:** Jacobian magnitude. **3) Answer ①**","scenario_0":"**1) Practice:** batched determinants. **3) Answer ①**","scenario_1":"**1) Practice:** avoid explicit inverse. **3) Answer ①**","scenario_2":"**1) Link:** ill-conditioned quadratic model. **3) Answer ①**","scenario_3":"**1) Stats/ML:** regularization stabilizes inversion. **3) Answer ①**","scenario_4":"**1) Link:** normalization constant from covariance volume. **3) Answer ①**","scenario_5":"**1) Practice:** Moore–Penrose pseudoinverse. **3) Answer ①**","scenario_6":"**1) Optimization:** poor conditioning directions. **3) Answer ①**","scenario_7":"**1) Caution:** floating-point conditioning. **3) Answer ①**","scenario_8":"**1) Theorem:** determinant is product of eigenvalues. **3) Answer ①**","scenario_9":"**1) Practice:** PSD stabilization tricks. 
**3) Answer ①**"},"problemTestCodes":{"definition_0":"answer = 2\nassert answer == 2","definition_1":"answer = 1\nassert answer == 1","definition_2":"answer = 2\nassert answer == 2","definition_3":"answer = 3\nassert answer == 3","definition_4":"answer = 1\nassert answer == 1","definition_5":"answer = 1\nassert answer == 1","definition_6":"answer = 2\nassert answer == 2","definition_7":"answer = 1\nassert answer == 1","definition_8":"answer = 1\nassert answer == 1","definition_9":"answer = 2\nassert answer == 2","trueFalse_0":"answer = 2\nassert answer == 2","trueFalse_1":"answer = 1\nassert answer == 1","trueFalse_2":"answer = 2\nassert answer == 2","trueFalse_3":"answer = 1\nassert answer == 1","trueFalse_4":"answer = 1\nassert answer == 1","trueFalse_5":"answer = 1\nassert answer == 1","trueFalse_6":"answer = 2\nassert answer == 2","trueFalse_7":"answer = 1\nassert answer == 1","trueFalse_8":"answer = 2\nassert answer == 2","trueFalse_9":"answer = 1\nassert answer == 1","calc_0":"answer = 1\nassert answer == 1","calc_1":"answer = 1\nassert answer == 1","calc_2":"answer = 3\nassert answer == 3","calc_3":"answer = 1\nassert answer == 1","calc_4":"answer = 1\nassert answer == 1","calc_5":"answer = 1\nassert answer == 1","calc_6":"answer = 1\nassert answer == 1","calc_7":"answer = 1\nassert answer == 1","calc_8":"answer = 1\nassert answer == 1","calc_9":"answer = 1\nassert answer == 1","concept_0":"answer = 1\nassert answer == 1","concept_1":"answer = 1\nassert answer == 1","concept_2":"answer = 2\nassert answer == 2","concept_3":"answer = 2\nassert answer == 2","concept_4":"answer = 1\nassert answer == 1","concept_5":"answer = 1\nassert answer == 1","concept_6":"answer = 1\nassert answer == 1","concept_7":"answer = 1\nassert answer == 1","concept_8":"answer = 1\nassert answer == 1","concept_9":"answer = 1\nassert answer == 1","projection_0":"answer = 1\nassert answer == 1","projection_1":"answer = 1\nassert answer == 1","projection_2":"answer = 1\nassert answer == 
1","projection_3":"answer = 1\nassert answer == 1","projection_4":"answer = 1\nassert answer == 1","projection_5":"answer = 1\nassert answer == 1","projection_6":"answer = 1\nassert answer == 1","projection_7":"answer = 1\nassert answer == 1","projection_8":"answer = 1\nassert answer == 1","projection_9":"answer = 1\nassert answer == 1","scenario_0":"answer = 1\nassert answer == 1","scenario_1":"answer = 1\nassert answer == 1","scenario_2":"answer = 1\nassert answer == 1","scenario_3":"answer = 1\nassert answer == 1","scenario_4":"answer = 1\nassert answer == 1","scenario_5":"answer = 1\nassert answer == 1","scenario_6":"answer = 1\nassert answer == 1","scenario_7":"answer = 1\nassert answer == 1","scenario_8":"answer = 1\nassert answer == 1","scenario_9":"answer = 1\nassert answer == 1"}},"midMathCh06":{"chapter":"Chapter 06","title":"Linear Independence and Rank: How Many Real Dimensions?","description":"Imagine a startup with 100 employees on paper. In practice, 20 people drive new ideas while 80 mostly copy the same approvals with different names. Is the “real workload dimension” 100 or 20?\n\nLast chapter: matrices reshape space. Here we learn to spot **fake vs real** arrows in data: **linear independence** (a direction nobody else can replace) vs **dependence** (a free rider that is just a combination). After stripping redundant shadows, **rank** counts the **true backbone** of information—without being fooled by raw column counts.","sectionTitle":"Linear Independence and Rank: How Many Real Dimensions?","sectionLabels":{"whatIs":"What it is","whyImportant":"Why it matters","howUsed":"How it is used","problemSolving":"Problem-solving notes"},"visualShort":"Same line vs new direction · rank 1↔2","visualIntro":"The **dashed line** is the first direction. 
The **orange** vector moves **on** that line, then **off**—watch **rank** flip between **1** and **2**.","visualStep1":"Definition: $\\sum c_i\\mathbf{v}_i=\\mathbf{0}$ forces all $c_i=0$ ⇔ linear independence","visualStep2":"Intuition: **collinear** ≈ dependence; **break away** → independence & rank","visualStep3":"Math: $\\mathrm{rank}(A)$ = column-space dimension = pivot count","visualStep4":"Uses: multicollinearity, ridge, layer bottlenecks","visualStepsLabel":"Reading order","visualFlowTitle":"Learning flow","visualFlowStep0":"Concept: independence, dependence, basis, rank","visualFlowStep1":"Intuition: collinearity vs independence, rank","visualFlowStep2":"Geometry ↔ algebra","visualFlowStep3":"Link: Ch.05 invertibility and $\\det$","visualFlowStep4":"Applications: regression & deep nets","rankVisualAriaLabel":"A dashed line for the first direction and two vectors from the origin; the corner badge fades between rank 1 and rank 2 as the orange vector moves.","rankVisualMainTitle":"Linear Independence and Rank: How Many Real Dimensions?","rankVisualSubtitle":"**Independent** means the directions **don’t overlap**. **Rank** counts **non-redundant** directions (here **1** or **2** in this demo).","rankVisualCaption":"When the **orange** arrow lies **on the dashed line** (the span of the first direction), it does not add a new axis—**linear dependence**; in this figure the demo reads **rank 1**.\n\nWhen orange **leaves the line**, the two directions differ → **linear independence**, and the demo reads **rank 2**.","whatIs":{"0":"**1. Linear independence — “RGB primaries”**\n\nMixing paint or light, **red, green, blue** are **fundamental**: you cannot synthesize one from the others alone. Vectors are **linearly independent** when no vector is a combination of the rest: $c_1\\mathbf{v}_1+\\cdots+c_k\\mathbf{v}_k=\\mathbf{0}$ forces **every** $c_i=0$. Each new independent vector opens a **new axis** of information.","1":"**2. 
Linear dependence — echoes and “free riders”**\n\nIf red and green lights already exist, adding a “**yellow**” lamp (red + green) does **not** widen the color gamut: it is **redundant**. If $\\mathbf{v}_3=2\\mathbf{v}_1+3\\mathbf{v}_2$, the third vector is a **linear combination** of the others: this is **dependence**. It may look like more data, but it is an **echo**, not new signal.","2":"**3. Rank — “information purity” after defoaming**\n\n$\\mathrm{rank}(A)$ is the **maximum number of independent columns**, no matter whether you have 100 or 1000 columns. If 100 arrows all lie in one **plane**, rank is at most **2** (exactly 2 unless they all share one line): rank is the **true effective dimension** of the data.","3":"**4. Basis — minimal steel frame**\n\nA **basis** is an **independent set** that still **spans** the whole subspace, so no smaller set can span it. It is like the steel frame that fixes a building’s shape even if many bricks fill the walls. The number of basis vectors is the **dimension**.","4":"**5. Link to Ch.05 — what $\\det(A)$ means, and rank**\n\nThe **determinant** $\\det(A)$ is **one number** that tells how an $n\\times n$ linear map **scales unit $n$-dimensional volume** (in 2D: **area** of the unit square). If $\\det(A)=0$, the map **squashes** space so volume collapses and **no inverse** exists; if $\\det(A)\\neq 0$, you can **undo** the map with **$A^{-1}$** (Ch.05).\n\nFor an $n\\times n$ matrix $A$, if $\\mathrm{rank}(A)=n$, the columns are independent and no direction is flattened, so **$\\det(A)\\neq 0$** and **$A^{-1}$** exists. Lower rank collapses at least one direction, so **$\\det(A)=0$** and inversion fails."},"whyImportant":{"bridge":"Five witnesses sound great, unless they all watched from the **same window** (dependence): you hear **one** clue five times (rank 1). Three witnesses from a street, a rooftop, and CCTV (independence, rank 3) carry **far more** real information.\n\nIn ML, feeding in both “area in m²” and “area in pyeong” gives two columns that point in the **same direction**: **multicollinearity**. 
The model may not “notice” the duplication and can return unstable or nonsense weights.","similarity":"**Rank** asks: *how many nutrient-dense directions are really here?* Stripping redundant mixtures is core prep for stable training and faster, clearer computation."},"howUsed":{"ml":"**1. Saving linear regression (ridge)**\nThe closed-form least-squares solution needs $(X^{\\mathsf T}X)^{-1}$. Exactly duplicated columns make $X^{\\mathsf T}X$ singular, and nearly duplicated columns make it nearly singular (ill-conditioned). **Ridge** adds a small diagonal “shim” $\\lambda I$ with positive $\\lambda$ (like slipping a toothpick into a crushed sandwich) to restore **numerical volume** so an inverse can be computed.","geometry":"**2. Bottlenecks in deep nets**\nThink of a 100-lane highway through linear layers. If a layer has **effective rank 10**, the road suddenly narrows to about 10 lanes: an **information bottleneck** in which most lanes of detail are lost. Designers watch rank-like behavior when choosing widths."},"summary":"**One line:** independence = **irreplaceable** directions; dependence = **mixtures**; rank = **true dimension** after removing foam.","problemSolving":{"focus":"The table lists **symbols and tips**. 
**Worked patterns** walk through representative practice types—**definitions**, **true/false**, **numeric rank**, **rank–nullity**, **rank identities**, **short scenarios**—in a compact **question / solution** format.","examplesHeading":"Worked patterns","examplesTable":"$27"},"problemSolvingLabel":"Problem-solving notes","problemSolvingTable":"| Symbol | Meaning |\n| :--- | :--- |\n| Independent | Only trivial combination gives zero |\n| Dependent | One column is a combination of others |\n| $\\mathrm{rank}(A)$ | Dimension of column space |\n| Basis | Independent spanning set |\n| $\\mathrm{rank}(AB)$ | $\\le\\min\\{\\mathrm{rank}A,\\mathrm{rank}B\\}$ |\n| $\\det(A)$ | Volume/area scaling factor for the map (Ch.05); $\\det(A)=0$ ⇒ no inverse |","practiceProblemsTitle":"Practice","practiceProblemsIntro":"**10 random questions** are drawn from a bank of 60.","practiceProblemsInstruction":"Read each question carefully and choose the best answer.","problems":{"definition_0":"Which best characterizes linear independence of $\\mathbf{v}_1,\\mathbf{v}_2$?\n\n① Equal norms always\n② $c_1\\mathbf{v}_1+c_2\\mathbf{v}_2=\\mathbf{0}\\Rightarrow c_1=c_2=0$\n③ Dot product is 0\n④ Both are unit vectors","definition_1":"Which defines $\\mathrm{rank}(A)$?\n\n① Number of rows\n② Dimension of column space\n③ Sum of entries\n④ Trace","definition_2":"The size of a basis of a subspace is?\n\n① Can vary for the same subspace\n② Fixed for the same subspace\n③ Always equals number of rows\n④ Always 1","definition_3":"Max number of linearly independent vectors in $\\mathbb{R}^3$?\n\n① 2\n② 3\n③ 4\n④ Infinitely many","definition_4":"If columns of $A$ are linearly dependent then?\n\n① Rank equals number of columns\n② Rank is less than number of columns\n③ $\\det(A)=1$\n④ $A$ must be square","definition_5":"For $A\\in\\mathbb{R}^{m\\times n}$, $\\mathrm{rank}(A)\\le$?\n\n① $\\min(m,n)$\n② $m+n$\n③ $\\max(m,n)$\n④ $mn$","definition_6":"The set containing only $\\mathbf{0}$ is?\n\n① Always 
independent\n② Not independent\n③ Independent iff $n\\ge2$\n④ A basis","definition_7":"$$\\mathrm{rank}(A^{\\mathsf T})$ vs $\\mathrm{rank}(A)$?\n\n① Always equal\n② Always different\n③ Decreases by 1 after transpose\n④ Always 0","definition_8":"$$\\dim(W)$ for a subspace $W$ equals?\n\n① Number of basis vectors\n② Number of all vectors in $W$\n③ Always 0\n④ Always full space dimension","definition_9":"If $\\mathbf{v}_1,\\ldots,\\mathbf{v}_k$ are independent, then $\\mathrm{rank}([\\mathbf{v}_1\\ \\cdots\\ \\mathbf{v}_k])$ equals?\n\n① Less than $k$\n② $k$\n③ 0\n④ Unrelated","trueFalse_0":"More vectors always implies independence.\n\n① True\n② False","trueFalse_1":"$$\\mathrm{rank}(A+B)\\le \\mathrm{rank}(A)+\\mathrm{rank}(B)$.\n\n① True\n② False","trueFalse_2":"If $A$ is invertible $n\\times n$, then $\\mathrm{rank}(A)=n$.\n\n① True\n② False","trueFalse_3":"Independent columns imply $A$ is square.\n\n① True\n② False","trueFalse_4":"$$\\mathrm{rank}(A^{\\mathsf T}A)=\\mathrm{rank}(A)$ over reals.\n\n① True\n② False","trueFalse_5":"Two distinct vectors in $\\mathbb{R}^2$ are always independent.\n\n① True\n② False","trueFalse_6":"Rank cannot exceed number of columns.\n\n① True\n② False","trueFalse_7":"Pivot count equals rank after RREF.\n\n① True\n② False","trueFalse_8":"Row rank equals column rank.\n\n① True\n② False","trueFalse_9":"Any subset of an independent set is independent.\n\n① True\n② False","calc_0":"$$\\mathrm{rank}\\begin{pmatrix}1&2\\\\2&4\\end{pmatrix}$?\n\n① 0\n② 1\n③ 2\n④ 3","calc_1":"$$\\mathrm{rank}\\begin{pmatrix}2&1\\\\4&2\\end{pmatrix}$?\n\n① 0\n② 1\n③ 2\n④ 3","calc_2":"$$\\mathrm{rank}\\begin{pmatrix}1&1&0\\\\0&1&1\\end{pmatrix}$?\n\n① 1\n② 2\n③ 3\n④ 0","calc_3":"Max independent vectors in $\\mathbb{R}^4$?\n\n① 3\n② 4\n③ 5\n④ 2","calc_4":"$$\\mathrm{rank}\\begin{pmatrix}1&3\\\\2&6\\end{pmatrix}$?\n\n① 2\n② 1\n③ 0\n④ 3","calc_5":"$$\\mathrm{rank}\\begin{pmatrix}1&2&3\\\\2&4&6\\end{pmatrix}$?\n\n① 0\n② 1\n③ 2\n④ 
3","calc_6":"$$\\mathrm{rank}\\begin{pmatrix}1&2&3\\\\0&1&1\\end{pmatrix}$?\n\n① 0\n② 1\n③ 2\n④ 3","calc_7":"Max rank of $3\\times5$ real $A$?\n\n① 5\n② 4\n③ 3\n④ 8","calc_8":"$$\\mathrm{rank}\\begin{pmatrix}1&0&1\\\\0&1&1\\end{pmatrix}$? (col3 = col1+col2)\n\n① 3\n② 2\n③ 1\n④ 0","calc_9":"$$\\mathrm{rank}\\begin{pmatrix}1&1&2\\\\0&1&1\\\\1&2&3\\end{pmatrix}$? (row3 = row1+row2)\n\n① 0\n② 1\n③ 2\n④ 3","concept_0":"If $A\\in\\mathbb{R}^{m\\times n}$ has 3 columns that are linearly independent, then $\\mathrm{rank}(A)$ is?\n\n① 3\n② At most 2\n③ 0\n④ Unrelated to column count","concept_1":"If a finite set of vectors is linearly dependent, which is always true?\n\n① All are zero\n② At least one is a linear combination of the others\n③ All are unit\n④ All are pairwise orthogonal","concept_2":"After RREF, the number of pivots equals?\n\n① Column rank\n② Always differs from column rank\n③ Always number of rows\n④ Always 0","concept_3":"Let $W\\subseteq\\mathbb{R}^5$ be a subspace with $\\dim(W)=3$. 
Max number of independent vectors in $W$?\n\n① 2\n② 3\n③ 5\n④ Infinitely many","concept_4":"If $\\mathbf{v}_1,\\mathbf{v}_2,\\mathbf{v}_3$ are independent, then $\\mathbf{v}_1,\\mathbf{v}_2$ are?\n\n① Always dependent\n② Always independent\n③ Always orthogonal\n④ Cannot tell","concept_5":"For $A\\in\\mathbb{R}^{m\\times n}$, a necessary condition for its columns to be linearly independent is?\n\n① $m\\ge n$\n② $m\\le n$\n③ Only $m=n$\n④ $n>m$","concept_6":"For $A\\in\\mathbb{R}^{m\\times n}$, if $\\dim\\{ \\mathbf{x}:A\\mathbf{x}=\\mathbf{0}\\}=k$, then $\\mathrm{rank}(A)$ equals?\n\n① $n-k$\n② $m-k$\n③ $k$\n④ $m+n$","concept_7":"If one column is a linear combination of the others, the column rank is?\n\n① Equal to number of columns\n② Less than number of columns\n③ Always 0\n④ Infinite","concept_8":"A real $2\\times2$ matrix is invertible iff?\n\n① rank 0\n② rank 1\n③ rank 2\n④ rank irrelevant","concept_9":"Which is always true?\n\n① $\\mathrm{rank}(AB)\\ge \\mathrm{rank}(A)$\n② $\\mathrm{rank}(AB)\\le \\mathrm{rank}(A)$\n③ $\\mathrm{rank}(AB)=\\mathrm{rank}(A)$\n④ $AB$ is always full rank","projection_0":"$$\\mathrm{rank}(A^{\\mathsf T})$ equals?\n\n① $\\mathrm{rank}(A)$\n② $\\mathrm{rank}(A)+1$\n③ $0$\n④ $\\det(A)$","projection_1":"Upper bound of $\\mathrm{rank}(AB)$?\n\n① $\\min\\{\\mathrm{rank}A,\\mathrm{rank}B\\}$\n② $\\mathrm{rank}A+\\mathrm{rank}B$\n③ $mn$\n④ Always $\\mathrm{rank}A$","projection_2":"For invertible $P,Q$, $\\mathrm{rank}(PAQ)$ equals?\n\n① $\\mathrm{rank}(A)$\n② 0\n③ $\\mathrm{rank}(P)$\n④ $\\det(A)$","projection_3":"Rank of zero matrix?\n\n① 0\n② 1\n③ #cols\n④ #rows","projection_4":"Rank of triangular with nonzero diagonal?\n\n① 0\n② Number of nonzero diagonal entries\n③ Always 1\n④ Full always","projection_5":"Max rank of $5\\times3$ $A$?\n\n① 5\n② 4\n③ 3\n④ 15","projection_6":"Swapping columns changes rank?\n\n① Preserves\n② +1 always\n③ Always 0\n④ Doubles","projection_7":"Adding a multiple of another column preserves rank?\n\n① Yes\n② 
Always -1\n③ Always 0\n④ Doubles","projection_8":"Rank of $P=\\begin{pmatrix}1&0\\\\0&0\\end{pmatrix}$?\n\n① 0\n② 1\n③ 2\n④ 3","projection_9":"$$\\mathrm{rank}(A)$ vs $\\mathrm{rank}(A^{\\mathsf T}A)$ over reals?\n\n① Equal\n② Always differ\n③ Always larger for $A$\n④ Always 0","scenario_0":"If two distinct columns of a matrix are identical, then?\n\n① Columns are dependent; column rank can be less than #cols\n② Always full column rank\n③ Rank is always 0\n④ Column rank always equals #cols","scenario_1":"If $\\mathbf{a}_3=2\\mathbf{a}_1-\\mathbf{a}_2$ for columns $\\mathbf{a}_1,\\mathbf{a}_2,\\mathbf{a}_3$, then $\\mathrm{rank}([\\mathbf{a}_1\\ \\mathbf{a}_2\\ \\mathbf{a}_3])$ is?\n\n① Always 3\n② At most 2\n③ Always 0\n④ Always 4","scenario_2":"If $A$ is $4\\times4$ with $\\mathrm{rank}(A)=3$, then $\\dim(\\mathrm{Col}(A))$ equals?\n\n① 4\n② 3\n③ 2\n④ 0","scenario_3":"If the rows of $A\\in\\mathbb{R}^{m\\times n}$ are linearly independent as vectors in $\\mathbb{R}^n$, the row rank is?\n\n① $m$\n② Always 0\n③ $n$\n④ Always 1","scenario_4":"For any real $m\\times n$ matrix $A$, $\\mathrm{rank}(A)$ and $\\mathrm{rank}(A^{\\mathsf T})$ are?\n\n① Always equal\n② Always different\n③ Always rank(A) larger\n④ Always 0","scenario_5":"If all $n$ columns of an $m\\times n$ matrix are linearly independent, then necessarily?\n\n① $m\\ge n$\n② $m\\le n$\n③ Only possible if $m=n$\n④ $n>m$","scenario_6":"If $\\mathrm{rank}(A)=r$, then $\\dim(\\mathrm{Col}(A))$ equals?\n\n① $r$\n② $mn$\n③ $n-r$\n④ $m$","scenario_7":"If two rows are proportional (one is a scalar multiple of the other), their contribution to row rank is at most?\n\n① 1\n② Always 2\n③ Always 0\n④ Equal to #rows","scenario_8":"For $T(\\mathbf{x})=A\\mathbf{x}$ with $A\\in\\mathbb{R}^{m\\times n}$, $\\dim(\\mathrm{range}\\,T)$ equals?\n\n① $\\mathrm{rank}(A)$\n② Always $n$\n③ Always $m$\n④ Always 0","scenario_9":"If $A$ is real $n\\times n$ and $\\mathrm{rank}(A)0$, eigenvalues of $S+\\mu I$ (with multiplicity) 
are?\n\n① the same multiset as $S$\n② **each eigenvalue of $S$ shifted by $\\mu$**\n③ all equal to $\\mu$\n④ all zero","hscn_5":"If $A=Q\\Lambda Q^{\\mathsf T}$ with orthogonal $Q$ and diagonal $\\Lambda$, then $A^5=Q\\Lambda_1 Q^{\\mathsf T}$ where $\\Lambda_1$ is?\n\n① **diagonal with each diagonal entry of $\\Lambda$ raised to the 5th power**\n② $5\\Lambda$\n③ $\\Lambda^{-1}$\n④ $I$"},"problemAnswers":{"edef_0":2,"edef_1":2,"edef_2":2,"edef_3":2,"edef_4":1,"edef_5":2,"etf_0":2,"etf_1":1,"etf_2":2,"etf_3":1,"etf_4":1,"etf_5":1,"ecalc_0":1,"ecalc_1":1,"ecalc_2":1,"ecalc_3":1,"ecalc_4":3,"ecalc_5":2,"eprop_0":1,"eprop_1":1,"eprop_2":1,"eprop_3":1,"eprop_4":2,"eprop_5":1,"mcon_0":1,"mcon_1":2,"mcon_2":2,"mcon_3":2,"mcon_4":1,"mcon_5":1,"mcmp_0":2,"mcmp_1":2,"mcmp_2":1,"mcmp_3":2,"mcmp_4":1,"mcmp_5":2,"mdiag_0":1,"mdiag_1":3,"mdiag_2":1,"mdiag_3":1,"mdiag_4":1,"mdiag_5":1,"hproj_0":1,"hproj_1":1,"hproj_2":3,"hproj_3":3,"hproj_4":1,"hproj_5":1,"hpca_0":1,"hpca_1":2,"hpca_2":2,"hpca_3":2,"hpca_4":1,"hpca_5":1,"hscn_0":2,"hscn_1":2,"hscn_2":2,"hscn_3":2,"hscn_4":2,"hscn_5":1},"problemSolutions":{"edef_0":"**1)** Apply the eigenvalue/eigenvector definition and check with a tiny matrix. **2)** Check with a small numeric example. **3)** Answer ②","edef_1":"**1)** Apply the eigenvalue/eigenvector definition and check with a tiny matrix. **2)** Check with a small numeric example. **3)** Answer ②","edef_2":"**1)** Apply the eigenvalue/eigenvector definition and check with a tiny matrix. **2)** Check with a small numeric example. **3)** Answer ②","edef_3":"**1)** Apply the eigenvalue/eigenvector definition and check with a tiny matrix. **2)** Check with a small numeric example. **3)** Answer ②","edef_4":"**1)** Apply the eigenvalue/eigenvector definition and check with a tiny matrix. **2)** Check with a small numeric example. **3)** Answer ①","edef_5":"**1)** Apply the eigenvalue/eigenvector definition and check with a tiny matrix. **2)** Check with a small numeric example. 
**3)** Answer ②","etf_0":"**1)** Apply the eigenvalue/eigenvector definition and check with a tiny matrix. **2)** Check with a small numeric example. **3)** Answer ②","etf_1":"**1)** Apply the eigenvalue/eigenvector definition and check with a tiny matrix. **2)** Check with a small numeric example. **3)** Answer ①","etf_2":"**1)** Apply the eigenvalue/eigenvector definition and check with a tiny matrix. **2)** Check with a small numeric example. **3)** Answer ②","etf_3":"**1)** Apply the eigenvalue/eigenvector definition and check with a tiny matrix. **2)** Check with a small numeric example. **3)** Answer ①","etf_4":"**1)** Apply the eigenvalue/eigenvector definition and check with a tiny matrix. **2)** Check with a small numeric example. **3)** Answer ①","etf_5":"**1)** Apply the eigenvalue/eigenvector definition and check with a tiny matrix. **2)** Check with a small numeric example. **3)** Answer ①","ecalc_0":"**1)** Apply the eigenvalue/eigenvector definition and check with a tiny matrix. **2)** Check with a small numeric example. **3)** Answer ①","ecalc_1":"**1)** Apply the eigenvalue/eigenvector definition and check with a tiny matrix. **2)** Check with a small numeric example. **3)** Answer ①","ecalc_2":"**1)** Apply the eigenvalue/eigenvector definition and check with a tiny matrix. **2)** Check with a small numeric example. **3)** Answer ①","ecalc_3":"**1)** Apply the eigenvalue/eigenvector definition and check with a tiny matrix. **2)** Check with a small numeric example. **3)** Answer ①","ecalc_4":"**1)** Apply the eigenvalue/eigenvector definition and check with a tiny matrix. **2)** Check with a small numeric example. **3)** Answer ③","ecalc_5":"**1)** Apply the eigenvalue/eigenvector definition and check with a tiny matrix. **2)** Check with a small numeric example. **3)** Answer ②","eprop_0":"**1)** Apply the eigenvalue/eigenvector definition and check with a tiny matrix. **2)** Check with a small numeric example. 
**3)** Answer ①","eprop_1":"**1)** Apply the eigenvalue/eigenvector definition and check with a tiny matrix. **2)** Check with a small numeric example. **3)** Answer ①","eprop_2":"**1)** Apply the eigenvalue/eigenvector definition and check with a tiny matrix. **2)** Check with a small numeric example. **3)** Answer ①","eprop_3":"**1)** Apply the eigenvalue/eigenvector definition and check with a tiny matrix. **2)** Check with a small numeric example. **3)** Answer ①","eprop_4":"**1)** Apply the eigenvalue/eigenvector definition and check with a tiny matrix. **2)** Check with a small numeric example. **3)** Answer ②","eprop_5":"**1)** Apply the eigenvalue/eigenvector definition and check with a tiny matrix. **2)** Check with a small numeric example. **3)** Answer ①","mcon_0":"**1)** Apply the eigenvalue/eigenvector definition and check with a tiny matrix. **2)** Check with a small numeric example. **3)** Answer ①","mcon_1":"**1)** Apply the eigenvalue/eigenvector definition and check with a tiny matrix. **2)** Check with a small numeric example. **3)** Answer ②","mcon_2":"**1)** Apply the eigenvalue/eigenvector definition and check with a tiny matrix. **2)** Check with a small numeric example. **3)** Answer ②","mcon_3":"**1)** Apply the eigenvalue/eigenvector definition and check with a tiny matrix. **2)** Check with a small numeric example. **3)** Answer ②","mcon_4":"**1)** Apply the eigenvalue/eigenvector definition and check with a tiny matrix. **2)** Check with a small numeric example. **3)** Answer ①","mcon_5":"**1)** Apply the eigenvalue/eigenvector definition and check with a tiny matrix. **2)** Check with a small numeric example. **3)** Answer ①","mcmp_0":"**1)** Apply the eigenvalue/eigenvector definition and check with a tiny matrix. **2)** Check with a small numeric example. **3)** Answer ②","mcmp_1":"**1)** Apply the eigenvalue/eigenvector definition and check with a tiny matrix. **2)** Check with a small numeric example. 
**3)** Answer ②","mcmp_2":"**1)** Apply the eigenvalue/eigenvector definition and check with a tiny matrix. **2)** Check with a small numeric example. **3)** Answer ①","mcmp_3":"**1)** Apply the eigenvalue/eigenvector definition and check with a tiny matrix. **2)** Check with a small numeric example. **3)** Answer ②","mcmp_4":"**1)** Apply the eigenvalue/eigenvector definition and check with a tiny matrix. **2)** Check with a small numeric example. **3)** Answer ①","mcmp_5":"**1)** Apply the eigenvalue/eigenvector definition and check with a tiny matrix. **2)** Check with a small numeric example. **3)** Answer ②","mdiag_0":"**1)** Apply the eigenvalue/eigenvector definition and check with a tiny matrix. **2)** Check with a small numeric example. **3)** Answer ①","mdiag_1":"**1)** Apply the eigenvalue/eigenvector definition and check with a tiny matrix. **2)** Check with a small numeric example. **3)** Answer ③","mdiag_2":"**1)** Apply the eigenvalue/eigenvector definition and check with a tiny matrix. **2)** Check with a small numeric example. **3)** Answer ①","mdiag_3":"**1)** Apply the eigenvalue/eigenvector definition and check with a tiny matrix. **2)** Check with a small numeric example. **3)** Answer ①","mdiag_4":"**1)** Apply the eigenvalue/eigenvector definition and check with a tiny matrix. **2)** Check with a small numeric example. **3)** Answer ①","mdiag_5":"**1)** Apply the eigenvalue/eigenvector definition and check with a tiny matrix. **2)** Check with a small numeric example. **3)** Answer ①","hproj_0":"**1)** Apply the eigenvalue/eigenvector definition and check with a tiny matrix. **2)** Check with a small numeric example. **3)** Answer ①","hproj_1":"**1)** Apply the eigenvalue/eigenvector definition and check with a tiny matrix. **2)** Check with a small numeric example. **3)** Answer ①","hproj_2":"**1)** Apply the eigenvalue/eigenvector definition and check with a tiny matrix. **2)** Check with a small numeric example. 
**3)** Answer ③","hproj_3":"**1)** Apply the eigenvalue/eigenvector definition and check with a tiny matrix. **2)** Check with a small numeric example. **3)** Answer ③","hproj_4":"**1)** Apply the eigenvalue/eigenvector definition and check with a tiny matrix. **2)** Check with a small numeric example. **3)** Answer ①","hproj_5":"**1)** Apply the eigenvalue/eigenvector definition and check with a tiny matrix. **2)** Check with a small numeric example. **3)** Answer ①","hpca_0":"**1)** Apply the eigenvalue/eigenvector definition and check with a tiny matrix. **2)** Check with a small numeric example. **3)** Answer ①","hpca_1":"**1)** Apply the eigenvalue/eigenvector definition and check with a tiny matrix. **2)** Check with a small numeric example. **3)** Answer ②","hpca_2":"**1)** Apply the eigenvalue/eigenvector definition and check with a tiny matrix. **2)** Check with a small numeric example. **3)** Answer ②","hpca_3":"**1)** Apply the eigenvalue/eigenvector definition and check with a tiny matrix. **2)** Check with a small numeric example. **3)** Answer ②","hpca_4":"**1)** Apply the eigenvalue/eigenvector definition and check with a tiny matrix. **2)** Check with a small numeric example. **3)** Answer ①","hpca_5":"**1)** Apply the eigenvalue/eigenvector definition and check with a tiny matrix. **2)** Check with a small numeric example. **3)** Answer ①","hscn_0":"**1)** Apply the eigenvalue/eigenvector definition and check with a tiny matrix. **2)** Check with a small numeric example. **3)** Answer ②","hscn_1":"**1)** Apply the eigenvalue/eigenvector definition and check with a tiny matrix. **2)** Check with a small numeric example. **3)** Answer ②","hscn_2":"**1)** Apply the eigenvalue/eigenvector definition and check with a tiny matrix. **2)** Check with a small numeric example. **3)** Answer ②","hscn_3":"**1)** Apply the eigenvalue/eigenvector definition and check with a tiny matrix. **2)** Check with a small numeric example. 
**3)** Answer ②","hscn_4":"**1)** Apply the eigenvalue/eigenvector definition and check with a tiny matrix. **2)** Check with a small numeric example. **3)** Answer ②","hscn_5":"**1)** Apply the eigenvalue/eigenvector definition and check with a tiny matrix. **2)** Check with a small numeric example. **3)** Answer ①"},"problemTestCodes":{"edef_0":"answer = 2\nassert answer == 2","edef_1":"answer = 2\nassert answer == 2","edef_2":"answer = 2\nassert answer == 2","edef_3":"answer = 2\nassert answer == 2","edef_4":"answer = 1\nassert answer == 1","edef_5":"answer = 2\nassert answer == 2","etf_0":"answer = 2\nassert answer == 2","etf_1":"answer = 1\nassert answer == 1","etf_2":"answer = 2\nassert answer == 2","etf_3":"answer = 1\nassert answer == 1","etf_4":"answer = 1\nassert answer == 1","etf_5":"answer = 1\nassert answer == 1","ecalc_0":"answer = 1\nassert answer == 1","ecalc_1":"answer = 1\nassert answer == 1","ecalc_2":"answer = 1\nassert answer == 1","ecalc_3":"answer = 1\nassert answer == 1","ecalc_4":"answer = 3\nassert answer == 3","ecalc_5":"answer = 2\nassert answer == 2","eprop_0":"answer = 1\nassert answer == 1","eprop_1":"answer = 1\nassert answer == 1","eprop_2":"answer = 1\nassert answer == 1","eprop_3":"answer = 1\nassert answer == 1","eprop_4":"answer = 2\nassert answer == 2","eprop_5":"answer = 1\nassert answer == 1","mcon_0":"answer = 1\nassert answer == 1","mcon_1":"answer = 2\nassert answer == 2","mcon_2":"answer = 2\nassert answer == 2","mcon_3":"answer = 2\nassert answer == 2","mcon_4":"answer = 1\nassert answer == 1","mcon_5":"answer = 1\nassert answer == 1","mcmp_0":"answer = 2\nassert answer == 2","mcmp_1":"answer = 2\nassert answer == 2","mcmp_2":"answer = 1\nassert answer == 1","mcmp_3":"answer = 2\nassert answer == 2","mcmp_4":"answer = 1\nassert answer == 1","mcmp_5":"answer = 2\nassert answer == 2","mdiag_0":"answer = 1\nassert answer == 1","mdiag_1":"answer = 3\nassert answer == 3","mdiag_2":"answer = 1\nassert answer == 
1","mdiag_3":"answer = 1\nassert answer == 1","mdiag_4":"answer = 1\nassert answer == 1","mdiag_5":"answer = 1\nassert answer == 1","hproj_0":"answer = 1\nassert answer == 1","hproj_1":"answer = 1\nassert answer == 1","hproj_2":"answer = 3\nassert answer == 3","hproj_3":"answer = 3\nassert answer == 3","hproj_4":"answer = 1\nassert answer == 1","hproj_5":"answer = 1\nassert answer == 1","hpca_0":"answer = 1\nassert answer == 1","hpca_1":"answer = 2\nassert answer == 2","hpca_2":"answer = 2\nassert answer == 2","hpca_3":"answer = 2\nassert answer == 2","hpca_4":"answer = 1\nassert answer == 1","hpca_5":"answer = 1\nassert answer == 1","hscn_0":"answer = 2\nassert answer == 2","hscn_1":"answer = 2\nassert answer == 2","hscn_2":"answer = 2\nassert answer == 2","hscn_3":"answer = 2\nassert answer == 2","hscn_4":"answer = 2\nassert answer == 2","hscn_5":"answer = 1\nassert answer == 1"}},"midMathCh10":{"chapter":"Chapter 10","title":"Hessian Matrix: Reading the Curvature of Surfaces","description":"The Hessian matrix is a square matrix of second-order partial derivatives of a scalar function. It encodes how much a surface curves at a point and is used to classify minima, maxima, and saddle points in optimization, and forms the basis of Newton's method and trust-region methods.","sectionTitle":"Hessian Matrix: Reading the Curvature of Surfaces","sectionLabels":{"whatIs":"What it is","whyImportant":"Why it matters","howUsed":"How it's used","problemSolving":"Problem-solving guide"},"whatIs":{"intro":"**What is the Hessian matrix?** — Think of it as a table of numbers that describe how much the surface curves in every direction at the point where you stand. It is a square matrix built from second derivatives of the function, and it is symmetric (same on both sides of the diagonal).","plain":"Imagine walking downhill with your eyes closed. What you feel under your feet—\"this way is steeper down\"—is the first derivative (gradient). 
The sense of \"if I take one more step, will the ground curve up like a bowl or stay flat?\" is the second derivative, i.e. the Hessian. With it you can avoid cliffs and find the true bottom, like the bottom of a bowl.","definition":"More precisely, the Hessian $\\mathbf{H}$ is the table whose $(i,j)$ entry is $H_{ij} = \\frac{\\partial^2 f}{\\partial x_i \\partial x_j}$—the function $f$ differentiated twice, once in each of the $x_i$ and $x_j$ directions. At a point where the gradient is zero, the **eigenvalues** of this matrix are what matter: all positive → **local minimum** (bowl), all negative → **local maximum** (dome), mixed signs → **saddle point** (up in one direction, down in another).","inAI":"In machine learning, training is about finding the \"valley\" where the error is smallest. Moving by the gradient alone can be slow. Using the Hessian to read curvature lets you take **Newton-style** jumps toward the bottom and often reach it in far fewer steps."},"whyImportant":{"fakeBottom":"On the way down you may hit a flat spot where the gradient is zero. That does not mean you have reached the true bottom—it could be a saddle (flat in one place but up one way and down another). The **eigenvalues** of the Hessian tell you whether it is a true minimum or a saddle. When there are many variables (as in AI), avoiding these fake bottoms is crucial.","smartStep":"You want small steps on narrow paths and larger steps on open ground. The Hessian tells you \"how sharply each direction curves,\" so you can set step size (learning rate) well and descend efficiently without wasted moves."},"howUsed":{"newton":"Newton's method takes a large, curvature-aware step: $\\mathbf{x}_{k+1} = \\mathbf{x}_k - \\mathbf{H}^{-1} \\nabla f(\\mathbf{x}_k)$. Here $\\mathbf{x}_k$ is the current point, $\\nabla f(\\mathbf{x}_k)$ is the gradient there, $\\mathbf{H}$ is the Hessian at that point, and $\\mathbf{H}^{-1}$ is its inverse. So you look at both the gradient and the curvature (Hessian) and jump toward the bottom to $\\mathbf{x}_{k+1}$. 
That can reach the answer much faster than small gradient-only steps.","quasiNewton":"When there are many variables, computing the Hessian exactly is costly. In practice, **quasi-Newton** methods (e.g. BFGS) approximate the Hessian from past gradient information instead of computing it fully, and are used more often."},"summary":"The Hessian is a symmetric matrix of second partial derivatives of a scalar function and encodes curvature and the nature of critical points. At a point where the gradient is zero, all positive eigenvalues imply a local minimum, all negative a local maximum, and mixed signs a saddle point. In machine learning it underlies second-order optimization such as Newton's method, trust-region, and quasi-Newton methods.","problemSolving":{"focus":"The table below lists only **formulas and symbol meanings** needed for problem-solving. See the **worked examples** under the table for step-by-step solutions.","examplesHeading":"Worked examples","examplesTable":"$28"},"problemSolvingLabel":"Problem-solving guide","problemSolvingTable":"$29","problemSolvingExample1":"**Example (entry count)**\n\nFor $f(x_1,x_2)$, the Hessian is $2\\times2$, so 4 entries; 3 independent. → **Answer 4** (total) or **3** (independent, by context)","problemSolvingExample2":"**Example (extrema)**\n\nIf eigenvalues are 2 and 5 (both positive), the point is a local minimum. → **Answer 1** (minimum) or the number asked","problemSolvingExample3":"**Example (Newton step)**\n\n$f(x)=x^2$ gives $f'(x)=2x$, $f''(x)=2$. At $x_0=4$, $x_1 = x_0 - f'(x_0)/f''(x_0) = 4 - 8/2 = 0$. 
→ **Answer 0**","visualShort":"Hessian: second partial derivatives → curvature and extrema","visualIntroShort":"The first derivative tells you \"which way is downhill\"; the second (Hessian) tells you \"does the surface curve up like a bowl, or go up in one direction and down in another (saddle point)?\" Follow the animation below.","visualWhyHessian":"The Hessian is the matrix of **second derivatives**, so the \"curvature\" in the figure below is exactly what the Hessian describes.","visualIntro":"The Hessian is the matrix of second partial derivatives of $f$ at $\\mathbf{x}$ and is used to read curvature and to classify minima, maxima, and saddle points.","visualConceptTitle":"Concept structure","visualConceptStep0":"Input: scalar function $f(\\mathbf{x})$, point $\\mathbf{x}$","visualConceptStep1":"Compute $\\frac{\\partial^2 f}{\\partial x_i \\partial x_j}$","visualConceptStep2":"Form Hessian $\\mathbf{H}$ (symmetric)","visualConceptStep3":"Eigenvalues → minimum (all +), maximum (all −), saddle (mixed)","visualFlowTitle":"Learning flow","visualFlowStep0":"Concept: second-derivative matrix","visualFlowStep1":"Intuition: curvature of the surface","visualFlowStep2":"Math: $H_{ij}$, symmetry, eigenvalues","visualFlowStep3":"Use: Newton, extrema, trust region","visualCaption":"Left: bowl (curves upward in every direction) → minimum. Inverted bowl (curves downward in every direction) → maximum. 
Saddle: one direction up, the other down → neither min nor max.","visualStep1":"Input: scalar function $f(\\mathbf{x})$, point $\\mathbf{x}$","visualStep2":"Compute 2nd partials $\\frac{\\partial^2 f}{\\partial x_i \\partial x_j}$","visualStep3":"Form Hessian matrix $\\mathbf{H}$ (symmetric)","visualStepsLabel":"Reading order","visualBowlTitle":"Bowl: curves upward in every direction → minimum","visualSaddleTitle":"Saddle: value ↑ this way, value ↓ that way","visualCurveDown":"↓ curvature","visualFppMin":"f″=2 > 0 → min","visualMinPoint":"Minimum","visualValueUp":"value↑","visualValueDown":"value↓","visualSaddleOrangeGreen":"Orange direction: value goes up · Green direction: value goes down","visualSaddleNeither":"Saddle: neither minimum nor maximum","visualSummary1":"Bowl curves upward in every direction → here is the minimum","visualSummary2":"Inverted bowl curves downward in every direction → here is the maximum","visualSummary3":"Saddle: one direction up, the other down → neither min nor max","problemPromptIntro":"Read each problem and enter the Hessian/extrema-related value.","promptDefinition":"If the statement is **true**, choose **1**; if **false**, choose **0**.","promptDefinitionChoice":"Three options (a), (b), and (c) are listed below. Choose the correct one.","promptElementCount":"For a scalar function $f$ with {n} input variables, how many entries does the Hessian have in total?","promptIndependentCount":"For a symmetric Hessian with $n={n}$ variables, how many independent entries are there?","promptMatrixSize":"For a function of $n={n}$ variables, how many rows (or columns) does the Hessian have?","promptEigenvalueType":"The Hessian eigenvalues are $\\lambda_1={ev1}$ and $\\lambda_2={ev2}$. 
What kind of critical point is it?","promptNewton1D":"For $f(x)={a}x^2{bVal}x+{c}$ with $x_0={x0}$, what is $x_1$ after one Newton step?","promptScalarSecondDeriv":"For $f(x)={a}x^2+bx+c$, what is $f''(x)$?","promptDefault":"Choose the correct option below.","mcDefChoice1":"(a)","mcDefChoice2":"(b)","mcDefChoice3":"(c)","mcDefChoice4":"(d) None of (a)–(c)","mcEigenChoice1":"Local min","mcEigenChoice2":"Local max","mcEigenChoice3":"Saddle","mcEigenChoice4":"None of the above","definitionStatements":{"0":"For a $C^2$ scalar function, the Hessian matrix is symmetric.","1":"At a critical point, if all eigenvalues of the Hessian are positive, the point is a local minimum.","2":"At a critical point, if all eigenvalues of the Hessian are negative, the point is a local maximum.","3":"The $(i,j)$ entry of the Hessian is $\\partial^2 f/\\partial x_i\\partial x_j$.","4":"If $f$ is $C^2$, then $\\partial^2 f/\\partial x_i\\partial x_j = \\partial^2 f/\\partial x_j\\partial x_i$.","5":"For a scalar function of $n$ variables, the Hessian is an $n\\times n$ matrix.","6":"If the Hessian is positive definite, all eigenvalues are positive.","7":"If the Hessian is negative definite, all eigenvalues are negative.","10":"If the eigenvalues of the Hessian are all distinct, the critical point must be a saddle.","11":"Every scalar function has a Hessian equal to the identity matrix.","12":"For a one-variable function $f(x)$, the Hessian is always a $2\\times 2$ matrix.","13":"If some eigenvalue is 0, the critical point must be a local minimum.","14":"If the Hessian is the zero matrix, the critical point must be a local extremum."},"definitionChoiceQuestions":{"0":"(a) For a function of 2 variables, the Hessian has **4** entries in total.\n(b) **9**.\n(c) **6**.","1":"(a) For a symmetric Hessian with 3 variables, the number of independent entries is **9**.\n(b) **6**.\n(c) **3**.","2":"(a) Local minimum\n(b) Local maximum\n(c) Saddle point\n\n(Hint) Eigenvalues are $\\lambda_1=2$, 
$\\lambda_2=-1$.","3":"(a) Local minimum\n(b) Local maximum\n(c) Saddle point\n\n(Hint) Eigenvalues are $\\lambda_1=3$, $\\lambda_2=5$.","4":"(a) Local minimum\n(b) Local maximum\n(c) Saddle point\n\n(Hint) Eigenvalues are $\\lambda_1=-2$, $\\lambda_2=-4$.","5":"(a) $f''(x)=2$\n(b) $f''(x)=0$\n(c) $f''(x)=1$\n\n(Hint) $f(x)=x^2+1$.","6":"(a) Number of rows (or columns) **4**\n(b) **3**\n(c) **2**\n\n(Hint) The Hessian is $2\\times 2$.","7":"(a) **9**\n(b) **3**\n(c) **6**\n\n(Hint) How many rows does the Hessian have when there are 3 variables?"}},"advMathChapters":{"advMath00":{"chapter":"Chapter 00","title":"Advanced Math and AI: Generative Theory and Complex-System Modeling","description":"Advanced math for AI: multidimensional analysis, complex distributions, and deep learning. Curriculum for generative models and reinforcement learning."},"advMath01":{"chapter":"Chapter 01","title":"SVD and Pseudoinverse: Extracting Latent Patterns from Data","description":"SVD and pseudoinverse for latent patterns. PCA, recommendation systems. Advanced math Ch.01."},"advMath02":{"chapter":"Chapter 02","title":"Tensor Algebra and Einstein Notation","description":"Tensor algebra, Einsum, contraction. Neural network and attention notation. Advanced math Ch.02."},"advMath03":{"chapter":"Chapter 03","title":"Lagrange Multipliers and KKT: Constrained Optimization","description":"Lagrange multipliers and KKT for constrained optimization. SVM and constrained RL. Advanced math Ch.03."},"advMath04":{"chapter":"Chapter 04","title":"Markov Chain: State Transitions and Stochastic Processes","description":"Markov chains, transition matrix, stationarity. MCMC and RL basics. Advanced math Ch.04."},"advMath05":{"chapter":"Chapter 05","title":"Monte Carlo Integration: Numerical Approximation","description":"Monte Carlo integration for high-dimensional expectations. Used in RL and Bayesian inference. 
Advanced math Ch.05."},"advMath06":{"chapter":"Chapter 06","title":"MCMC: Sampling from Complex Probability Distributions","description":"MCMC, Gibbs and Metropolis-Hastings. Sampling from complex posteriors. Advanced math Ch.06."},"advMath07":{"chapter":"Chapter 07","title":"EM Algorithm: Inference with Latent Variables","description":"EM algorithm: E-step, M-step, latent variable models. GMM, HMM. Advanced math Ch.07."},"advMath08":{"chapter":"Chapter 08","title":"MAP Estimation: Bayesian Optimization and Regularization","description":"MAP estimation, priors, L1/L2 regularization. Bayesian deep learning. Advanced math Ch.08."},"advMath09":{"chapter":"Chapter 09","title":"Conjugate Prior: Analytical Bayesian Inference","description":"Conjugate priors for tractable posteriors. Beta, Dirichlet. Advanced math Ch.09."},"advMath10":{"chapter":"Chapter 10","title":"JS Divergence and Mutual Information","description":"JS divergence and mutual information. GANs and information theory. Advanced math Ch.10."},"advMath11":{"chapter":"Chapter 11","title":"Variational Inference: Approximating Intractable Probabilities","description":"Variational inference, KL minimization, approximate posteriors. Core of VAE. Advanced math Ch.11."},"advMath12":{"chapter":"Chapter 12","title":"Reparameterization Trick: Differentiating Randomness","description":"Reparameterization trick for differentiable sampling. VAE training. Advanced math Ch.12."},"advMath13":{"chapter":"Chapter 13","title":"Optimal Transport and Wasserstein Distance","description":"Wasserstein distance, Earth Mover. WGAN when supports do not overlap. Advanced math Ch.13."},"advMath14":{"chapter":"Chapter 14","title":"MDP and Bellman Equation: Mathematical Basis of Reinforcement Learning","description":"MDP and Bellman equation. States, actions, rewards, value functions. RL math. 
Advanced math Ch.14."},"advMath15":{"chapter":"Chapter 15","title":"Fourier Transform and Spectral Analysis","description":"Fourier transform and frequency domain. Time series, images, CNN. Advanced math Ch.15."},"advMath16":{"chapter":"Chapter 16","title":"Graph Laplacian: Mathematizing Network Structure","description":"Graph Laplacian, adjacency, degree. GNN, smoothing. Advanced math Ch.16."},"advMath17":{"chapter":"Chapter 17","title":"SDE Basics: Continuous Injection of Noise","description":"SDE and Brownian motion. Diffusion forward process. Advanced math Ch.17."},"advMath18":{"chapter":"Chapter 18","title":"Langevin Dynamics and Score Matching","description":"Langevin dynamics and score matching. Diffusion reverse process. Advanced math Ch.18."},"advMath19":{"chapter":"Chapter 19","title":"Information Geometry and Natural Gradient","description":"Information geometry, Fisher matrix, natural gradient. Optimization on manifolds. Advanced math Ch.19."},"advMath20":{"chapter":"Chapter 20","title":"Advanced Math Summary: Generative Models and Deep Optimization","description":"How SDE, VI, optimal transport, and information geometry appear in VAE, GAN, Diffusion, LLM. 
Advanced math Ch.20."}},"midDlChapters":{"midDl00":{"chapter":"Chapter 00","title":"Intermediate DL: Stable Training and Unstructured Data"},"midDl01":{"chapter":"Chapter 01","title":"Weight Initialization"},"midDl02":{"chapter":"Chapter 02","title":"Optimization: Momentum and Adaptive Learning Rate"},"midDl03":{"chapter":"Chapter 03","title":"Learning Rate Scheduling"},"midDl04":{"chapter":"Chapter 04","title":"Loss Functions: Class Imbalance and Metric Learning"},"midDl05":{"chapter":"Chapter 05","title":"Regularization and Overfitting Prevention"},"midDl06":{"chapter":"Chapter 06","title":"Batch & Layer Normalization"},"midDl07":{"chapter":"Chapter 07","title":"Data Augmentation and Noise Robustness"},"midDl08":{"chapter":"Chapter 08","title":"CNN Basics: Spatial Feature Extraction"},"midDl09":{"chapter":"Chapter 09","title":"Pooling and Multi-Channel"},"midDl10":{"chapter":"Chapter 10","title":"Skip Connection and ResNet"},"midDl11":{"chapter":"Chapter 11","title":"Efficient Convolution: MobileNet"},"midDl12":{"chapter":"Chapter 12","title":"Vision Transfer Learning"},"midDl13":{"chapter":"Chapter 13","title":"Object Detection (YOLO, SSD)"},"midDl14":{"chapter":"Chapter 14","title":"Image Segmentation (U-Net)"},"midDl15":{"chapter":"Chapter 15","title":"NLP Preprocessing and Tokenization"},"midDl16":{"chapter":"Chapter 16","title":"Word Embedding (Word2Vec, GloVe)"},"midDl17":{"chapter":"Chapter 17","title":"1D CNN for Sequence Processing"},"midDl18":{"chapter":"Chapter 18","title":"RNN: Sequential State"},"midDl19":{"chapter":"Chapter 19","title":"LSTM and GRU: Long-Range Dependencies"},"midDl20":{"chapter":"Chapter 20","title":"Encoder-Decoder and Attention"},"midDl21":{"chapter":"Chapter 21","title":"Intermediate DL Summary"}},"midDlCh00":{"description":"Learn what intermediate deep learning covers: stable training and handling images and text, from Ch01 to Ch21.","roadmapTitle":"Intermediate deep learning diagram by chapter","roadmapDescription":"As you 
complete each chapter, the diagram below fills in. This is the structure so far.","roadmapListHeading":"What you learn in Ch01–Ch21","sectionTitle":"What is Intermediate Deep Learning?","paragraphs":{"0":"**Basic deep learning** introduced neurons, layers, and gradients. **Intermediate deep learning** adds **ways to stabilize training** and to handle **images and text**. You will learn **weight initialization**, **optimizers** (momentum, Adam), **learning rate scheduling**, **regularization and overfitting prevention**, **batch normalization**, and more so that training converges well. Then you move on to **convolutional networks (CNN)**, **ResNet**, **transfer learning**, **object detection and segmentation**, **NLP preprocessing and embeddings**, **RNN, LSTM, GRU**, and **encoder-decoder with attention**.","1":"**Images** are pixel grids, so we use **convolutions** to capture spatial patterns, **pooling** to summarize, and **skip connections** to train deep networks stably. **Text** is sequential, so we use **tokenization and embedding**, then **1D convolutions** or **RNN/LSTM** for context, and **attention** to focus on important parts.","2":"**Why training stability matters**: poor initialization can stall learning; a learning rate that is too high causes divergence, too low makes progress slow. **Optimizers** use not only the current gradient but also past updates (momentum) or per-parameter step sizes (Adam) to reach a good minimum faster and more reliably. **Learning rate scheduling** starts with larger steps and then reduces them for fine convergence; **regularization** and **batch normalization** keep activations and gradients at a sensible scale and reduce vanishing or exploding gradients.","3":"In **vision**, local patterns (edges, textures) matter, so **convolutions** are a natural fit. **Pooling** compresses information while making the representation somewhat invariant to small shifts. 
**ResNet**’s skip connections add a block's input to its output so that even very deep networks can be trained without the signal dying out. **Transfer learning** reuses models trained on large datasets and fine-tunes them for your task, which is especially useful when you have limited data.","4":"For **language and sequences**, we split text into **tokens**, turn them into **embeddings**, then use **RNN** or **LSTM/GRU** to carry context over time and predict the next token. **Attention** lets the model learn which parts of the input matter most for each prediction, which is central to translation, summarization, and question answering. After this course, you will understand the basics of image classification, detection, segmentation, and text generation, translation, and summarization.","5":"This course is organized as follows: Ch01–Ch07 cover **training stability** (initialization, optimization, scheduling, loss, regularization, normalization layers, data augmentation); Ch08–Ch14 cover **vision** (CNN, pooling, ResNet, efficient convolutions, transfer learning, detection, segmentation); Ch15–Ch21 cover **language and sequences** (preprocessing, embedding, 1D CNN, RNN, LSTM/GRU, encoder-decoder and attention, and a final summary)."}},"midDlCh01":{"chapter":"Chapter 01","title":"Weight Initialization: A Good Start Is Half the Battle","description":"**Weight initialization** is choosing the initial values of each layer's weights and biases before training. A bad start leads to vanishing or exploding gradients and often makes learning nearly impossible; a good start leads to faster convergence and stable training. This chapter covers the concept of initialization, the intuition and formulas behind Xavier and He initialization, and how they are used in practice.","sectionTitle":"Weight Initialization: A Good Start Is Half the Battle","whatIs":{"0":"**What is weight initialization?** — Each layer of a neural network has **weights $W$** and **biases $b$**. 
Before training, these values are undefined, so we must choose **what numbers to use initially**. This process is called **weight initialization**. Intuitively, it is like choosing where to start a marathon: too far back (weights too small) and progress is slow; too far forward (weights too large) and training can explode and diverge.","1":"**Mathematically** — In one layer the linear combination is $z = W \\mathbf{x} + b$, where $\\mathbf{x}$ is the input vector, $W$ the weight matrix, and $b$ the bias. If all elements of $W$ are zero, every neuron in that layer gives the same output, **symmetry** is preserved, and gradients do not spread properly during backprop. So we usually initialize with **small random numbers**, but the **distribution (scale)** of those numbers matters. We adjust variance using the layer's input dimension $n_{in}$ and output dimension $n_{out}$ so that activations do not grow or shrink too much as they pass through.","2":"**In practice** — With bad initialization, a spam classifier may show almost no decrease in loss or NaNs. In deep CNNs (e.g. medical imaging or fraud detection), skipping Xavier or He often leads to **vanishing gradients** in early layers and training appears stuck. If the scale is too large, gradients explode and training becomes unstable. So in practice **Xavier** (for tanh·sigmoid) or **He** (for ReLU) initialization is standard."},"whyImportant":{"0":"**Vanishing and exploding gradients** — In deeper networks, backpropagated gradients are products of many numbers (chain rule). If weights are too small, this product tends to zero (**vanishing gradient**) and early layers barely update; if too large, it explodes (**exploding gradient**) and you get NaN or Inf. Good initialization keeps **variance** stable across layers so that gradients stay at a reasonable scale even in deep networks.","1":"**Convergence speed** — With proper initialization you start at a **good point** on the loss surface. 
A bad starting point can trap you in poor local minima or make convergence very slow. In practice, initialization is tuned together with learning rate by monitoring validation loss."},"howUsed":{"0":"**Xavier (Glorot) initialization** — So that the variance of $z$ does not depend on input/output size, $W$ is sampled from a **uniform** distribution $U(-\\sqrt{6/(n_{in}+n_{out})},\\ \\sqrt{6/(n_{in}+n_{out})})$ or a **normal** distribution $\\mathcal{N}(0,\\ \\sigma^2)$ with $\\sigma^2 = 2/(n_{in}+n_{out})$. It fits symmetric activations like tanh and sigmoid.","1":"**He initialization** — ReLU zeros out negative inputs, so output variance is about half the input variance. **He** initialization uses $\\sigma^2 = 2/n_{in}$ to compensate. It is the default in modern CNNs and MLPs that use ReLU or Leaky ReLU.","2":"**Practical choice** — Use He for ReLU-family activations and Xavier for tanh·sigmoid. Frameworks (PyTorch, TensorFlow) usually apply one of these by default depending on the layer type."},"problemSolving":{"0":"**Summary** — Weight initialization is the step of choosing initial values for each layer's $W$ and $b$ before training. Zero initialization breaks learning due to symmetry, so we usually use small random numbers and adjust **variance (scale)**. Xavier uses $\\sigma^2 = 2/(n_{in}+n_{out})$ for tanh·sigmoid; He uses $\\sigma^2 = 2/n_{in}$ for ReLU-family. Good initialization reduces vanishing/exploding gradients and speeds convergence.","2":"**Example (definition)**\n\n\"What is the main purpose of weight initialization? ① Match layer scale before training ② Increase learning rate ③ Data augmentation\"\n\nPurpose is to keep activation and gradient scale stable across layers. → **Answer 1**\n\n---\n\n**Example (Xavier vs He)**\n\n\"Common initialization for layers using ReLU? ① Xavier ② He ③ Zero\"\n\nHe is used for ReLU-family. 
→ **Answer 2**\n\n---\n\n**Example (calculation)**\n\nIf $n_{in}+n_{out}=6$, what is the integer value of $6/(n_{in}+n_{out})$ (uniform Xavier ratio)?\n\n$6/6=1$. → **Answer 1**","3":"**Definition example** — \"What is the main purpose of weight initialization? ① Match layer scale before training ② Increase learning rate ③ Data augmentation\" → Purpose is to keep scale stable across layers. **Answer 1**\n\n**True/False example** — \"Weight initialization is the process of setting $W$, $b$ before training.\" → True. **Answer 1**\n\n**Application example** — \"When loss barely decreases in a spam classifier, what to check first? ① Initialization·learning rate ② Data size only ③ Batch size only\" → Check initialization·learning rate first. **Answer 1**\n\n**Choice example** — \"In He initialization, $\\sigma^2$ is? ① $2/n_{in}$ ② $2/(n_{in}+n_{out})$ ③ $1/n_{in}$\" → He uses $\\sigma^2=2/n_{in}$. **Answer 1**\n\n**Concept example** — \"In Xavier, if $n_{in}+n_{out}=6$, the value (integer) of $6/(n_{in}+n_{out})$ is? ① 1 ② 2 ③ 3\" → $6/6=1$. **Answer 1**\n\n**Calc example** — \"If $n_{in}+n_{out}=6$, what is the integer value of $6/(n_{in}+n_{out})$?\" → $6/6=1$. **Answer 1**"},"summary":"Weight initialization is the process of setting the initial values of each layer's weights and biases before training. Initializing everything to zero keeps neurons symmetric and prevents proper learning; random values that are too large or too small lead to exploding or vanishing activations and gradients. Xavier and He initialization adjust variance based on layer size and are widely used: Xavier for symmetric activations like tanh·sigmoid, He for ReLU-family. 
A good start reduces vanishing and exploding gradients and makes convergence faster and more stable.","sectionLabels":{"whatIs":"What it is","whyImportant":"Why it matters","howUsed":"How it is used","summary":"Summary"},"formulaGuide":{"title":"Understanding the formulas","linear":"**Formula $z = W\\mathbf{x}+b$ (linear sum in one layer)**\n\nThis is the pre-activation output of a layer. **$z$** is the raw vector before the activation function. **$W$** scales how each input affects each output; its variance is what initialization controls. **$\\mathbf{x}$** is the layer input (features or the previous layer’s output). **$b$** shifts the baseline and is often set to 0 at init. If $W$ is too large, activations explode; too small, they vanish.","xavierVariance":"**Xavier variance $\\sigma^2 = \\frac{2}{n_{in}+n_{out}}$**\n\nXavier draws weights from a zero-mean normal distribution with this variance. Larger $\\sigma^2$ spreads weights farther from 0. **$n_{in}$** counts inputs to the layer; **$n_{out}$** counts outputs (neurons). As **$n_{in}+n_{out}$** grows, $\\sigma^2$ **shrinks**, so wide layers use smaller weights and the sums stay stable. The numerator **2** comes from balancing the forward-pass condition $\\sigma^2 = 1/n_{in}$ against the backward-pass condition $\\sigma^2 = 1/n_{out}$: replacing the layer size with the average $(n_{in}+n_{out})/2$ gives $2/(n_{in}+n_{out})$.","heVariance":"**He variance $\\sigma^2 = \\frac{2}{n_{in}}$**\n\nHe targets ReLU, which zeros negative inputs so output variance is roughly halved. He uses **only $n_{in}$** (not $n_{out}$). The factor **2** compensates for that halving so activations stay scaled after ReLU.","xavierUniform":"**Xavier uniform on $[-a,\\ a]$, $a = \\sqrt{\\frac{6}{n_{in}+n_{out}}}$**\n\nWeights can be drawn uniformly on $[-a,a]$; a uniform distribution on $[-a,a]$ has variance $a^2/3 = 2/(n_{in}+n_{out})$, the same variance as the normal Xavier form. When $n_{in}+n_{out}$ is given, compute $6/(n_{in}+n_{out})$; for integer-friendly drills, e.g. 
$n_{in}+n_{out}=6$ gives $6/6=1$."},"visual":"Visualization of how weight initialization affects gradient flow.","problemSolvingLabel":"Problem-solving guide","practiceProblemsTitle":"Practice problems","practiceProblemsIntro":"Below are sample questions to check your understanding. Choose your answer using the buttons below.","practiceProblemsInstruction":"Read each problem and select the correct option.","midDlCh01VisualIntro":"Weight initialization is the first step of training: set $W$ and $b$ for each layer at an appropriate scale so that variance is preserved during forward and back propagation.","midDlCh01VisualStep0":"① Initialize: set $W$, $b$ for each layer (e.g. Xavier/He)","midDlCh01VisualStep1":"② Forward: input → linear sum $z$ → activation $a$ → next layer","midDlCh01VisualStep2":"③ Loss then backprop: gradients pass through layers","midDlCh01VisualStep3":"④ Update: update $W$, $b$ from gradients. Good init keeps gradient scale stable","midDlCh01VisualConceptTitle":"Concept: initialize → forward → loss → backprop → update","midDlCh01VisualFlowTitle":"Training flow: initialize so input·weight·output scale match per layer","midDlCh01VisualModelTitle":"Layer: set variance of $W$ so variance of $z=Wx+b$ stays similar to input variance","midDlCh01VisualScaleTitle":"Effect of initialization scale","midDlCh01VisualScaleSmall":"W too small → vanishing gradient","midDlCh01VisualScaleLarge":"W too large → exploding gradient","midDlCh01VisualScaleGood":"Reasonable W → variance preserved","midDlCh01VisualSegInput":"Input","midDlCh01VisualSegLayer1":"Layer 1","midDlCh01VisualSegLayer2":"Layer 2","midDlCh01VisualSegLayer3":"Layer 3","midDlCh01VisualSegOutput":"Output","midDlCh01VisualRowLabelVanishing":"Vanishing","midDlCh01VisualRowLabelStable":"Stable","midDlCh01VisualRowLabelExploding":"Exploding","midDlCh01VisualScaleCaption":"Good initialization sets W, b scale so that **variance is preserved** across layers.","midDlCh01VisualBannerShort":"A good start is 
half the battle","midDlCh01VisualBannerSub":"Proper initialization → fast convergence · stable training","problems":{"definition_0":"What is the main purpose of weight initialization? ① Match layer scale before training ② Increase learning rate ③ Data augmentation","definition_1":"The process of setting $W$, $b$ for each layer before training is called? ① Weight initialization ② Gradient descent ③ Regularization","definition_2":"Common initialization for ReLU-family activations? ① Xavier ② He ③ Zero initialization","definition_3":"Common initialization for tanh·sigmoid? ① Xavier ② He ③ Zero initialization","definition_4":"When gradients approach 0 and early layers barely update, this is called? ① Vanishing gradient ② Exploding gradient ③ Overfitting","definition_5":"When weights are too large and gradients explode, this is called? ① Vanishing gradient ② Exploding gradient ③ Underfitting","definition_6":"In Xavier initialization, how is variance set using $n_{in}$, $n_{out}$? ① $2/(n_{in}+n_{out})$ ② $2/n_{in}$ ③ $1/n_{in}$","definition_7":"In He initialization, variance is? ① $2/(n_{in}+n_{out})$ ② $2/n_{in}$ ③ $1/(n_{in}+n_{out})$","definition_8":"Main reason not to initialize all weights to 0? ① Symmetry: neurons give same output, learning fails ② Slower compute ③ Memory shortage","definition_9":"In one layer $z = W\\mathbf{x}+b$, if $W$ is too small? ① Tends to vanishing gradient ② Exploding gradient ③ No effect","trueFalse_0":"Weight initialization is the process of setting $W$, $b$ before training. True: 1, False: 0.","trueFalse_1":"Xavier initialization is used only for ReLU. True: 1, False: 0.","trueFalse_2":"He initialization is suitable for ReLU-family activations. True: 1, False: 0.","trueFalse_3":"Good initialization keeps variance stable across layers. True: 1, False: 0.","trueFalse_4":"Initializing all weights to 0 is recommended. True: 1, False: 0.","trueFalse_5":"Vanishing gradient occurs when weights are too large. 
True: 1, False: 0.","trueFalse_6":"Exploding gradient can occur when weights are too large. True: 1, False: 0.","trueFalse_7":"Initialization affects convergence speed. True: 1, False: 0.","trueFalse_8":"In Xavier, $\\sigma^2 = 2/(n_{in}+n_{out})$. True: 1, False: 0.","trueFalse_9":"In He, $\\sigma^2 = 2/n_{in}$. True: 1, False: 0.","scenario_0":"In a spam classifier, when loss barely decreases, what to suspect first? ① Initialization·learning rate ② Data size only ③ Batch size only","scenario_1":"In a deep CNN, when early layers barely update, the most common cause is? ① Vanishing gradient ② Overfitting ③ Lack of data","scenario_2":"When implementing an MLP with ReLU, a good default initialization is? ① Xavier ② He ③ Zero","scenario_3":"For a layer with tanh, initialization with variance $2/(n_{in}+n_{out})$ is? ① Xavier ② He ③ Neither","scenario_4":"When you get NaN during training, from an initialization perspective you suspect? ① Exploding gradient (scale too large) ② Data only ③ Batch size only","scenario_5":"Why try changing initialization when a medical image model converges very slowly? ① Bad starting point can slow convergence ② Only lack of data ③ Only learning rate matters","scenario_6":"PyTorch default Linear layer initialization is closest to? ① Xavier/He family ② Always zero ③ Random only","scenario_7":"The goal of keeping activation variance stable across layers is called? ① Variance preservation (scale matching) ② Regularization ③ Dropout","scenario_8":"Why care about initialization when a fraud detection model is deep? ① Avoid vanishing·exploding gradients ② Data only matters ③ Batch size only matters","scenario_9":"For a layer with $n_{in}=8$, $n_{out}=8$ using Xavier, $n_{in}+n_{out}$ is? ① 16 ② 8 ③ 64","choice_0":"Why not initialize weights to 0? ① Symmetry prevents proper learning ② Save memory ③ Slower speed","choice_1":"In He initialization, $\\sigma^2$ is? 
① $2/n_{in}$ ② $2/(n_{in}+n_{out})$ ③ $1/n_{in}$","choice_2":"A good way to mitigate vanishing gradient? ① Proper initialization (e.g. Xavier/He) ② Only increase learning rate ③ Only increase batch size","choice_3":"Activation that fits Xavier initialization? ① tanh·sigmoid ② ReLU only ③ None","choice_4":"In one layer $z=W\\mathbf{x}+b$, if $W$ scale is too large? ① Exploding gradient possible ② Only vanishing ③ No effect","choice_5":"Effect of initialization on training? ① Convergence speed·stability ② Only data amount ③ Only loss shape","choice_6":"In He for ReLU layers, variance is ___ to input dimension $n_{in}$? ① Inversely proportional ($2/n_{in}$) ② Proportional ③ Unrelated","choice_7":"When gradients approach 0 during backprop, this is? ① Vanishing gradient ② Exploding gradient ③ Regularization","choice_8":"In Xavier, if $n_{in}=4$, $n_{out}=6$, then $n_{in}+n_{out}$ is? ① 10 ② 24 ③ 2","choice_9":"Closest to the goal of good initialization? ① Preserve variance across layers ② Weights to 0 ③ Only increase learning rate","concept_0":"In $z=W\\mathbf{x}+b$, if variance of $W$ is too large, during backprop the gradient? ① Can explode ② Always 0 ③ Unchanged","concept_1":"In Xavier, uniform range is $[-a,a]$ with $a=\\sqrt{6/(n_{in}+n_{out})}$. If $n_{in}+n_{out}=12$, $6/(n_{in}+n_{out})$ as integer? ① 0 ② 1 ③ 2","concept_2":"Main reason to use He initialization? ① ReLU zeros negatives, reducing variance ② Faster than Xavier ③ Always better","concept_3":"Why is initialization more important in deep networks? ① Gradients are multiplied through many layers ② Only data matters ③ Only first layer matters","concept_4":"How is bias $b$ usually initialized? ① To 0 ② To 1 ③ Randomly","concept_5":"Why use He-like initialization with Leaky ReLU? ① ReLU family, similar variance behavior ② Only Xavier ③ Zero init","concept_6":"When learning rate is fine but loss barely decreases? 
① Suspect initialization or structure (vanishing) ② Data only ③ Batch only","concept_7":"What do Xavier and He have in common? ① Set variance by layer size ② Both zero init ③ ReLU only","concept_8":"In backprop, multiplying gradients (chain rule): 0.5^10 ≈ 0.001. Similar phenomenon? ① Vanishing gradient ② Exploding gradient ③ Regularization","concept_9":"Default initialization for ReLU CNN in practice? ① He family ② Zero ③ Xavier only","calc_0":"If $n_{in}+n_{out}=6$, what is the integer value of $6/(n_{in}+n_{out})$ (uniform Xavier ratio)?","calc_1":"For He initialization $\\sigma^2=2/n_{in}$ with $n_{in}=8$, what is the denominator $n_{in}$ (integer)?","calc_2":"For Xavier variance $\\sigma^2=2/(n_{in}+n_{out})$ with $n_{in}=2$, $n_{out}=8$, what is the denominator $n_{in}+n_{out}$ (integer)?","calc_3":"For He initialization, $\\sigma^2=2/n_{in}$ uses $n_{in}$ as the denominator. If $n_{in}=32$, what is that denominator (integer)?","calc_4":"For Xavier with $n_{in}=5$, $n_{out}=5$, what is $n_{in}+n_{out}$ (integer)?","calc_5":"If $n_{in}+n_{out}=3$, what is the integer value of $6/(n_{in}+n_{out})$?","calc_6":"For Xavier variance denominator $n_{in}+n_{out}$ with $n_{in}=1$, $n_{out}=7$, what is the value (integer)?","calc_7":"For He initialization $\\sigma^2=2/n_{in}$ with $n_{in}=20$, what is the denominator (integer)?","calc_8":"For Xavier with $n_{in}=4$, $n_{out}=12$, what is $n_{in}+n_{out}$ (integer)?","calc_9":"If $n_{in}+n_{out}=2$, what is the integer value of 
$6/(n_{in}+n_{out})$?"},"problemAnswers":{"definition_0":1,"definition_1":1,"definition_2":2,"definition_3":1,"definition_4":1,"definition_5":2,"definition_6":1,"definition_7":2,"definition_8":1,"definition_9":1,"trueFalse_0":1,"trueFalse_1":0,"trueFalse_2":1,"trueFalse_3":1,"trueFalse_4":0,"trueFalse_5":0,"trueFalse_6":1,"trueFalse_7":1,"trueFalse_8":1,"trueFalse_9":1,"scenario_0":1,"scenario_1":1,"scenario_2":2,"scenario_3":1,"scenario_4":1,"scenario_5":1,"scenario_6":1,"scenario_7":1,"scenario_8":1,"scenario_9":1,"choice_0":1,"choice_1":1,"choice_2":1,"choice_3":1,"choice_4":1,"choice_5":1,"choice_6":1,"choice_7":1,"choice_8":1,"choice_9":1,"concept_0":1,"concept_1":1,"concept_2":1,"concept_3":1,"concept_4":1,"concept_5":1,"concept_6":1,"concept_7":1,"concept_8":1,"concept_9":1,"calc_0":1,"calc_1":8,"calc_2":10,"calc_3":32,"calc_4":10,"calc_5":2,"calc_6":8,"calc_7":20,"calc_8":16,"calc_9":3},"problemSolutions":{"definition_0":"**Concept**: The main goal of weight initialization is to set each layer’s $W$ and $b$ to a **reasonable scale** before training so activations and gradients stay stable in the forward and backward passes. Learning rate and data augmentation are different topics. **Answer: ①**.","definition_1":"The process of choosing initial $W$ and $b$ for each layer **before** training is **weight initialization**. Gradient descent updates weights **during** training; regularization fights overfitting. **Answer: ①**.","definition_2":"For ReLU-family activations, **He initialization** is standard; it uses $\\sigma^2=2/n_{in}$ to match ReLU’s variance behavior. **Answer: ②**.","definition_3":"For tanh/sigmoid-like symmetric activations, **Xavier (Glorot)** fits well, with $\\sigma^2=2/(n_{in}+n_{out})$. **Answer: ①**.","definition_4":"If weights are **too small**, gradients shrink through layers toward zero (**vanishing gradient**). **Answer: ①**.","definition_5":"If weights are **too large**, gradients can explode through layers (**exploding gradient**). 
**Answer: ②**.","definition_6":"Xavier uses $\\sigma^2 = 2/(n_{in}+n_{out})$. **Answer: ①**.","definition_7":"He uses $\\sigma^2 = 2/n_{in}$. **Answer: ②**.","definition_8":"All-zero weights fail to break symmetry: every neuron in a layer computes the same output and receives the same gradient, so learning stalls. **Answer: ①**.","definition_9":"If $W$ is too small, $z$ and the gradients stay small, which leads toward **vanishing gradient**. **Answer: ①**.","trueFalse_0":"Weight initialization sets $W$ and $b$ **before** training starts. **Answer: 1 (true).**","trueFalse_1":"Xavier targets tanh/sigmoid; ReLU typically uses **He**, so “Xavier is only for ReLU” is false. **Answer: 0.**","trueFalse_2":"He initialization suits ReLU and Leaky ReLU. **Answer: 1.**","trueFalse_3":"Good init keeps variance stable across layers. **Answer: 1.**","trueFalse_4":"All-zero weights are **not** recommended. **Answer: 0.**","trueFalse_5":"Vanishing gradient is tied to weights that are **too small**, not too large. **Answer: 0.**","trueFalse_6":"Exploding gradients can happen when weights are too large. **Answer: 1.**","trueFalse_7":"Initialization affects convergence speed. **Answer: 1.**","trueFalse_8":"Xavier variance $\\sigma^2 = 2/(n_{in}+n_{out})$ is correct. **Answer: 1.**","trueFalse_9":"He variance $\\sigma^2 = 2/n_{in}$ is correct. **Answer: 1.**","scenario_0":"If loss barely moves, check **initialization** and **learning rate** first. **Answer: 1.**","scenario_1":"Early layers not updating in a deep CNN often indicates **vanishing gradients**; try He/Xavier and related fixes. **Answer: 1.**","scenario_2":"For an MLP with ReLU, **He** is the usual default. **Answer: 2.**","scenario_3":"Variance $2/(n_{in}+n_{out})$ for tanh matches **Xavier**. **Answer: 1.**","scenario_4":"NaNs often suggest **exploding** gradients or a weight scale or learning rate that is too large. **Answer: 1.**","scenario_5":"Very slow convergence can come from a **bad starting point**; better init can help. 
**Answer: 1.**","scenario_6":"PyTorch `Linear` layers typically use **Xavier/He-style** defaults. **Answer: 1.**","scenario_7":"Keeping activation variance stable across layers is a core init goal (**variance preservation**). **Answer: 1.**","scenario_8":"Deep models need careful init to avoid **vanishing/exploding** gradients. **Answer: 1.**","scenario_9":"$n_{in}+n_{out}=8+8=16$, which is option ①. **Answer: 1.**","choice_0":"Zero weights leave the neurons **symmetric**: they all behave identically. **Answer: 1.**","choice_1":"He uses $\\sigma^2 = 2/n_{in}$. **Answer: 1.**","choice_2":"Proper **initialization** (Xavier/He) helps mitigate vanishing gradients. **Answer: 1.**","choice_3":"**Xavier** pairs with tanh/sigmoid. **Answer: 1.**","choice_4":"If $W$ is too large, **exploding** gradients are possible. **Answer: 1.**","choice_5":"Initialization affects **convergence speed and stability**. **Answer: 1.**","choice_6":"In He, variance scales **inversely** with $n_{in}$ via $2/n_{in}$. **Answer: 1.**","choice_7":"Gradients near zero in backprop indicate **vanishing gradient**. **Answer: 1.**","choice_8":"$n_{in}+n_{out}=4+6=10$. **Answer: 1** (option ① = 10).","choice_9":"A good init aims to **preserve variance** across layers. **Answer: 1.**","concept_0":"If variance of $W$ is too large, gradients can **explode**. **Answer: 1.**","concept_1":"With $n_{in}+n_{out}=12$, we have $6/(n_{in}+n_{out})=6/12=0.5$; truncating to an integer gives 0, which is option ①. **Answer: 1.**","concept_2":"ReLU zeros roughly half of its inputs on average, which halves the variance; **He** compensates with $\\sigma^2=2/n_{in}$. **Answer: 1.**","concept_3":"In deep nets, gradients multiply through many layers (chain rule), so a bad init compounds with depth. **Answer: 1.**","concept_4":"Bias $b$ is usually initialized to **0**. **Answer: 1.**","concept_5":"Leaky ReLU belongs to the ReLU family and behaves similarly; **He-like** init is common. 
**Answer: 1.**","concept_6":"If lr looks fine but loss stalls, suspect **init** or **vanishing** structure. **Answer: 1.**","concept_7":"**Xavier** and **He** both set variance using layer dimensions. **Answer: 1.**","concept_8":"Multiplying many small factors toward zero mirrors **vanishing** gradients. **Answer: 1.**","concept_9":"ReLU CNNs typically default to **He-style** init. **Answer: 1.**","calc_0":"$n_{in}+n_{out}=6 \\Rightarrow 6/6=1$. **Answer: 1.**","calc_1":"He uses denominator $n_{in}=8$. **Answer: 8.**","calc_2":"$n_{in}+n_{out}=2+8=10$. **Answer: 10.**","calc_3":"Denominator $n_{in}=32$. **Answer: 32.**","calc_4":"$5+5=10$. **Answer: 10.**","calc_5":"$6/3=2$. **Answer: 2.**","calc_6":"$1+7=8$. **Answer: 8.**","calc_7":"Denominator $n_{in}=20$. **Answer: 20.**","calc_8":"$4+12=16$. **Answer: 16.**","calc_9":"$6/2=3$. **Answer: 3.**"},"problemTestCodes":{"definition_0":"answer = 1\nassert answer == 1","definition_1":"answer = 1\nassert answer == 1","definition_2":"answer = 2\nassert answer == 2","definition_3":"answer = 1\nassert answer == 1","definition_4":"answer = 1\nassert answer == 1","definition_5":"answer = 2\nassert answer == 2","definition_6":"answer = 1\nassert answer == 1","definition_7":"answer = 2\nassert answer == 2","definition_8":"answer = 1\nassert answer == 1","definition_9":"answer = 1\nassert answer == 1","trueFalse_0":"answer = 1\nassert answer == 1","trueFalse_1":"answer = 0\nassert answer == 0","trueFalse_2":"answer = 1\nassert answer == 1","trueFalse_3":"answer = 1\nassert answer == 1","trueFalse_4":"answer = 0\nassert answer == 0","trueFalse_5":"answer = 0\nassert answer == 0","trueFalse_6":"answer = 1\nassert answer == 1","trueFalse_7":"answer = 1\nassert answer == 1","trueFalse_8":"answer = 1\nassert answer == 1","trueFalse_9":"answer = 1\nassert answer == 1","scenario_0":"answer = 1\nassert answer == 1","scenario_1":"answer = 1\nassert answer == 1","scenario_2":"answer = 2\nassert answer == 2","scenario_3":"answer = 
1\nassert answer == 1","scenario_4":"answer = 1\nassert answer == 1","scenario_5":"answer = 1\nassert answer == 1","scenario_6":"answer = 1\nassert answer == 1","scenario_7":"answer = 1\nassert answer == 1","scenario_8":"answer = 1\nassert answer == 1","scenario_9":"answer = 1\nassert answer == 1","choice_0":"answer = 1\nassert answer == 1","choice_1":"answer = 1\nassert answer == 1","choice_2":"answer = 1\nassert answer == 1","choice_3":"answer = 1\nassert answer == 1","choice_4":"answer = 1\nassert answer == 1","choice_5":"answer = 1\nassert answer == 1","choice_6":"answer = 1\nassert answer == 1","choice_7":"answer = 1\nassert answer == 1","choice_8":"answer = 1\nassert answer == 1","choice_9":"answer = 1\nassert answer == 1","concept_0":"answer = 1\nassert answer == 1","concept_1":"answer = 1\nassert answer == 1","concept_2":"answer = 1\nassert answer == 1","concept_3":"answer = 1\nassert answer == 1","concept_4":"answer = 1\nassert answer == 1","concept_5":"answer = 1\nassert answer == 1","concept_6":"answer = 1\nassert answer == 1","concept_7":"answer = 1\nassert answer == 1","concept_8":"answer = 1\nassert answer == 1","concept_9":"answer = 1\nassert answer == 1","calc_0":"s = 6\nanswer = 6 // s\nassert answer == 1","calc_1":"n_in = 8\nanswer = n_in\nassert answer == 8","calc_2":"n_in, n_out = 2, 8\nanswer = n_in + n_out\nassert answer == 10","calc_3":"n_in = 32\nanswer = n_in\nassert answer == 32","calc_4":"n_in, n_out = 5, 5\nanswer = n_in + n_out\nassert answer == 10","calc_5":"s = 3\nanswer = 6 // s\nassert answer == 2","calc_6":"n_in, n_out = 1, 7\nanswer = n_in + n_out\nassert answer == 8","calc_7":"n_in = 20\nanswer = n_in\nassert answer == 20","calc_8":"n_in, n_out = 4, 12\nanswer = n_in + n_out\nassert answer == 16","calc_9":"s = 2\nanswer = 6 // s\nassert answer == 
3"},"problemDifficulty":{"definition_0":"easy","definition_1":"easy","definition_2":"easy","definition_3":"easy","definition_4":"easy","definition_5":"easy","definition_6":"easy","definition_7":"easy","definition_8":"easy","definition_9":"easy","trueFalse_0":"easy","trueFalse_1":"easy","trueFalse_2":"easy","trueFalse_3":"easy","trueFalse_4":"easy","trueFalse_5":"easy","trueFalse_6":"easy","trueFalse_7":"easy","trueFalse_8":"easy","trueFalse_9":"easy","scenario_0":"medium","scenario_1":"medium","scenario_2":"medium","scenario_3":"medium","scenario_4":"medium","scenario_5":"medium","scenario_6":"medium","scenario_7":"medium","scenario_8":"medium","scenario_9":"medium","choice_0":"medium","choice_1":"medium","choice_2":"medium","choice_3":"medium","choice_4":"medium","choice_5":"medium","choice_6":"medium","choice_7":"medium","choice_8":"medium","choice_9":"medium","concept_0":"hard","concept_1":"hard","concept_2":"hard","concept_3":"hard","concept_4":"hard","concept_5":"hard","concept_6":"hard","concept_7":"hard","concept_8":"hard","concept_9":"hard","calc_0":"hard","calc_1":"hard","calc_2":"hard","calc_3":"hard","calc_4":"hard","calc_5":"hard","calc_6":"hard","calc_7":"hard","calc_8":"hard","calc_9":"hard"},"problemOrder":["definition_0","definition_1","definition_2","definition_3","definition_4","definition_5","definition_6","definition_7","definition_8","definition_9","trueFalse_0","trueFalse_1","trueFalse_2","trueFalse_3","trueFalse_4","trueFalse_5","trueFalse_6","trueFalse_7","trueFalse_8","trueFalse_9","scenario_0","scenario_1","scenario_2","scenario_3","scenario_4","scenario_5","scenario_6","scenario_7","scenario_8","scenario_9","choice_0","choice_1","choice_2","choice_3","choice_4","choice_5","choice_6","choice_7","choice_8","choice_9","concept_0","concept_1","concept_2","concept_3","concept_4","concept_5","concept_6","concept_7","concept_8","concept_9","calc_0","calc_1","calc_2","calc_3","calc_4","calc_5","calc_6","calc_7","calc_8","calc_9"]},"midDlCh02":{"ch
apter":"Chapter 02","title":"Optimization Algorithms: Tuning Speed and Direction Wisely","description":"Training an AI model is like **wearing a blindfold while hiking a huge mountain range toward the deepest valley (the minimum error)**. **Optimization** is the **navigation** that picks **which direction** and **how large a step** to take from where you stand.\n\nAfter Ch.01 set the starting point, this chapter teaches skills to descend safely and quickly: walking step by step with **SGD**, sledding with **Momentum**, and self-driving with **Adam** that adapts its stride to the terrain. We unpack the core optimizers you will use every day—intuitively and clearly.","sectionTitle":"Optimization Algorithms: Tuning Speed and Direction Wisely","whatIs":{"0":"**1. Gradient descent & SGD: walk against the uphill gradient**\n\n**Concept:** The reliable way downhill is to feel the slope under your feet and take steps along the **steepest descent** — that is the heart of gradient descent.\n\n**Intuition:** Picture descending Hallasan in thick fog. If your **stride (learning rate)** is too wide, you may fall off a cliff or bounce onto the opposite ridge; if it is too narrow, sunset may arrive before you reach the valley.\n\n**Core equation:**\n$\\theta \\leftarrow \\theta - \\eta \\nabla L(\\theta)$\n- **$\\theta$**: where you stand (model weights)\n- **$\\eta$**: step size — the **learning rate** (often 0.01, 0.001, …)\n- **$\\nabla L$**: the slope (gradient) at the current point\n\n**Practical tip:** Scanning the full map every time is slow, so we usually follow **stochastic gradient descent (SGD)** — pick a **minibatch**, estimate $\\hat{g}$, and step quickly.","1":"**2. Momentum: a bowling ball on ice**\n\n**Concept:** Plain SGD only looks at the local slope, so in a bumpy narrow valley it **zig-zags** and wastes time. 
**Momentum** adds **inertia from past moves**.\n\n**Intuition:** A paper cup turns at every pebble; a **heavy bowling ball** keeps rolling through small bumps. Momentum gives the optimizer that kind of “mass.”\n\n**Core updates:**\n$v \\leftarrow \\beta v + (1-\\beta)g$\n$\\theta \\leftarrow \\theta - \\eta v$\n- **$v$**: velocity (accumulated direction)\n- **$\\beta$**: how much past motion to keep (often **0.9** — keep ~90% of the old velocity)\n- **$g$**: gradient at the current point\n\n**Extra:** **Nesterov** momentum evaluates $g$ at a **lookahead** point along $v$.","2":"**3. Adaptive optimizers (AdaGrad, RMSProp, Adam): brake each wheel separately**\n\n**Concept:** Some parameters are almost there; others still have far to go. Instead of one global $\\eta$, **adaptive** methods **rescale each coordinate** from gradient statistics.\n\n**How they evolved:**\n- **AdaGrad:** “Paths we walked a lot — shrink the step there.” It accumulates squared gradients so busy coordinates slow down.\n- **RMSProp:** Fixes AdaGrad’s issue (steps can shrink to ~0 forever) by **forgetting** very old history with an **EMA**.\n- **Adam:** Combines **momentum (direction)** and **RMSProp-like scaling** — a default choice in modern deep learning.\n\n**Practical tip:** Papers often use **AdamW**, which **decouples weight decay** from the loss for better regularization.","3":"**4. Three goals: stability, speed, generalization**\n\n**Concept:** Choosing an optimizer is not only about **reaching the bottom fast**. 
**Which valley** you land in changes **test performance**.\n\n**Intuition:** A bullet train (**Adam**) may arrive first; a local train (**SGD+momentum**) can discover **quieter minima** with better generalization — both stories appear in practice.\n\n**Practical tip:** Pair optimizers with **warmup** (gentle strides early) and **learning-rate schedulers** (smaller steps near the end)."},"whyImportant":{"0":"**Time and money**\n\nIf the learning rate is too large, optimization may **diverge**; if too small, a one-hour run can stretch to a week. Good optimizer + LR settings are the “magic” that saves **GPU bills** and **late nights**.","1":"**Generalization — your “test score”**\n\nWith the same data, different optimizers can yield **different quality**. Which minimum you settle in changes **test accuracy**. Strong engineers match the tool to the problem.","2":"**First thermometer when the model “gets sick”**\n\nIf loss won’t drop or **NaNs** appear, suspect **learning rate** and **optimizer** first. Knowing this lets you debug calmly instead of panicking."},"howUsed":{"0":"**① Keep a lab notebook — change one knob at a time**\n\nAPIs differ by library, but the workflow is similar: record **learning rate, batch size, optimizer, and random seed**. When training misbehaves, change **one setting at a time** to isolate the cause. Jittery loss → revisit batch, LR, and momentum; updates that fade away after many epochs → consider moving from AdaGrad-style accumulators to **RMSProp / Adam**. 
Practice pairing **symptoms with levers**.","1":"**② Optimizer cheat sheet**\n\n| Situation | Pick | Why |\n| :--- | :--- | :--- |\n| **Need a quick baseline** | `Adam` or `AdamW` | Adaptive steps — less sensitive to initial LR |\n| **NLP / transformers** | `AdamW` | Often very stable on sparse, structured objectives |\n| **Push CNN accuracy to the limit** | `SGD + Momentum` | Harder to tune but can **generalize** better at the sweet spot |","2":"**③ Monitoring — don’t look away**\n\nLaunch isn’t the end of the flight. Watch the **loss curve** live (TensorBoard, Weights & Biases). If it **saws** like a gear, it may be time to **lower the learning rate**."},"problemSolving":{"0":"Optimization is the process of deciding how to update parameters $\\theta$ using gradients from backpropagation to reduce the loss $L(\\theta)$. Basic **SGD** takes a step $\\theta \\leftarrow \\theta - \\eta \\hat{g}$ with a minibatch gradient $\\hat{g}$, and the **learning rate $\\eta$** sets the step size. **Momentum** accumulates velocity $v$ to reduce zig-zag in narrow valleys, while **Adam/AdamW** adapt per-coordinate steps using first and second moments. When loss oscillates or diverges, check **learning rate, batch size, and LR scheduler** together—not only the optimizer name.","2":"**Example (definition)**\n\n\"What is the core role of Momentum? ① It sets LR to 0 ② It accumulates past directions to reduce oscillation ③ It skips backprop\"\n\nMomentum keeps directional inertia through velocity $v$. → **Answer 2**\n\n---\n\n**Example (scenario)**\n\n\"When training loss oscillates heavily, what should be checked first? ① Learning rate, momentum, batch size ② Zero training data ③ Delete all layers\"\n\nOscillation is tied to step size and gradient noise, so check ① first. → **Answer 1**\n\n---\n\n**Example (calculation)**\n\nIf $\\eta=0.001$ and $g=20$, what is the SGD update magnitude $\\eta g$?\n\n$0.001 \\times 20 = 0.02$. 
→ **Answer 0.02**","3":"**Definition example** — \"Which does Adam use together? ① 1st and 2nd moments ② Batch index only ③ Dropout mask only\" → Adam uses first and second moments. **Answer 1**\n\n---\n\n**True/False example** — \"RMSProp uses an EMA of squared gradients.\" → True. **Answer 1**\n\n---\n\n**Application example** — \"If early training is unstable, what to check first? ① warmup + LR schedule ② disable backprop ③ delete data\" → Check warmup and schedule first. **Answer 1**\n\n---\n\n**Choice example** — \"A defining trait of Nesterov is? ① gradient at lookahead point ② current point only ③ no gradient\" → Nesterov uses lookahead. **Answer 1**\n\n---\n\n**Concept example** — \"In AdaGrad, the effective step size of frequently updated coordinates tends to? ① decrease ② stay constant ③ increase\" → It tends to decrease due to accumulation. **Answer 1**\n\n---\n\n**Calculation example** — \"If sample count is 64 and batch size is 16, how many steps per epoch?\" → $64/16=4$. **Answer 4**"},"summary":"**Optimization** converts gradient information into update steps to reduce loss $L(\\theta)$.\n\n**SGD** updates with minibatch gradient $\\hat{g}$, **Momentum** smooths zig-zag via velocity $v$, and **Adam/AdamW** adapts per-coordinate step size using first/second moments.\n\n**Practical debugging summary (symptom → first checks)**\n- Loss oscillation: `lr`, momentum, batch size\n- Early divergence/NaN: initialization, `lr`, `grad_norm`, clipping\n- Slow/plateaued learning: scheduler (with warmup), optimizer switch (SGD↔AdamW)\n- Validation stagnation: weight decay, augmentation, early stopping\n\n**Tuning order (quick decision)**\n1) Validate logs → 2) tune `lr` first → 3) choose optimizer → 4) combine with scheduler → 5) add stabilizers → 6) pick by mean performance + variance + reproducibility\n\n**Operating rule**: change one variable at a time, and record `optimizer/lr/batch_size/weight_decay/seed/scheduler` for 
comparison.","sectionLabels":{"whatIs":"What the idea is","whyImportant":"Why it matters","howUsed":"How it is used","summary":"Summary"},"formulaGuide":{"title":"Formulas in plain words","sgd":"**SGD step** $\\theta \\leftarrow \\theta - \\eta \\hat{g}$ — $\\hat{g}$ is a minibatch estimate; $\\eta$ is step size.","momentum":"**Momentum** $v \\leftarrow \\beta v + (1-\\beta)g$, $\\theta \\leftarrow \\theta - \\eta v$ — past directions accumulate in $v$ to smooth zig-zags.","adam":"**Adam (idea)** — EMA of gradients and squared gradients per coordinate; **bias correction** in early steps.","adaptive":"**Adaptive intuition** — large historical gradients → smaller effective steps per coordinate."},"visual":"Animation comparing SGD, Momentum, and Adam on a loss **mountain**: the same slopes can yield **different paths**.","problemSolvingLabel":"How to approach problems","practiceProblemsTitle":"Practice problems","practiceProblemsIntro":"Below are **10 problems** sampled from a bank of **60** (easy 4 · medium 3 · hard 3; order easy→medium→hard). For **multiple choice (①②③)**, enter **1, 2, or 3**. 
For **short answer**, follow the prompt: **1/0** for true/false, or a **single integer** for calculations.","practiceProblemsInstruction":"Read each problem and choose the correct option number.","midDlCh02VisualIntro":"Blindfolded on the same **loss mountain**, SGD, Momentum, and Adam pick **different routes** — simplified valley comparison below.","midDlCh02VisualStep0":"① **SGD**: step opposite the gradient each time (noisy minibatches → zig-zag).","midDlCh02VisualStep1":"② **Momentum**: accumulate velocity $v$ — smoother turns.","midDlCh02VisualStep2":"③ **Adam**: adaptive per-coordinate step sizes.","midDlCh02VisualStep3":"④ **Practice**: tune with logs, schedules, initialization (Ch.01).","midDlCh02VisualConceptTitle":"Concept: gradient →(transform)→ update","midDlCh02VisualFlowTitle":"Flow: forward → loss → backward → optimizer step","midDlCh02VisualModelTitle":"Update: $\\theta \\leftarrow \\theta - \\eta \\cdot(\\text{step from Adam, etc.})$","midDlCh02VisualLegendSgd":"SGD","midDlCh02VisualLegendMom":"Momentum","midDlCh02VisualLegendAdam":"Adam","midDlCh02VisualCaption":"**Red (SGD)** zig-zags more while descending and its sideways wiggle lasts longer. **Green (momentum)** damps the oscillation but still ends **slightly off the valley center**. **Blue (Adam)** reaches the **bottom center** fastest—so descent speed and final x differ clearly (illustrative).","problems":{"definition_0":"To reduce loss in one GD step, how should $\\theta$ move relative to $\\nabla L$?\n1) same direction as $\\nabla L$\n2) **opposite** to $\\nabla L$\n3) perpendicular to $\\nabla L$","definition_1":"After `loss.backward()` in PyTorch, which matches typical minibatch-SGD training? 
① exact full-dataset gradient every step ② update with $\\hat{g}$ from a **minibatch** ③ skip backprop","definition_2":"To smooth zig-zags in a narrow valley by accumulating past gradients in velocity $v$?\n1) only increase dropout\n2) use **momentum**\n3) fix batch size to 1 forever","definition_3":"What does Nesterov momentum evaluate differently from vanilla momentum?\n1) $g$ only at current $\\theta$\n2) $g$ after a **lookahead** along momentum\n3) validation loss only","definition_4":"What does AdaGrad accumulate to shrink per-coordinate steps? ① absolute weights ② **squared** gradients ③ epoch index","definition_5":"RMSProp’s hallmark vs unbounded AdaGrad sums: ① store gradient signs only ② **EMA** of squared gradients ③ fixed $\\eta$ only","definition_6":"Which pair best matches what Adam maintains?\n1) **1st & 2nd moments** (momentum + adaptive scaling)\n2) dropout masks only\n3) pooling sizes only","definition_7":"When $\\eta$ is too large, which is **NOT** a typical symptom? ① loss oscillation ② **guaranteed faster convergence only** ③ NaNs","definition_8":"Adam’s first moment $m$ is closest to: ① **EMA** of recent gradients ② always zero ③ validation accuracy","definition_9":"When picking an optimizer under data/model/log constraints, what to prioritize?\n1) monitor resolution\n2) **stability, speed, generalization**\n3) file extension","trueFalse_0":"[T/F] Typical `optimizer.step()` for GD moves $\\theta$ **opposite** the gradient to reduce loss. 1=true, 0=false.","trueFalse_1":"[T/F] Momentum forces the LR hyperparameter to always be 0. 1=true, 0=false.","trueFalse_2":"[T/F] Adam often combines adaptive denominators with momentum-like 1st moments. 1=true, 0=false.","trueFalse_3":"[T/F] AdaGrad can make effective updates tiny on some coordinates after long training. 1=true, 0=false.","trueFalse_4":"[T/F] RMSProp’s core is an EMA of squared gradients. 
1=true, 0=false.","trueFalse_5":"[T/F] Larger minibatches **always** increase the variance of the gradient estimate. 1=true, 0=false.","trueFalse_6":"[T/F] A cosine LR schedule changes $\\eta$ over time. 1=true, 0=false.","trueFalse_7":"[T/F] Nesterov evaluates gradients after a momentum lookahead step. 1=true, 0=false.","trueFalse_8":"[T/F] Adam’s $\\varepsilon$ keeps the denominator $\\sqrt{\\hat{v}}+\\varepsilon$ away from zero. 1=true, 0=false.","trueFalse_9":"[T/F] Adam is always better than SGD+momentum on every dataset. 1=true, 0=false.","scenario_0":"[Application] ResNet training: loss oscillates badly. **Best first** lever? ① retune **LR·momentum·batch** ② use 0 images ③ remove all BatchNorm","scenario_1":"Sparse BoW text classification: quick first optimizer family?\n1) **Adam/AdamW**-style adaptive\n2) pure full-batch GD only\n3) k-means","scenario_2":"When validation accuracy matters, image CNNs often use: ① **SGD+momentum (+schedule)** or Adam ② disable backward ③ no optimizer","scenario_3":"AdaGrad updates feel frozen after many epochs. 
Natural next step?\n1) switch to **RMSProp/Adam** and revisit LR\n2) keep batch=1 forever\n3) delete all inputs","scenario_4":"Want large LR early, smaller later — mainly requires: ① **scheduler/warmup** design ② infinite LR ③ skip `step()`","scenario_5":"`grad_norm` explodes — check alongside Ch.01 init: ① **LR·clipping·scale** ② log filenames ③ theme color","scenario_6":"Momentum $\\beta=0.99$ means:\n1) **longer memory** of past directions\n2) instant global optimum\n3) cannot train","scenario_7":"Using L2 with Adam, a common **decoupled** variant is: ① **AdamW** ② SGD only ③ AdaGrad only","scenario_8":"Small data overfitting — can you fix it **only** by swapping optimizers?\n1) **unlikely**; regularization/data first\n2) Adam always fixes it\n3) infinite LR","scenario_9":"Multi-GPU: suspected epoch-wise shuffle bias — inspect: ① **shuffling & synchronization** ② icons ③ remove GPU","choice_0":"vs pure batch GD, a hallmark of minibatch SGD is?\n1) no difference\n2) **sampling noise** in $\\hat{g}$ can help escape sharp regions\n3) no backprop","choice_1":"As $\\beta\\to 0$ in momentum, updates resemble: ① **plain SGD-like** steps ② guaranteed divergence ③ LR=0","choice_2":"Tutorial-favorite $(\\beta_1,\\beta_2)$ for Adam — closest?\n1) **$(0.9,\\,0.999)$**\n2) $(0,0)$\n3) $(1,1)$","choice_3":"Warmup slowly raises LR early in transformer finetuning to reduce: ① **early instability** ② keep LR=0 ③ delete data","choice_4":"Adam’s 2nd moment tracks something closest to: ① **EMA of squared gradients** ② absolute weights ③ batch index","choice_5":"Common “decoupled weight decay” variant with Adam:\n1) **AdamW**\n2) remove softmax\n3) batch 0","choice_6":"Reducing zig-zags in a narrow valley — most direct lever: ① **momentum** ② LR=0 ③ inference-only mode","choice_7":"Typical default scale for Adam’s $\\varepsilon$:\n1) **around $10^{-8}$**\n2) $10^{2}$\n3) exactly 0","choice_8":"If you only increase batch size (same model), gradient variance tends to: ① **decrease** ② 
stay identical ③ always increase","choice_9":"Exploding RNN language-model gradients, common fix: ① **gradient clipping** ② always harmful ③ inference only","concept_0":"For a sharp, narrow loss canyon, the most directly relevant combination is: ① augmentation only ② **momentum·LR·schedule/conditioning** ③ batch=1 lock","concept_1":"Bias correction in Adam mainly fixes:\n1) early **near-zero** bias in $m, v$\n2) LR always 0\n3) pooling size","concept_2":"With sparse features, AdaGrad tends to make steps on frequently updated coords: ① **smaller** ② identical ③ infinite","concept_3":"Same training loss but different validation performance: is that plausible?\n1) **different trajectories / implicit reg**\n2) optimizer changes the loss formula\n3) always identical","concept_4":"Compared with standard momentum, the point **where** Nesterov evaluates the gradient is: ① identical ② **different** ③ no backprop","concept_5":"RMSProp targets AdaGrad’s issue of: ① **unbounded** squared-gradient accumulation ② always larger LR ③ softmax","concept_6":"After moving to a very large batch, teams often revisit:\n1) **LR scaling** (e.g. 
linear rule)\n2) LR=0 lock\n3) delete data","concept_7":"Convention before each fresh `backward()`: ① **`optimizer.zero_grad()`** ② delete weights ③ freeze loss","concept_8":"Dividing by $\\sqrt{\\hat{v}}+\\epsilon$ makes effective steps where gradients are large: ① **relatively smaller** ② identical ③ always larger","concept_9":"ImageNet-style CNNs often use:\n1) **SGD+momentum + LR schedule**\n2) Adam only forever\n3) no optimizer","calc_0":"[Calc] **48** train samples, batch **16** — minibatch steps per epoch, one integer.","calc_1":"[Calc] **4** epochs, **25** steps/epoch — total parameter updates?","calc_2":"[Calc] $\\eta=3$, $g=2$ — integer $\\eta g$?","calc_3":"[Calc] $\\beta=0.9$, $v=10$, $g=10$ — integer $v \\leftarrow \\beta v + (1-\\beta)g$?","calc_4":"[Calc] $m=0$, $\\beta_1=0.9$, $g=20$ — integer $m \\leftarrow \\beta_1 m + (1-\\beta_1)g$?","calc_5":"[Calc] $\\beta=0.5$, $v=6$, $g=2$ — integer $v \\leftarrow \\beta v + (1-\\beta)g$?","calc_6":"[Calc] $\\beta_1=0.9$, $m=10$, $g=0$ — integer $m \\leftarrow \\beta_1 m + (1-\\beta_1)g$?","calc_7":"[Calc] $t=1$, $\\beta_1=0.9$ — integer $1/(1-\\beta_1^t)$?","calc_8":"[Calc] **2048** samples, batch **256** — steps per epoch?","calc_9":"[Calc] LR **0.002** times scale **500** — 
integer?"},"problemAnswers":{"definition_0":2,"definition_1":2,"definition_2":2,"definition_3":2,"definition_4":2,"definition_5":2,"definition_6":1,"definition_7":2,"definition_8":1,"definition_9":2,"trueFalse_0":1,"trueFalse_1":0,"trueFalse_2":1,"trueFalse_3":1,"trueFalse_4":1,"trueFalse_5":0,"trueFalse_6":1,"trueFalse_7":1,"trueFalse_8":1,"trueFalse_9":0,"scenario_0":1,"scenario_1":1,"scenario_2":1,"scenario_3":1,"scenario_4":1,"scenario_5":1,"scenario_6":1,"scenario_7":1,"scenario_8":1,"scenario_9":1,"choice_0":2,"choice_1":1,"choice_2":1,"choice_3":1,"choice_4":1,"choice_5":1,"choice_6":1,"choice_7":1,"choice_8":1,"choice_9":1,"concept_0":2,"concept_1":1,"concept_2":1,"concept_3":1,"concept_4":2,"concept_5":1,"concept_6":1,"concept_7":1,"concept_8":1,"concept_9":1,"calc_0":3,"calc_1":100,"calc_2":6,"calc_3":10,"calc_4":2,"calc_5":4,"calc_6":9,"calc_7":10,"calc_8":8,"calc_9":1},"problemSolutions":{"definition_0":"**1) Concept:** descend opposite $\\nabla L$. **2) Example:** logistic regression $\\theta\\leftarrow\\theta-\\eta\\nabla L$. **3) Answer 2**","definition_1":"**1)** Minibatch SGD uses a subset for $\\hat{g}$. **2) Example:** batch 64 uses 64 examples. **3) Answer 2**","definition_2":"**1)** Momentum accumulates velocity. **2) Example:** smoother in narrow valleys. **3) Answer 2**","definition_3":"**1)** Nesterov uses lookahead. **2) Example:** `SGD(..., nesterov=True)`. **3) Answer 2**","definition_4":"**1)** AdaGrad sums squared gradients. **2) Answer 2**","definition_5":"**1)** RMSProp uses EMA of squared grads. **2) Answer 2**","definition_6":"**1)** Adam uses 1st and 2nd moments. **2) Answer 1**","definition_7":"**1)** Large $\\eta$ can cause oscillation/NaNs, but does **not** guarantee \"only faster convergence.\" **2)** Option ② is that misconception. **3) Answer 2**","definition_8":"**1)** 1st moment ≈ EMA of gradients. **2) Answer 1**","definition_9":"**1)** Consider data, model, stability, speed. 
**2) Answer 2**","trueFalse_0":"Opposite to gradient is correct. **Answer 1**","trueFalse_1":"Momentum does not zero LR. **Answer 0**","trueFalse_2":"Adam combines adaptive + momentum-like pieces. **Answer 1**","trueFalse_3":"AdaGrad steps can vanish. **Answer 1**","trueFalse_4":"RMSProp uses squared-gradient EMA. **Answer 1**","trueFalse_5":"Larger batch usually **lowers** variance, so the statement is false. **Answer 0**","trueFalse_6":"Scheduling changes LR over time. **Answer 1**","trueFalse_7":"Nesterov uses lookahead. **Answer 1**","trueFalse_8":"$\\varepsilon$ is for numerical stability. **Answer 1**","trueFalse_9":"Not always better. **Answer 0**","scenario_0":"**1)** Oscillation → tune LR/momentum. **2) Example:** CNN loss spikes → reduce LR 10×. **3) Answer 1**","scenario_1":"**1)** Adam common for quick NLP experiments. **2) Answer 1**","scenario_2":"**1)** Vision: SGD+momentum or Adam. **2) Answer 1**","scenario_3":"**1)** Switch to RMSProp/Adam if AdaGrad stalls. **2) Answer 1**","scenario_4":"**1)** Use a scheduler. **2) Answer 1**","scenario_5":"**1)** Clip gradients, check LR/init. **2) Answer 1**","scenario_6":"**1)** Higher $\\beta$ → stronger inertia. **2) Answer 1**","scenario_7":"**1)** AdamW for decoupled WD. **2) Answer 1**","scenario_8":"**1)** Overfitting needs reg/data, not optimizer alone. **2) Answer 1**","scenario_9":"**1)** Shuffle / sync in distributed training. **2) Answer 1**","choice_0":"**1)** Minibatch noise can help exploration. **2) Answer 2**","choice_1":"**1)** $\\beta\\approx0$ ≈ SGD. **2) Answer 1**","choice_2":"**1)** 0.9 / 0.999 typical. **2) Answer 1**","choice_3":"**1)** Warmup stabilizes early training. **2) Answer 1**","choice_4":"**1)** 2nd moment ≈ squared-gradient EMA. **2) Answer 1**","choice_5":"**1)** AdamW decouples decay. **2) Answer 1**","choice_6":"**1)** Momentum smooths zig-zags. **2) Answer 1**","choice_7":"**1)** $\\varepsilon\\sim10^{-8}$. **2) Answer 1**","choice_8":"**1)** Larger batch → lower variance trend. 
**2) Answer 1**","choice_9":"**1)** Clipping helps exploding grads. **2) Answer 1**","concept_0":"**1)** Zig-zag → momentum/LR. **2) Example:** transformer finetuning with Adam+warmup. **3) Answer 2**","concept_1":"**1)** Bias correction fixes early near-zero moments. **2) Answer 1**","concept_2":"**1)** AdaGrad shrinks steps on frequently updated coordinates. **2) Answer 1**","concept_3":"**1)** Different trajectories → different generalization. **2) Answer 1**","concept_4":"**1)** Nesterov changes where the gradient is evaluated. **2) Answer 2**","concept_5":"**1)** RMSProp limits unbounded AdaGrad growth. **2) Answer 1**","concept_6":"**1)** Large batch may need LR scaling. **2) Answer 1**","concept_7":"**1)** `zero_grad()` standard. **2) Answer 1**","concept_8":"**1)** Large $\\hat{v}$ shrinks steps. **2) Answer 1**","concept_9":"**1)** ImageNet-style training often uses SGD+momentum with an LR schedule. **2) Answer 1**","calc_0":"**1)** $48/16=3$. **2) Example:** batch 16 covers 48 samples in 3 steps. **3) Answer 3**","calc_1":"**1)** $4\\times25=100$. **2) Answer 100**","calc_2":"**1)** $3\\times2=6$. **2) Answer 6**","calc_3":"**1)** $0.9\\cdot10+0.1\\cdot10=10$. **2) Answer 10**","calc_4":"**1)** $0.9\\cdot0+0.1\\cdot20=2$. **2) Answer 2**","calc_5":"**1)** $0.5\\cdot6+0.5\\cdot2=4$. **2) Answer 4**","calc_6":"**1)** $0.9\\cdot10+0.1\\cdot0=9$. **2) Answer 9**","calc_7":"**1)** $1/(1-0.9)=10$. **2) Answer 10**","calc_8":"**1)** $2048/256=8$. **2) Answer 8**","calc_9":"**1)** $0.002\\cdot500=1$. 
**2) Answer 1**"},"problemTestCodes":{"definition_0":"answer = 2\nassert answer == 2","definition_1":"answer = 2\nassert answer == 2","definition_2":"answer = 2\nassert answer == 2","definition_3":"answer = 2\nassert answer == 2","definition_4":"answer = 2\nassert answer == 2","definition_5":"answer = 2\nassert answer == 2","definition_6":"answer = 1\nassert answer == 1","definition_7":"answer = 2\nassert answer == 2","definition_8":"answer = 1\nassert answer == 1","definition_9":"answer = 2\nassert answer == 2","trueFalse_0":"answer = 1\nassert answer == 1","trueFalse_1":"answer = 0\nassert answer == 0","trueFalse_2":"answer = 1\nassert answer == 1","trueFalse_3":"answer = 1\nassert answer == 1","trueFalse_4":"answer = 1\nassert answer == 1","trueFalse_5":"answer = 0\nassert answer == 0","trueFalse_6":"answer = 1\nassert answer == 1","trueFalse_7":"answer = 1\nassert answer == 1","trueFalse_8":"answer = 1\nassert answer == 1","trueFalse_9":"answer = 0\nassert answer == 0","scenario_0":"answer = 1\nassert answer == 1","scenario_1":"answer = 1\nassert answer == 1","scenario_2":"answer = 1\nassert answer == 1","scenario_3":"answer = 1\nassert answer == 1","scenario_4":"answer = 1\nassert answer == 1","scenario_5":"answer = 1\nassert answer == 1","scenario_6":"answer = 1\nassert answer == 1","scenario_7":"answer = 1\nassert answer == 1","scenario_8":"answer = 1\nassert answer == 1","scenario_9":"answer = 1\nassert answer == 1","choice_0":"answer = 2\nassert answer == 2","choice_1":"answer = 1\nassert answer == 1","choice_2":"answer = 1\nassert answer == 1","choice_3":"answer = 1\nassert answer == 1","choice_4":"answer = 1\nassert answer == 1","choice_5":"answer = 1\nassert answer == 1","choice_6":"answer = 1\nassert answer == 1","choice_7":"answer = 1\nassert answer == 1","choice_8":"answer = 1\nassert answer == 1","choice_9":"answer = 1\nassert answer == 1","concept_0":"answer = 2\nassert answer == 2","concept_1":"answer = 1\nassert answer == 1","concept_2":"answer = 
1\nassert answer == 1","concept_3":"answer = 1\nassert answer == 1","concept_4":"answer = 2\nassert answer == 2","concept_5":"answer = 1\nassert answer == 1","concept_6":"answer = 1\nassert answer == 1","concept_7":"answer = 1\nassert answer == 1","concept_8":"answer = 1\nassert answer == 1","concept_9":"answer = 1\nassert answer == 1","calc_0":"n, b = 48, 16\nanswer = n // b\nassert answer == 3","calc_1":"epochs, steps = 4, 25\nanswer = epochs * steps\nassert answer == 100","calc_2":"eta, g = 3, 2\nanswer = eta * g\nassert answer == 6","calc_3":"beta, v, g = 0.9, 10, 10\nanswer = int(round(beta * v + (1 - beta) * g))\nassert answer == 10","calc_4":"beta1, m, g = 0.9, 0, 20\nanswer = int(round(beta1 * m + (1 - beta1) * g))\nassert answer == 2","calc_5":"beta, v, g = 0.5, 6, 2\nanswer = int(round(beta * v + (1 - beta) * g))\nassert answer == 4","calc_6":"beta1, m, g = 0.9, 10, 0\nanswer = int(round(beta1 * m + (1 - beta1) * g))\nassert answer == 9","calc_7":"beta1, t = 0.9, 1\nanswer = int(round(1 / (1 - beta1 ** t)))\nassert answer == 10","calc_8":"n, b = 2048, 256\nanswer = n // b\nassert answer == 8","calc_9":"lr, k = 0.002, 500\nanswer = int(round(lr * k))\nassert answer == 
1"},"problemDifficulty":{"definition_0":"easy","definition_1":"easy","definition_2":"easy","definition_3":"easy","definition_4":"easy","definition_5":"easy","definition_6":"easy","definition_7":"easy","definition_8":"easy","definition_9":"easy","trueFalse_0":"easy","trueFalse_1":"easy","trueFalse_2":"easy","trueFalse_3":"easy","trueFalse_4":"easy","trueFalse_5":"easy","trueFalse_6":"easy","trueFalse_7":"easy","trueFalse_8":"easy","trueFalse_9":"easy","scenario_0":"medium","scenario_1":"medium","scenario_2":"medium","scenario_3":"medium","scenario_4":"medium","scenario_5":"medium","scenario_6":"medium","scenario_7":"medium","scenario_8":"medium","scenario_9":"medium","choice_0":"medium","choice_1":"medium","choice_2":"medium","choice_3":"medium","choice_4":"medium","choice_5":"medium","choice_6":"medium","choice_7":"medium","choice_8":"medium","choice_9":"medium","concept_0":"hard","concept_1":"hard","concept_2":"hard","concept_3":"hard","concept_4":"hard","concept_5":"hard","concept_6":"hard","concept_7":"hard","concept_8":"hard","concept_9":"hard","calc_0":"hard","calc_1":"hard","calc_2":"hard","calc_3":"hard","calc_4":"hard","calc_5":"hard","calc_6":"hard","calc_7":"hard","calc_8":"hard","calc_9":"hard"},"problemOrder":["definition_0","definition_1","definition_2","definition_3","definition_4","definition_5","definition_6","definition_7","definition_8","definition_9","trueFalse_0","trueFalse_1","trueFalse_2","trueFalse_3","trueFalse_4","trueFalse_5","trueFalse_6","trueFalse_7","trueFalse_8","trueFalse_9","scenario_0","scenario_1","scenario_2","scenario_3","scenario_4","scenario_5","scenario_6","scenario_7","scenario_8","scenario_9","choice_0","choice_1","choice_2","choice_3","choice_4","choice_5","choice_6","choice_7","choice_8","choice_9","concept_0","concept_1","concept_2","concept_3","concept_4","concept_5","concept_6","concept_7","concept_8","concept_9","calc_0","calc_1","calc_2","calc_3","calc_4","calc_5","calc_6","calc_7","calc_8","calc_9"]},"midMlChapters":
{"midMl00":{"chapter":"Chapter 00","title":"Intermediate ML: Real-World Data Limits and Model Optimization","description":"Introduces preprocessing for real-world data with missing values, outliers, and nonlinear relations, and the need for model performance optimization."},"midMl01":{"chapter":"Chapter 01","title":"Data Scaling and Distribution Transformation","description":"Covers standardization, min-max scaling, and robust scaling so features with different units affect the model uniformly and to handle outliers."},"midMl02":{"chapter":"Chapter 02","title":"Categorical Encoding","description":"Explains one-hot encoding, ordinal encoding, and target encoding to convert categorical data into numeric form for computation."},"midMl03":{"chapter":"Chapter 03","title":"Missing Data and Imputation","description":"Covers mean/median imputation, KNN-based imputation, and regression-based imputation beyond simple deletion of missing values."},"midMl04":{"chapter":"Chapter 04","title":"Imbalanced Data Basics","description":"Covers SMOTE and class weights so the model does not bias toward the majority in fraud detection, diagnosis, and similar settings."},"midMl05":{"chapter":"Chapter 05","title":"Advanced Cross Validation","description":"Covers stratified cross-validation to preserve class ratios and time series split when temporal order must be preserved."},"midMl06":{"chapter":"Chapter 06","title":"Multiclass Evaluation and ROC-AUC","description":"Extends precision/recall to multiclass (micro/macro) and introduces ROC curves for overall classification performance across thresholds."},"midMl07":{"chapter":"Chapter 07","title":"SVM Basics: Decision Boundary and Margin","description":"Finds the optimal separating hyperplane by maximizing the margin to the nearest support vectors."},"midMl08":{"chapter":"Chapter 08","title":"Kernel Trick: Nonlinear SVM","description":"Maps data to a higher-dimensional space via inner products (kernel) to achieve nonlinear separation without 
explicit feature transformation."},"midMl09":{"chapter":"Chapter 09","title":"Dimensionality Reduction 1: PCA","description":"Linear compression onto a few orthogonal principal components that retain most of the data variance."},"midMl10":{"chapter":"Chapter 10","title":"Ensemble: Bagging and Pasting","description":"Bagging (bootstrap) and pasting (no replacement) build multiple models and combine by voting; explains bias-variance tradeoff."},"midMl11":{"chapter":"Chapter 11","title":"Boosting Basics: AdaBoost","description":"Sequentially combines weak learners by upweighting misclassified samples to reduce error."},"midMl12":{"chapter":"Chapter 12","title":"Gradient Boosting Machine (GBM)","description":"Each new tree fits the residual of the previous ensemble, combining gradient descent with ensemble learning."},"midMl13":{"chapter":"Chapter 13","title":"Density-Based Clustering (DBSCAN)","description":"Forms clusters by density and flags noise, going beyond spherical K-means."},"midMl14":{"chapter":"Chapter 14","title":"Hierarchical Clustering and Dendrogram","description":"Unsupervised merging or splitting of similar points into a hierarchical tree (dendrogram) without fixing the number of clusters."},"midMl15":{"chapter":"Chapter 15","title":"Gaussian Mixture Model (GMM)","description":"Soft clustering by assuming data from a mixture of Gaussians and estimating membership via EM."},"midMl16":{"chapter":"Chapter 16","title":"Anomaly Detection Basics","description":"Unsupervised/semi-supervised methods using distribution or distance to flag points that deviate from normal patterns."},"midMl17":{"chapter":"Chapter 17","title":"Pipeline: Modeling Automation","description":"Chains scaling, encoding, dimensionality reduction, and model training in one workflow to improve reuse and avoid data leakage."},"midMl18":{"chapter":"Chapter 18","title":"Hyperparameter Tuning 1: Grid and Random Search","description":"Compares grid search (full combinations) and random search 
for finding hyperparameters such as tree depth and learning rate."},"midMl19":{"chapter":"Chapter 19","title":"Hyperparameter Tuning 2: Bayesian Optimization (Optuna)","description":"Uses a surrogate model of past trials to suggest the next hyperparameters for faster, more efficient search."},"midMl20":{"chapter":"Chapter 20","title":"Intermediate ML Summary","description":"Summarizes the pipeline from missing data and scaling through PCA, SVM and boosting, to hyperparameter tuning."}},"advMlChapters":{"advMl00":{"chapter":"Chapter 00","title":"Advanced ML: SOTA Models and Interpretability","description":"Principles of optimized boosting ensembles used in Kaggle and the importance of XAI for interpreting black-box predictions."},"advMl01":{"chapter":"Chapter 01","title":"XGBoost Algorithm","description":"Algorithm that improves on GBM speed and adds regularization to control tree complexity and prevent overfitting."},"advMl02":{"chapter":"Chapter 02","title":"LightGBM Algorithm","description":"Leaf-wise growth for speed and accuracy; contrast with level-wise tree building."},"advMl03":{"chapter":"Chapter 03","title":"CatBoost: Categorical Boosting","description":"Ordered Boosting to avoid target leakage; strong on tabular data with many categories."},"advMl04":{"chapter":"Chapter 04","title":"t-SNE for Manifold Visualization","description":"Nonlinear dimensionality reduction preserving local structure for 2D/3D visualization."},"advMl05":{"chapter":"Chapter 05","title":"UMAP: Topological Geometry","description":"Fast manifold learning preserving local and global structure; alternative to t-SNE."},"advMl06":{"chapter":"Chapter 06","title":"Isolation Forest","description":"Unsupervised anomaly detection using random splits; anomalies need fewer splits to isolate."},"advMl07":{"chapter":"Chapter 07","title":"One-Class SVM","description":"Kernel-based method learning a boundary around normal data; points outside are anomalies."},"advMl08":{"chapter":"Chapter 
08","title":"Feature Selection and Importance","description":"Permutation importance and other ways to identify key variables."},"advMl09":{"chapter":"Chapter 09","title":"XAI 1: Partial Dependence Plot (PDP)","description":"Marginal effect of a feature on model prediction; global interpretability."},"advMl10":{"chapter":"Chapter 10","title":"XAI 2: LIME","description":"Local linear approximation to explain individual predictions."},"advMl11":{"chapter":"Chapter 11","title":"XAI 3: SHAP","description":"Shapley values for fair feature attribution to predictions."},"advMl12":{"chapter":"Chapter 12","title":"Time Series Preprocessing and Stationarity","description":"ADF test and differencing for stationarity."},"advMl13":{"chapter":"Chapter 13","title":"ARIMA and SARIMA","description":"Classical statistical forecasting with AR, MA, I, and seasonality."},"advMl14":{"chapter":"Chapter 14","title":"Prophet: Structural Time Series","description":"Trend, seasonality, and holiday effects for interpretable forecasting."},"advMl15":{"chapter":"Chapter 15","title":"Content-Based Filtering","description":"Recommendations from item attributes and similarity (e.g. 
cosine)."},"advMl16":{"chapter":"Chapter 16","title":"Matrix Factorization","description":"Latent factors for user-item rating prediction."},"advMl17":{"chapter":"Chapter 17","title":"Factorization Machines","description":"Efficient modeling of feature interactions in high-dimensional sparse data."},"advMl18":{"chapter":"Chapter 18","title":"Association Rules and Apriori","description":"Support, confidence, lift; traditional basket analysis."},"advMl19":{"chapter":"Chapter 19","title":"AutoML Basics: PyCaret and FLAML","description":"Automating preprocessing, model selection, and hyperparameter tuning."},"advMl20":{"chapter":"Chapter 20","title":"Advanced ML Summary: SOTA Pipeline and XAI","description":"From XGBoost/LightGBM pipelines to SHAP, time series, and recommender systems."}},"advDlChapters":{"advDl00":{"chapter":"Chapter 00","title":"Advanced DL: Large Models and Generative AI Paradigm"},"advDl01":{"chapter":"Chapter 01","title":"Transformer 1: Self-Attention and Parallelization"},"advDl02":{"chapter":"Chapter 02","title":"Transformer: Positional Encoding and Feed-Forward"},"advDl03":{"chapter":"Chapter 03","title":"Transformer Lineage: Encoder (BERT) vs Decoder (GPT)"},"advDl04":{"chapter":"Chapter 04","title":"Attention Optimization: FlashAttention and Sparse Attention"},"advDl05":{"chapter":"Chapter 05","title":"Vision Transformer (ViT) and Image Patches"},"advDl30":{"chapter":"Chapter 06","title":"Swin Transformer: Hierarchical Windows and Global Context"},"advDl31":{"chapter":"Chapter 07","title":"Vision Models: Local CNN vs Global ViT"},"advDl08":{"chapter":"Chapter 08","title":"PEFT 1: PEFT and LoRA"},"advDl09":{"chapter":"Chapter 09","title":"QLoRA and Quantization: Tuning When Smaller"},"advDl10":{"chapter":"Chapter 10","title":"Value Alignment and RLHF: Matching Human Preferences"},"advDl11":{"chapter":"Chapter 11","title":"DPO: Aligning with Preferences without Reinforcement Learning"},"advDl12":{"chapter":"Chapter 12","title":"RAG: Reducing 
Hallucinations with Retrieval"},"advDl13":{"chapter":"Chapter 13","title":"LLM Agents: Models That Use Tools"},"advDl27":{"chapter":"Chapter 14","title":"Master CNNs: Kernels, Stride, Padding & Backbone Evolution"},"advDl28":{"chapter":"Chapter 15","title":"Object Detection: R-CNN Family vs YOLO (Bounding Boxes)"},"advDl29":{"chapter":"Chapter 16","title":"Image Segmentation: U-Net and DeepLab (Pixel-Level Understanding)"},"advDl15":{"chapter":"Chapter 17","title":"Grad-CAM and XAI: Where CNNs Look"},"advDl14":{"chapter":"Chapter 18","title":"Graph Neural Networks (GNN): Message Passing to Neighbors"},"advDl16":{"chapter":"Chapter 19","title":"Autoencoder: Compress and Reconstruct"},"advDl17":{"chapter":"Chapter 20","title":"VAE: A Generative Space in Probability"},"advDl18":{"chapter":"Chapter 21","title":"GAN Basics: Generator vs Discriminator"},"advDl19":{"chapter":"Chapter 22","title":"Conditional GAN: Generate on Condition"},"advDl20":{"chapter":"Chapter 23","title":"Diffusion 1: Add Noise, Then Denoise"},"advDl21":{"chapter":"Chapter 24","title":"Diffusion 2: Diffusing in Latent Space"},"advDl22":{"chapter":"Chapter 25","title":"Vision-Language Models and CLIP: Images and Text Together (CNN Meets LLM)"},"advDl23":{"chapter":"Chapter 26","title":"Speech Recognition and Audio: Sound to Text"},"advDl24":{"chapter":"Chapter 27","title":"Model Compression and Knowledge Distillation"},"advDl25":{"chapter":"Chapter 28","title":"Inference Optimization and Deployment: From Servers to Browser Runtimes"},"advDl26":{"chapter":"Chapter 29","title":"Advanced DL Wrap-Up: Architecture and Future"}},"advDlCh00":{"chapter":"Chapter 00","title":"Advanced DL: Large Models and Generative AI Paradigm","description":"Advanced Deep Learning (Ch.00) is the entry point that connects “why models got so large” with “how generative AI systems actually work.” We go beyond learning representations from data: how large Transformers build contextual understanding, predict the next token, and 
then how we align, control, and deploy those models for real users.","roadmapTitle":"An advanced roadmap toward large generative models","roadmapDescription":"This roadmap gradually fills from Ch01 onward, showing how each chapter contributes to the full system.","roadmapListHeading":"What you will learn in Ch01–Ch24","sectionTitle":"What is Advanced DL? (Generative AI system view)","sectionLabels":{"whatIs":"What it is","whyImportant":"Why it matters","howUsed":"How it is used","problemSolving":"Problem-solving guide"},"whatIs":{"0":"**Foundation models / LLMs** are trained with the objective of predicting the next token. In other words, they maximize $p(x_t\\mid x_{<t})$. […] `… -> tokenization -> context window -> Transformer -> decoding (greedy/beam/sample)`. Decoding strategy and prompt design strongly affect output quality.","1":"Alignment and control can be done in multiple ways. For example, **RLHF / DPO** use preference data to improve the model, and **RAG** retrieves external knowledge to ground answers.","2":"From a product perspective, **tool use**, caching/batching, and optimizations such as quantization or knowledge distillation are part of the whole stack. The same base model can feel very different depending on how you run it."},"problemSolving":{"0":"This section ties the whole Advanced DL track to how you might reason about it in exam-style questions. **Next-token prediction** in pretraining builds broad language ability and connects to probabilistic generation and representation learning. **Instruction tuning and SFT** shape how models follow user intent, which brings in data formatting and fine-tuning.\n\n**Alignment** addresses preferences, safety, and truthfulness through ideas like preference learning and reward modeling. **RAG and grounded generation** lean on retrieval, embeddings, and assembling context to reduce ungrounded answers. 
**Inference optimization** targets latency and cost with quantization, caching, distillation, and similar serving-side tools."}},"advDlCh01":{"chapter":"Chapter 01","title":"Transformer 1: See Self-Attention at a Glance","description":"A Transformer block is built from three parts: **Self-Attention**, **Add & Norm** (residual connection and layer normalization), which keeps training stable, and a **Feed Forward** neural network that further transforms the gathered information. Whereas older models read tokens one by one and often lose earlier context, Transformers process the entire sentence at once, like viewing it from above. In this chapter, you will learn the attention mechanism with Query, Key, and Value, and the intuitive roles of Add & Norm and Feed Forward that help the model learn deeply and reliably.","sectionTitle":"Transformer 1: See Self-Attention at a Glance","whatIs":{"0":"**Concept Explanation: The Eye for Context**\n\nSelf-attention lets every token look at all other tokens at the same time, and it assigns weights to decide how much each token should influence the meaning of the current token. For example, in the phrase \"I went to the bank\", self-attention can infer whether \"bank\" refers to a financial institution or a riverbank by weighing the surrounding words.","1":"**Intuition: Query (Q), Key (K), Value (V)**\n\nThink of it like searching in a library.\n1. **Query (Q)** is what you type: the question you want to answer.\n2. **Key (K)** is the information on book labels: what each token contains as searchable features.\n3. **Value (V)** is the actual content you get.\nSelf-attention scores how well Q matches K, then mixes V according to the scores to produce the updated meaning.","2":"**Mathematical Explanation: Scaled Dot-Product Attention**\n\nLet the input embeddings be a matrix $X$. We project them into $Q=XW_Q$, $K=XW_K$, and $V=XW_V$ using learnable matrices. Attention scores come from the dot product $QK^T$. 
When the dimension is large, dot-product values can become too big, so we divide by $\\sqrt{d_k}$ for scaling. After applying softmax, we obtain weights $A=\\mathrm{softmax}(QK^T/\\sqrt{d_k})$. The final output is $AV$. Here, $d_k$ is the key dimension, and $A$ is the weight matrix that tells how much each token attends to others.","3":"**Real ML Example: Smarter Sentence Understanding**\n\nIn spam filtering, words like \"free\" and \"click\" may be far apart, but self-attention captures their strong relationship at once and helps decide whether a message is spam. In medical document classification, it can connect symptoms, lab results, and negations like \"no\" in the same step to reduce misclassification."},"whyImportant":{"0":"**Concept Explanation: Handling Long-Range Dependency Directly**\n\nSelf-attention is effective because it can capture long-range dependency directly. When a word at the beginning of a sentence influences the meaning at the end, self-attention connects the two without routing the information through every token in between.","1":"**Intuition: Relay (RNN) vs. Group Chat (Self-Attention)**\n\nRNN-style processing passes information like a relay race, so information can weaken as it moves step by step. Self-attention is like a group chat: everyone sees all messages at the same time, so distant information is available immediately.","2":"**Mathematical Explanation: Shorter Information Path**\n\nIn RNNs, the information path length grows with the token distance $n$ (about $O(n)$), which makes gradient flow harder for long distances. In self-attention, all tokens are connected in a single step, so the path length is about $O(1)$. 
Short paths make training more stable and help preserve important dependencies.","3":"**Real ML Example: Summarizing Long Texts**\n\nThe advantage is most obvious in tasks like summarizing long legal documents or analyzing long chat logs where key points at the beginning must connect to conclusions at the end."},"howUsed":{"0":"**Practical Assembly: Pre-work and a Conveyor Belt**\n\nIn real systems, you prepare text before feeding it into the model. First, split the sentence into token pieces and convert them to vectors (embeddings). Then add positional encoding so the model knows the order—because attention itself behaves like a \"set\" operation that ignores order. Positional encoding acts like a tiny tag such as \"this is the k-th token\". After that, the data goes through a big transformer block: [multi-head attention → Add & Norm → feed-forward → Add & Norm]. Repeating this block 12 times leads to a BERT-Base style model, while repeating 96 times reaches GPT-3 scale.","1":"**Multi-Head Attention: Expert Committee Division**\n\nInstead of relying on one perspective, you use an expert committee: multiple heads in parallel. For example, if you have 512-dimensional vectors, split them into 8 heads so each head processes 64-dimensional slices from its own viewpoint. One head may focus on grammatical relations like \"who did what\", another on sentiment nuances like positive vs. negative cues, and another on named entities such as people and places. Each head computes attention using $\\mathrm{head}_h=\\mathrm{softmax}(Q_hK_h^T/\\sqrt{d_k})V_h$. After that, concatenate the 64-dim outputs from all heads (8 heads) and project back to 512 dimensions to form a rich representation.","2":"**Feed Forward + Activation: Microscopic Observation and Feature Extraction**\n\nOnce attention has identified relationships between tokens, the feed-forward network (FFNN) updates each token representation independently and deeply. 
In practice, it often uses an expand-then-compress structure: expand the dimension (e.g., 512 → 2048) to observe fine-grained patterns, then apply a non-linear activation such as ReLU or GELU, and finally compress back to the original size. With ReLU as the activation, the formula is: $\\mathrm{FFN}(x)=\\max(0, xW_1 + b_1)W_2 + b_2$. This helps the model move beyond simple pattern matching and learn complex concepts.","3":"**Code and Framework Usage: Hyperparameter Tuning**\n\nFrameworks package all of this math as a single building block; in PyTorch, for example, it is `nn.TransformerEncoderLayer`. Practitioners usually do not derive everything from scratch; instead they tune key hyperparameters. You choose the embedding size $d_{model}$, the number of attention heads $n_{head}$, and the feed-forward expansion size $dim_{feedforward}$. Match these settings to your application and available GPU resources."},"problemSolving":{"0":"For self-attention items, start from “every token attends to every token to build weights $A=\\mathrm{softmax}(QK^T/\\sqrt{d_k})$.” You form Q, K, V with $W_Q,W_K,W_V$, scale by $\\sqrt{d_k}$, then softmax so each query row sums to 1. Multi-head runs several attentions in parallel; config questions often use $d_{model}=n_{head}\\times d_{head}$.","2":"$2a","3":"**Short definition** — \"Softmax in $\\mathrm{Attention}(Q,K,V)$ is applied along which axis? ① keys for each query row ② columns only ③ batch only\" → per query over keys. **Answer 1**\n\n---\n\n**T/F** — \"Scaling by $\\sqrt{d_k}$ helps prevent extremely peaked softmax when dot products are large.\" → True. **Answer 1**\n\n---\n\n**Application** — \"For long-range subject–verb agreement in translation, what helps? ① self-attention weights ② unigram counts only\" → ①. **Answer 1**\n\n---\n\n**Choice** — \"4 heads, head dim 32 → $d_{model}$? ① 128 ② 36 ③ 8\" → $4\\times32=128$. 
**Answer 1**\n\n---\n\n**Calc** — \"Sequence length 20: about how many score cells in an $N\\times N$ matrix?\" → $400$. **Answer 400**"},"summary":"$2b","sectionLabels":{"whatIs":"What it is","whyImportant":"Why it matters","howUsed":"How it is used","summary":"Summary"},"formulaGuide":{"title":"Understanding the core formulas","linear":"$$Q=XW_Q$, $K=XW_K$, $V=XW_V$ where $X$ is the input embeddings and $W_Q/W_K/W_V$ are learnable projection matrices. This step splits the same text into \"query-like\", \"key-like\", and \"value-like\" representations.","xavierVariance":"$$S=QK^T$ is the token-to-token relevance score matrix. Larger scores mean stronger relationships, but when the dimension is large the raw values can grow too much—so we divide by $\\sqrt{d_k}$ to stabilize them.","heVariance":"$$A=\\mathrm{softmax}(S/\\sqrt{d_k})$ produces a weight matrix whose rows sum to 1, meaning each token decides how much it should attend to others.","xavierUniform":"$$O=AV$ is the final context representation created by mixing values using weights $A$. The key is that it is a weighted average based on importance, not a plain average."},"visual":"The conceptual diagram is drawn in the order `input tokens → embeddings → Q/K/V projections → similarity matrix (QK^T) → scaling (√d_k) → softmax → weighted sum (AV) → multi-head combination`. The learning flow is shown as vertical stages: `tokenization → positional information → self-attention → feed-forward → prediction`. The model operation diagram sends arrows from one token to all tokens; the arrow thickness represents attention strength. On the frontend, the container uses `min-w-0`, `max-w-full`, `overflow-visible`, and `minHeight: \"320px\"`, while the SVG is responsive via `viewBox` to avoid clipping on mobile.","problemSolvingLabel":"Tips for solving the problems","practiceProblemsTitle":"Practice problems","practiceProblemsIntro":"Below are 10 randomly selected problems from a pool of 60. 
The difficulty mix is easy 4, medium 3, hard 3, and answers are integers only.","practiceProblemsInstruction":"Read the problem prompt and enter the correct integer in the blank (?).","practiceProblemsInstructionConcept":"Read the prompt and options ①②③, then enter one choice index.","practiceProblemsInstructionOx":"Enter 1 if the statement is true, 0 if false.","practiceProblemsInstructionScenario":"Read the question and options ①②③, then enter one choice index.","practiceProblemsInstructionVote":"Enter one integer: the count of 1s (sum) in the given binary vector.","practiceProblemsInstructionAggregate":"Enter one integer: the sum of the given numbers.","practiceProblemsInstructionConfig":"Read the grid/setup prompt, then enter one integer (e.g. cells in an $n\\times n$ grid = $n^2$).","practiceProblemsInstructionEnsemble":"Read the prompt and options ①②③, then enter one choice index for the best trade-off / statement.","advDlCh01VisualIntro":"Self-attention is an operation where each token looks at all tokens and reconstructs context.","advDlCh01VisualStep0":"① Create token embeddings and linearly project them into Q, K, V","advDlCh01VisualStep1":"② Compute relation scores with QK^T","advDlCh01VisualStep2":"③ Scale by √d_k and normalize weights with softmax","advDlCh01VisualStep3":"④ Multiply the weights by V to form context vectors, then combine heads","advDlCh01VisualConceptTitle":"Concept structure: Q/K/V → scores → normalization → weighted sum","advDlCh01VisualFlowTitle":"Learning flow: tokenization → attention → update representations → prediction","advDlCh01VisualModelTitle":"Model operation: each token attends to all tokens at once","advDlCh01VisualInputTokenLabel":"Input tokens","advDlCh01VisualTokenRelationLabel":"Token relations (self-attention)","advDlCh01VisualContextVectorOutputLabel":"Context output","advDlCh01VisualContextVectorExplainLine1":"A context vector is","advDlCh01VisualContextVectorExplainLine2":"summary of attended 
info","advDlCh01VisualCoreFormulaLabel":"Core formula","advDlCh01VisualLegendWeak":"Weak reference","advDlCh01VisualLegendMedium":"Medium reference","advDlCh01VisualLegendStrong":"Strong reference","advDlCh01VisualCurrentSuffix":" (current)","problems":{"concept_0":"Problem instruction: Choose the key function of self-attention.\n\nActual question: Which mechanism computes importance by letting each token reference the entire sentence at the same time? ① Self-attention ② Max pooling ③ Dropout","concept_1":"Problem instruction: Confirm the meanings of Q, K, and V.\n\nActual question: Which description is closest to Query? ① A vector that represents what information you want to find ② The correct label ③ A loss value","concept_2":"Problem instruction: Check what the symbols in the formula mean.\n\nActual question: In $A=softmax(QK^T/\\sqrt{d_k})$, what is $d_k$? ① Batch size ② Key vector dimension ③ Number of classes","concept_3":"Problem instruction: Choose the intuition behind multi-head attention.\n\nActual question: The best reason to use multi-head attention is? ① It simultaneously looks at relationships from different perspectives ② It makes parameters equal to 0 ③ It deletes tokens","concept_4":"Problem instruction: Choose the advantage of self-attention.\n\nActual question: Why can self-attention capture word relationships far apart in a long sentence? ① It can directly reference any token in one layer ② Sentences always become short ③ The loss function disappears","concept_5":"Problem instruction: Connect with a real-world example.\n\nActual question: Why is self-attention especially useful in spam email classification? ① It considers interactions between words together ② It automatically generates training data ③ It removes the GPU","ox_0":"Problem instruction: Decide whether each statement is true or false.\n\nActual question: Self-attention allows each token to refer to all other tokens at the same time. 
True -> 1, False -> 0.","ox_1":"Problem instruction: Decide whether each statement is true or false.\n\nActual question: Query, Key, and Value mean the same thing, so no distinction is needed. True -> 1, False -> 0.","ox_2":"Problem instruction: Decide whether each statement is true or false.\n\nActual question: In scaled dot-product attention, the purpose of dividing by $\\sqrt{d_k}$ is to reduce score explosion. True -> 1, False -> 0.","ox_3":"Problem instruction: Decide whether each statement is true or false.\n\nActual question: Multi-head attention always makes the information representation simpler than a single head. True -> 1, False -> 0.","ox_4":"Problem instruction: Decide whether each statement is true or false.\n\nActual question: After softmax, the sum of a token's attention weights is usually 1. True -> 1, False -> 0.","ox_5":"Problem instruction: Decide whether each statement is true or false.\n\nActual question: Self-attention is used in NLP tasks such as translation, summarization, and classification. True -> 1, False -> 0.","scenario_0":"Problem instruction: Choose the most appropriate option in the given situation.\n\nActual question: In a long customer support log, when an early negation expression flips the meaning of a later sentence, which model component is most helpful? ① Self-attention ② Average pooling only ③ A simple rule-based system","scenario_1":"Problem instruction: Choose the most appropriate option in the given situation.\n\nActual question: To interpret phrases like \"not cancer\" reliably in medical text, what should you use first? ① Self-attention that looks at contextual words together ② Word frequency only ③ Only the last word","scenario_2":"Problem instruction: Choose the most appropriate option in the given situation.\n\nActual question: In a translation model, what element should you check first to better capture subject-verb agreement? 
① Attention head settings ② Image augmentation ③ Pixel normalization","scenario_3":"Problem instruction: Choose the most appropriate option in the given situation.\n\nActual question: In generating fraud-transaction explanations, what do you need to reflect relationships among transaction records? ① Compute token-to-token weights ② Delete samples ③ Only reduce the classes","vote_0":"Problem instruction: Compute a weighted ensemble score.\n\nActual question: Head reliability weights are [3,2,1,2,1] and binary votes are [1,1,0,1,0]. What is the weighted sum over positive (1) votes?","vote_1":"Problem instruction: Threshold counting.\n\nActual question: Layer probabilities are [0.92,0.63,0.71,0.48,0.83,0.69]. Treat values ≥ 0.7 as positive. How many positives?","vote_2":"Problem instruction: Count class occurrences.\n\nActual question: 3-class prediction labels are [2,0,1,2,1,0,2,2]. How many times is class 2 predicted?","vote_3":"Problem instruction: Ensemble margin.\n\nActual question: Class A has 7 votes and class B has 4. What is A − B?","scenario_4":"Problem instruction: Choose the most appropriate option in the given situation.\n\nActual question: In legal summarization, to connect distant clauses, what structure should you apply first? ① Self-attention ② A 1-gram frequency table ③ Random selection","scenario_5":"Problem instruction: Choose the most appropriate option in the given situation.\n\nActual question: If a news summarization model misses a key sentence, what should you check first? ① The distribution of attention weights ② File extensions ③ Folder names","scenario_6":"Problem instruction: Choose the most appropriate option in the given situation.\n\nActual question: In multilingual translation, what is the most natural thing to tune to reduce word alignment errors? 
① The number of heads and the dimension ② Monitor brightness ③ Mouse speed","scenario_7":"Problem instruction: Choose the most appropriate option in the given situation.\n\nActual question: In long-document classification, if information from the early sentences is lost, what is the most relevant direction? ① Strengthen global context reference ② Delete all tokens ③ Remove the label","scenario_8":"Problem instruction: Choose the most appropriate option in the given situation.\n\nActual question: In customer complaint detection, how can you preserve the context of “refund not yet”? ① Reflect the relationship between the negation and key words using attention ② Use only word length ③ Use only numbers","scenario_9":"Problem instruction: Choose the most appropriate option in the given situation.\n\nActual question: In an experiment, multi-head attention was more stable than a single head. What is the most reasonable reason? ① Combine multiple perspectives ② Automatically augment data ③ Ignore the loss","vote_4":"Problem instruction: Reliability-weighted sum.\n\nActual question: Reliabilities are [4,3,2,1,2,3,1,2] and votes are [1,1,1,0,1,0,1,1]. What is the sum of reliabilities where the vote is 1?","vote_5":"Problem instruction: Threshold counting.\n\nActual question: Layer probabilities are [0.4,0.7,0.2,0.8,0.1,0.6,0.3,0.9,0.55,0.65]. Treat values ≥ 0.6 as positive. How many positives?","vote_6":"Problem instruction: Compare two layers.\n\nActual question: Layer A=[1,0,1,0,1,0,1,0,1,0,1,0], Layer B=[1,1,1,0,0,0,1,1,1,0,1,1]. How many positions differ?","vote_7":"Problem instruction: Compare two layers.\n\nActual question: Layer A=[1,1,0,0,1,1,0,0,1,1,0,0], Layer B=[1,0,0,1,1,0,0,1,1,0,0,1]. 
How many positions have both A and B equal to 1?","vote_8":"Problem instruction: Signed class balance.\n\nActual question: For vote vector [0,0,0,1,1,1,1,1,0,1], what is (# of 1s) − (# of 0s)?","vote_9":"Problem instruction: Compare early vs late segments.\n\nActual question: Early votes [1,1,1,1,1,0], late votes [0,0,1,0,1,0]. What is (early positives − late positives)?","aggregate_0":"Problem instruction: Calculate the model prediction aggregation.\n\nActual question: If the class-1 prediction counts of three heads are [2,1,2], what is the total sum?","aggregate_1":"Problem instruction: Calculate the model prediction aggregation.\n\nActual question: If the spam prediction counts of four heads are [3,2,1,2], what is the total number of spam predictions?","aggregate_2":"Problem instruction: Calculate the model prediction aggregation.\n\nActual question: If five heads give scores to class 2 as [4,4,3,5,4], what is the sum?","aggregate_3":"Problem instruction: Calculate the model prediction aggregation.\n\nActual question: If the number of normal transactions per head is [6,5,7,6], what is the total number?","ensemble_0":"Problem instruction: Choose the correct statement about the ensemble principle.\n\nActual question: What is the main benefit of combining multi-head outputs? ① Combining diverse representations improves generalization ② It removes parameters ③ It stops training","ensemble_1":"Problem instruction: Choose the correct statement about the ensemble principle.\n\nActual question: When different heads look at different relationships, what effect should you expect? ① Increased chance that errors cancel out ② The same error always happens ③ Only information loss increases","ensemble_2":"Problem instruction: Choose the correct statement about the ensemble principle.\n\nActual question: The most reasonable reason multi-head is stronger than a single head is? 
① Splitting the feature space allows parallel learning ② Force the number of tokens to become 1 ③ Remove softmax","ensemble_3":"Problem instruction: Choose the correct statement about the ensemble principle.\n\nActual question: From an ensemble perspective, what is the correct caution when increasing the number of heads? ① Check the balance between performance and computation ② Computation always decreases ③ Increase without validation","aggregate_4":"Problem instruction: Calculate the model prediction aggregation.\n\nActual question: What is the sum of six head scores [5,4,6,5,4,6]?","aggregate_5":"Problem instruction: Calculate the model prediction aggregation.\n\nActual question: If the class-0 counts are [7,8,6,9], what is the total sum?","aggregate_6":"Problem instruction: Calculate the model prediction aggregation.\n\nActual question: What is the sum of the keyword-matching counts per head [10,12,11,9,8]?","aggregate_7":"Problem instruction: Calculate the model prediction aggregation.\n\nActual question: What is the sum of positive prediction counts per batch [14,16,15]?","aggregate_8":"Problem instruction: Calculate the model prediction aggregation.\n\nActual question: What is the sum of the error counts across eight heads [1,2,1,2,1,2,1,2]?","aggregate_9":"Problem instruction: Calculate the model prediction aggregation.\n\nActual question: What is the sum of the interest-token counts per head [3,5,7,9,11]?","config_0":"Problem instruction: Calculate the model configuration.\n\nActual question: If the number of heads is 4 and each head dimension is 16, what is the model dimension $d_{model}$?","config_1":"Problem instruction: Calculate the model configuration.\n\nActual question: If the number of heads is 8 and each head dimension is 8, what is the model dimension $d_{model}$?","config_2":"Problem instruction: Calculate the model configuration.\n\nActual question: When the number of tokens is 10, the attention score matrix size (number of elements) is 
$10\\times10$. How many elements are there?","config_3":"Problem instruction: Calculate the model configuration.\n\nActual question: When the number of tokens is 12, the number of elements in the score matrix is $12\\times12$. What is the value?","config_4":"Problem instruction: Calculate the model configuration.\n\nActual question: If the number of heads is 6 and each head dimension is 12, what is $d_{model}$?","config_5":"Problem instruction: Calculate the model configuration.\n\nActual question: If the number of heads is 3 and each head dimension is 24, what is $d_{model}$?","config_6":"Problem instruction: Calculate the model configuration.\n\nActual question: When the sequence length is 14, the number of self-attention score elements is $14\\times14$. What is the value?","config_7":"Problem instruction: Calculate the model configuration.\n\nActual question: When the sequence length is 16, the number of score elements is $16\\times16$. What is the value?","config_8":"Problem instruction: Calculate the model configuration.\n\nActual question: If the number of heads is 12 and each head dimension is 10, what is $d_{model}$?","config_9":"Problem instruction: Calculate the model configuration.\n\nActual question: When the number of tokens is 20, the number of elements in the score matrix is $20\\times20$. What is the value?","ensemble_4":"Problem instruction: Choose the correct statement about the ensemble principle.\n\nActual question: Why can you expect variance reduction when combining different heads? ① The errors from different heads partially cancel out ② All heads are always perfect ③ Learning data is unnecessary","ensemble_5":"Problem instruction: Choose the correct statement about the ensemble principle.\n\nActual question: From an ensemble perspective, what is the purpose of increasing head diversity? 
① Make it so the same input reveals different features ② Copy all heads identically ③ Fix the weights","ensemble_6":"Problem instruction: Choose the correct statement about the ensemble principle.\n\nActual question: When deciding the number of heads in a real service, what is the most important factor? ① Balance accuracy improvement and latency ② Always use the maximum number of heads ③ Always use the minimum number of heads","ensemble_7":"Problem instruction: Choose the correct statement about the ensemble principle.\n\nActual question: If performance does not improve even after combining multiple heads, what should you check first? ① Whether heads only see very similar patterns ② The length of token names ③ File colors"},"problemAnswers":{"concept_0":1,"concept_1":1,"concept_2":2,"concept_3":1,"concept_4":1,"concept_5":1,"ox_0":1,"ox_1":0,"ox_2":1,"ox_3":0,"ox_4":1,"ox_5":1,"scenario_0":1,"scenario_1":1,"scenario_2":1,"scenario_3":1,"vote_0":7,"vote_1":3,"vote_2":4,"vote_3":3,"scenario_4":1,"scenario_5":1,"scenario_6":1,"scenario_7":1,"scenario_8":1,"scenario_9":1,"vote_4":14,"vote_5":5,"vote_6":4,"vote_7":3,"vote_8":2,"vote_9":3,"aggregate_0":5,"aggregate_1":8,"aggregate_2":20,"aggregate_3":24,"ensemble_0":1,"ensemble_1":1,"ensemble_2":1,"ensemble_3":1,"aggregate_4":30,"aggregate_5":30,"aggregate_6":50,"aggregate_7":45,"aggregate_8":12,"aggregate_9":35,"config_0":64,"config_1":64,"config_2":100,"config_3":144,"config_4":72,"config_5":72,"config_6":196,"config_7":256,"config_8":120,"config_9":400,"ensemble_4":1,"ensemble_5":1,"ensemble_6":1,"ensemble_7":1},"problemSolutions":{"concept_0":"This question asks for the definition of self-attention: the key idea is whether each token simultaneously references the whole set of tokens. Only option ① matches this definition. In practice, for spam classification you need relationships between surrounding words (e.g., “free + click”), not just one isolated word, which reduces false positives. 
Therefore, the correct answer is 1.","concept_1":"Query is the question vector that represents what you want to find. Key is the matching criterion, and Value is the actual information to retrieve. In medical document classification, Query helps the current token find the contextual clues it needs, then compares with Key to fetch the important Value. Therefore, the correct answer is 1.","concept_2":"$$d_k$ is the dimension of the Key vectors. When the dimension is large, the variance of dot products grows and softmax may become overly one-sided, so we divide by $\\sqrt{d_k}$ as scaling. This scaling is crucial for training stability, and it’s also used to prevent training blow-ups in translation models. The correct answer is 2.","concept_3":"Multi-head attention increases representational power by letting you view relationships from multiple perspectives at the same time. For example, one head can capture grammar while another can capture connections between named entities. In sentiment analysis of customer reviews, if a separate head captures relationships involving negation, accuracy improves. The correct answer is 1.","concept_4":"Self-attention is strong for long-range dependency because it can directly reference tokens at arbitrary distances within a single layer. This is especially useful when early clauses change the meaning later, such as in legal documents. Therefore, the correct answer is 1.","concept_5":"Spam email classification depends heavily on interactions between words. Self-attention improves classification performance by reflecting contextual relationships as attention weights. Steps: (1) tokenize (2) compute relation scores (3) incorporate important context (4) classify. The correct answer is 1.","ox_0":"This is true because it matches the definition of self-attention. In practice, the key for strong translation and summarization is that each token looks at the whole set at the same time. 
Answer: 1.","ox_1":"This is false: Q, K, and V have different roles. Without distinguishing them, relationship computation would not make sense. Even in fraud-transaction detection logs, separating question/matching/content is important. Answer: 0.","ox_2":"True. The $\\sqrt{d_k}$ scaling prevents softmax saturation caused by large dot products and helps stable learning. Answer: 1.","ox_3":"False. Multi-head attention usually makes representations richer by learning diverse patterns, not simpler. Answer: 0.","ox_4":"softmax normalizes into probabilities, so the sum of weights in one row is 1. Therefore it is true. Answer: 1.","ox_5":"True. Self-attention is widely used in translation, summarization, classification, and question answering. Answer: 1.","scenario_0":"To capture relationships between distant words in long logs, self-attention is appropriate because it allows global reference. With average pooling only, you easily lose the direction of relationships. This is especially effective when an early negation flips the meaning later in customer-support complaint detection. Answer: 1.","scenario_1":"In “not cancer,” you must consider the relationship between the negation word and the disease name. Self-attention reflects the interaction between the two tokens directly, reducing the risk of misdiagnosis. Steps: (1) compute token relation scores (2) incorporate negation weights (3) perform final classification. Answer: 1.","scenario_2":"Subject-verb agreement is a long-distance dependency problem, so the design of attention heads is the primary thing to check. Image augmentation and pixel normalization are not the top priority for text translation. Answer: 1.","scenario_3":"To reflect relationships among transaction records, you need to compute token-to-token weights. This is essentially what self-attention does. In generating fraud explanations, you can also group evidence tokens to improve interpretability. 
Answer: 1.","vote_0":"Elementwise multiply weights [3,2,1,2,1] by votes [1,1,0,1,0] and sum: $3\\cdot1+2\\cdot1+1\\cdot0+2\\cdot1+1\\cdot0=7$. Answer: 7.","vote_1":"Values ≥ 0.7 are 0.92, 0.71, and 0.83 → 3 positives. Answer: 3.","vote_2":"In labels [2,0,1,2,1,0,2,2], class 2 appears 4 times. Answer: 4.","vote_3":"Margin: $7-4=3$. Answer: 3.","scenario_4":"Linking distant clauses in legal summarization is a classic long-range dependency problem, so self-attention is the best choice. Answer: 1.","scenario_5":"Missing a key sentence often happens when the attention distribution is biased toward one side. Checking the weight distribution first is a practical approach. Answer: 1.","scenario_6":"Word alignment errors in multilingual translation are directly related to attention components such as the number of heads and the head dimensions. Answer: 1.","scenario_7":"If early-sentence information is lost, you should counter it by strengthening global reference (using self-attention and adjusting layers/heads). Answer: 1.","scenario_8":"The key is to look at the relationship between the negation and the important words together. This is especially important in sentiment analysis and complaint detection. Answer: 1.","scenario_9":"The core reason multi-head attention improves stability is combining multiple perspectives. By learning different patterns in parallel, generalization improves. Answer: 1.","vote_4":"Add reliabilities only where vote=1: $4+3+2+2+1+2=14$. Answer: 14.","vote_5":"Values ≥ 0.6 are 0.7, 0.8, 0.6, 0.9, 0.65 → 5 positives. Answer: 5.","vote_6":"A and B differ at 4 positions. Answer: 4.","vote_7":"Positions where both are 1: indices 1, 5, 9 → 3. Answer: 3.","vote_8":"Six 1s and four 0s → $6-4=2$. Answer: 2.","vote_9":"Early has five 1s, late has two 1s → $5-2=3$. Answer: 3.","aggregate_0":"Aggregation sum: $2+1+2=5$. Prediction aggregation is the first step of combining head outputs with a simple or weighted sum. 
Answer: 5.","aggregate_1":"Total: $3+2+1+2=8$. Even in spam-detection operations, you sum head outputs per batch and compare against a threshold. Answer: 8.","aggregate_2":"Score sum: $4+4+3+5+4=20$. Steps: (1) check each head's score (2) sum (3) pick the class with the highest score. Answer: 20.","aggregate_3":"Sum: $6+5+7+6=24$. Similar table-style aggregation is also used in financial anomaly detection. Answer: 24.","ensemble_0":"Multi-head attention increases generalization by combining diverse representations. Reducing single-view bias is the key idea. Answer: 1.","ensemble_1":"If different heads see different patterns, some errors may cancel out. This is the basic principle of ensembles. Answer: 1.","ensemble_2":"Splitting the feature space and observing in parallel is multi-head attention's strength. Reducing the token count or removing softmax is not the essence. Answer: 1.","ensemble_3":"Increasing the number of heads can improve performance but also increases computation. So you should check the trade-off and balance. Answer: 1.","aggregate_4":"Sum: $5+4+6+5+4+6=30$. Answer: 30.","aggregate_5":"Sum: $7+8+6+9=30$. Answer: 30.","aggregate_6":"Sum: $10+12+11+9+8=50$. Answer: 50.","aggregate_7":"Sum: $14+16+15=45$. Answer: 45.","aggregate_8":"Sum: $1+2+1+2+1+2+1+2=12$. Answer: 12.","aggregate_9":"Sum: $3+5+7+9+11=35$. Answer: 35.","config_0":"The model dimension is usually $d_{model}=head\\_count \\times head\\_dim$. Compute: $4\\times16=64$. Answer: 64.","config_1":"Compute: $8\\times8=64$. A common integer setup for lightweight translation models. Answer: 64.","config_2":"The number of elements in the score matrix is the square of the number of tokens. Compute: $10\\times10=100$. Answer: 100.","config_3":"Compute: $12\\times12=144$. It shows that longer length makes the computation grow quadratically. Answer: 144.","config_4":"Compute: $6\\times12=72$. Answer: 72.","config_5":"Compute: $3\\times24=72$. 
You can form the same $d_{model}$ with a different head configuration. Answer: 72.","config_6":"Compute: $14\\times14=196$. It illustrates why the computational load increases for long sequences. Answer: 196.","config_7":"Compute: $16\\times16=256$. Answer: 256.","config_8":"Compute: $12\\times10=120$. Answer: 120.","config_9":"Compute: $20\\times20=400$. This is why costs increase as sequence length grows in search and document summarization. Answer: 400.","ensemble_4":"If the errors of different heads are not exactly the same, combining them can reduce variance. Answer: 1.","ensemble_5":"The purpose of head diversity is to let different features be observed so that you get the benefit of combination. Answer: 1.","ensemble_6":"In real services, you must satisfy both accuracy and latency (SLA), so balance is the key. Answer: 1.","ensemble_7":"If performance does not improve, you should first check for insufficient head diversity. If heads only learn similar patterns, the ensemble benefit is small. 
Answer: 1."},"problemTestCodes":{"concept_0":"answer = 1\nassert answer == 1","concept_1":"answer = 1\nassert answer == 1","concept_2":"answer = 2\nassert answer == 2","concept_3":"answer = 1\nassert answer == 1","concept_4":"answer = 1\nassert answer == 1","concept_5":"answer = 1\nassert answer == 1","ox_0":"answer = 1\nassert answer == 1","ox_1":"answer = 0\nassert answer == 0","ox_2":"answer = 1\nassert answer == 1","ox_3":"answer = 0\nassert answer == 0","ox_4":"answer = 1\nassert answer == 1","ox_5":"answer = 1\nassert answer == 1","scenario_0":"answer = 1\nassert answer == 1","scenario_1":"answer = 1\nassert answer == 1","scenario_2":"answer = 1\nassert answer == 1","scenario_3":"answer = 1\nassert answer == 1","vote_0":"weights = [3,2,1,2,1]\nvotes = [1,1,0,1,0]\nassert sum(w*v for w, v in zip(weights, votes)) == 7","vote_1":"probs = [0.92,0.63,0.71,0.48,0.83,0.69]\nassert sum(1 for p in probs if p >= 0.7) == 3","vote_2":"labels = [2,0,1,2,1,0,2,2]\nassert sum(1 for y in labels if y == 2) == 4","vote_3":"a_votes = 7\nb_votes = 4\nassert a_votes - b_votes == 3","scenario_4":"answer = 1\nassert answer == 1","scenario_5":"answer = 1\nassert answer == 1","scenario_6":"answer = 1\nassert answer == 1","scenario_7":"answer = 1\nassert answer == 1","scenario_8":"answer = 1\nassert answer == 1","scenario_9":"answer = 1\nassert answer == 1","vote_4":"weights = [4,3,2,1,2,3,1,2]\nvotes = [1,1,1,0,1,0,1,1]\nassert sum(w*v for w, v in zip(weights, votes)) == 14","vote_5":"probs = [0.4,0.7,0.2,0.8,0.1,0.6,0.3,0.9,0.55,0.65]\nassert sum(1 for p in probs if p >= 0.6) == 5","vote_6":"a = [1,0,1,0,1,0,1,0,1,0,1,0]\nb = [1,1,1,0,0,0,1,1,1,0,1,1]\nassert sum(1 for x, y in zip(a, b) if x != y) == 4","vote_7":"a = [1,1,0,0,1,1,0,0,1,1,0,0]\nb = [1,0,0,1,1,0,0,1,1,0,0,1]\nassert sum(1 for x, y in zip(a, b) if x == 1 and y == 1) == 3","vote_8":"votes = [0,0,0,1,1,1,1,1,0,1]\nones = sum(votes)\nzeros = len(votes) - ones\nassert ones - zeros == 2","vote_9":"early = 
[1,1,1,1,1,0]\nlate = [0,0,1,0,1,0]\nassert sum(early) - sum(late) == 3","aggregate_0":"values = [2,1,2]\ntotal = sum(values)\nassert total == 5","aggregate_1":"values = [3,2,1,2]\nassert sum(values) == 8","aggregate_2":"values = [4,4,3,5,4]\nassert sum(values) == 20","aggregate_3":"values = [6,5,7,6]\nassert sum(values) == 24","ensemble_0":"answer = 1\nassert answer == 1","ensemble_1":"answer = 1\nassert answer == 1","ensemble_2":"answer = 1\nassert answer == 1","ensemble_3":"answer = 1\nassert answer == 1","aggregate_4":"values = [5,4,6,5,4,6]\nassert sum(values) == 30","aggregate_5":"values = [7,8,6,9]\nassert sum(values) == 30","aggregate_6":"values = [10,12,11,9,8]\nassert sum(values) == 50","aggregate_7":"values = [14,16,15]\nassert sum(values) == 45","aggregate_8":"values = [1,2,1,2,1,2,1,2]\nassert sum(values) == 12","aggregate_9":"values = [3,5,7,9,11]\nassert sum(values) == 35","config_0":"head_count, head_dim = 4, 16\nd_model = head_count * head_dim\nassert d_model == 64","config_1":"head_count, head_dim = 8, 8\nd_model = head_count * head_dim\nassert d_model == 64","config_2":"tokens = 10\ncells = tokens * tokens\nassert cells == 100","config_3":"tokens = 12\ncells = tokens * tokens\nassert cells == 144","config_4":"head_count, head_dim = 6, 12\nassert head_count * head_dim == 72","config_5":"head_count, head_dim = 3, 24\nassert head_count * head_dim == 72","config_6":"tokens = 14\nassert tokens * tokens == 196","config_7":"tokens = 16\nassert tokens * tokens == 256","config_8":"head_count, head_dim = 12, 10\nassert head_count * head_dim == 120","config_9":"tokens = 20\nassert tokens * tokens == 400","ensemble_4":"answer = 1\nassert answer == 1","ensemble_5":"answer = 1\nassert answer == 1","ensemble_6":"answer = 1\nassert answer == 1","ensemble_7":"answer = 1\nassert answer == 
1"},"problemDifficulty":{"concept_0":"easy","concept_1":"easy","concept_2":"easy","concept_3":"easy","concept_4":"easy","concept_5":"easy","ox_0":"easy","ox_1":"easy","ox_2":"easy","ox_3":"easy","ox_4":"easy","ox_5":"easy","scenario_0":"easy","scenario_1":"easy","scenario_2":"easy","scenario_3":"easy","vote_0":"easy","vote_1":"easy","vote_2":"easy","vote_3":"easy","scenario_4":"medium","scenario_5":"medium","scenario_6":"medium","scenario_7":"medium","scenario_8":"medium","scenario_9":"medium","vote_4":"medium","vote_5":"medium","vote_6":"medium","vote_7":"medium","vote_8":"medium","vote_9":"medium","aggregate_0":"medium","aggregate_1":"medium","aggregate_2":"medium","aggregate_3":"medium","ensemble_0":"medium","ensemble_1":"medium","ensemble_2":"medium","ensemble_3":"medium","aggregate_4":"hard","aggregate_5":"hard","aggregate_6":"hard","aggregate_7":"hard","aggregate_8":"hard","aggregate_9":"hard","config_0":"hard","config_1":"hard","config_2":"hard","config_3":"hard","config_4":"hard","config_5":"hard","config_6":"hard","config_7":"hard","config_8":"hard","config_9":"hard","ensemble_4":"hard","ensemble_5":"hard","ensemble_6":"hard","ensemble_7":"hard"},"problemOrder":["concept_0","concept_1","concept_2","concept_3","concept_4","concept_5","ox_0","ox_1","ox_2","ox_3","ox_4","ox_5","scenario_0","scenario_1","scenario_2","scenario_3","vote_0","vote_1","vote_2","vote_3","scenario_4","scenario_5","scenario_6","scenario_7","scenario_8","scenario_9","vote_4","vote_5","vote_6","vote_7","vote_8","vote_9","aggregate_0","aggregate_1","aggregate_2","aggregate_3","ensemble_0","ensemble_1","ensemble_2","ensemble_3","aggregate_4","aggregate_5","aggregate_6","aggregate_7","aggregate_8","aggregate_9","config_0","config_1","config_2","config_3","config_4","config_5","config_6","config_7","config_8","config_9","ensemble_4","ensemble_5","ensemble_6","ensemble_7"]},"advDlCh02":{"chapter":"Chapter 02","title":"Transformer: Positional Encoding and 
Feed-Forward","description":"Self-attention captures **relationships between tokens** well, but on its own it does not tell the model **which position in the sentence** each token occupies. Transformers therefore **add positional encoding (PE)** to token embeddings so the model knows **which word comes where**. After mixing relations in a block, a **feed-forward (FFN)** layer updates each token representation in depth. This chapter explains sinusoidal PE, how it differs from learned positional embeddings, and the **per-token MLP** role of FFN in a beginner-friendly way.","sectionTitle":"Transformer: Positional Encoding and Feed-Forward","whatIs":{"0":"**1. Concept: why positional encoding?**\n\nSelf-attention scores all tokens at once; if the inputs are only token embeddings, the sequence behaves like an unordered bag and **first vs last** can blur. **Positional encoding** builds a vector $PE(p)$ of length $d_{model}$ for each position $p$ and **adds** it to embeddings.\n\n**Intuition:** like seat row/column labels in a theater—PE tags each token seat.\n\n**Math:** let the token embedding be $x_t \\in \\mathbb{R}^{d_{model}}$; often $h_t^{(0)} = x_t + PE(t)$.\n\n**Use:** in translation, summarization, and QA, word order matters, so models such as BERT and GPT add position information.","1":"$2c","2":"**3. Concept: feed-forward (FFN) — a “deep chat” per token**\n\n**One line:** **Attention** is where tokens **mix with each other**; **FFN** is the next step where **each token lane stays separate** and the **same** small network runs **once per lane** (like the green **compute blocks** in the figure above).\n\n**Analogy:** after a group meeting (attention), everyone walks into a booth **one by one** for a **private follow-up** (FFN). The vector width $d_{model}$ is often **expanded** in the middle (wider hidden) and then **compressed** back, giving a narrow-wide-narrow shape.\n\n**Why bother?** Attention is mostly linear maps + mixing; FFN adds **nonlinearity** (e.g. 
**ReLU**, $\\max(0,\\cdot)$) so the model can learn **curved, complex rules**, not only straight-line patterns.\n\n**Math (reference):** $\\mathrm{FFN}(x)=\\max(0,xW_1+b_1)W_2+b_2$. Weights $W_1,W_2$ are usually **shared across all positions**.\n\n**Use:** NER, sentiment—attention collects context; FFN sharpens each token.","3":"**4. Concept: flow inside one block — one conveyor step**\n\n**One line:** Each encoder **block** is like **one station** on an assembly line: always the **same order** of steps.\n\n**Easy order:**\n1. **Start (first block only):** add **PE** to embeddings so each token “knows” its slot.\n2. **Mix:** **Attention** lets tokens swap context.\n3. **Stabilize:** **Add & Norm**: add a **skip/residual** connection so signals don’t vanish, then apply **layer normalization** to keep scales stable.\n4. **Per-token polish:** **FFN** updates **each lane** with nonlinearity.\n5. **Again Add & Norm** to finish the station.\n\n**Math (reference):** $h'=\\mathrm{LayerNorm}(h+\\mathrm{Attn}(h))$, then $h''=\\mathrm{LayerNorm}(h'+\\mathrm{FFN}(h'))$. Stack many such blocks to build rich representations.\n\n**Use:** models for search, chatbots, and code generation repeat this block dozens of times."},"whyImportant":{"0":"**Order changes meaning**\n\n\"I ate rice\" and \"rice ate I\" contain the same words but mean different things. Without PE, models struggle to keep this consistent. Fraud logs also rely on **time order**.","1":"**FFN brings nonlinearity**\n\nAttention is largely linear maps plus softmax mixing; FFN expands, applies ReLU/GELU, and learns **complex rules**—e.g., symptom combinations in clinical text.","2":"**Compute trade-offs**\n\nLarger $d_{ff}$ and depth raise quality but also GPU cost and latency—key for production tuning.","3":"**Foundation for modern models**\n\nAbsolute embeddings, sinusoidal PE, RoPE, ALiBi… the theme is **encode order in tensors**. FFN + attention blocks underpin BERT, GPT, ViT."},"howUsed":{"0":"**Pipeline: tokenize → embed → +PE**\n\nTokenize, multiply by an embedding matrix, add position vectors. 
Libraries expose max_position_embeddings for learned tables. Long-document QA must co-design **context length**.","1":"**FFN hyperparameters**\n\nintermediate_size ($d_{ff}$), activation (GELU), dropout. Example: $d_{model}=768$ → $d_{ff}=3072$ is common. Code models may widen FFN for syntax/style.","2":"**Decoder note**\n\nMasked attention hides future tokens, but PE still marks **left-to-right order** for generation quality.","3":"**Debugging hints**\n\nIf order matters, inspect PE/RoPE/context length; if representations are flat, inspect FFN width/depth/activation—common for spam/news tasks."},"problemSolving":{"0":"PE + FFN questions are easiest when you split roles: **order** comes from positional encoding, **token-to-token mixing** from attention, **per-token nonlinearity** from FFN. The usual recipe is $h=x+PE(pos)$, and FFN weights are typically **shared** across positions in a layer. Larger $d_{ff}$, depth, and context length move capacity and cost together.","2":"$2d","3":"**Definition** — \"Self-attention alone fully encodes absolute order without PE.\" True=1, False=0 → False. **Answer 0**\n\n---\n\n**T/F** — \"Sinusoidal PE stacks multiple frequency components to distinguish positions.\" → True. **Answer 1**\n\n---\n\n**Choice** — \"FFN primarily transforms: ① each token vector ② batch indices only\" → ①. **Answer 1**\n\n---\n\n**Calc** — \"$N=50$ tokens → score matrix cells (dense)?\" → $2500$. **Answer 2500**"},"summary":"Half of why transformers work is attention, but you still need a reliable way to tell the model **which slot** each token occupies. Sinusoidal PE overlays multiple frequency waves so each position gets a distinct pattern added to embeddings. Later, attention mixes tokens while FFN applies the same nonlinear transformation at every position to refine features. 
The expand-then-contract FFN is the practical knob between quality and compute—shared across translation, summarization, classification, and generation.","sectionLabels":{"whatIs":"What the idea is","whyImportant":"Why it matters","howUsed":"How it is used","summary":"Summary"},"formulaGuide":{"title":"Reading the formulas","linear":"In $h_t^{(0)} = x_t + PE(t)$, $x_t$ is the token embedding and $PE(t)$ is the vector for position $t$. You **add content and order (as numbers)** to form the model input.","xavierVariance":"Sinusoidal PE uses $PE(t,2i)=\\sin(t/10000^{2i/d})$ and $PE(t,2i+1)=\\cos(t/10000^{2i/d})$ to encode position with multiple frequencies $i$. Here $d$ is $d_{model}$ and $t$ is the token index.","heVariance":"In $\\mathrm{FFN}(h)=W_2\\,\\sigma(W_1 h+b_1)+b_2$, $\\sigma$ is a nonlinearity, $W_1$ maps $d_{model}\\to d_{ff}$, and $W_2$ maps $d_{ff}\\to d_{model}$.","xavierUniform":"**Weight sharing**—the same FFN at every position—helps generalization and keeps implementation simple."},"visual":"Interactive visualization of positional encoding and FFN flow.","problemSolvingLabel":"How to approach the exercises","practiceProblemsTitle":"Practice problems","practiceProblemsIntro":"Below are 10 problems randomly drawn from a pool of 60. Difficulty mix: 4 easy, 3 medium, 3 hard. 
Enter **integers only**.","practiceProblemsInstruction":"Read the prompt and the question, then enter your answer as an integer.","practiceProblemsInstructionConcept":"Read the prompt and options ①②③, then enter one choice index.","practiceProblemsInstructionOx":"Enter 1 if the statement is true, 0 if false.","practiceProblemsInstructionScenario":"Read the question and options ①②③, then enter one choice index.","practiceProblemsInstructionVote":"Enter one integer: the count of 1s (sum) in the given binary vector.","practiceProblemsInstructionAggregate":"Enter one integer: the sum of the given numbers.","practiceProblemsInstructionConfig":"Read the grid/setup prompt, then enter one integer (e.g. cells in an $n\\times n$ grid = $n^2$).","practiceProblemsInstructionEnsemble":"Read the prompt and options ①②③, then enter one choice index for the best trade-off / statement.","advDlCh02VisualZoneLabelTop":"Top","advDlCh02VisualZoneLabelBottom":"Bottom","advDlCh02VisualIntroTop":"Read **left to right**—each column adds **word meaning** and **order turned into numbers (PE)**.","advDlCh02VisualIntroBottom":"Lanes **don’t mix**; each passes through the **same compute block** once (same weights, same ops).","advDlCh02VisualIntroNote":"Papers call this compute block **FFN**.","advDlCh02VisualStep0":"① **Meaning** + **position-index** info, added together (this is the PE addition)","advDlCh02VisualStep1":"② Then attention mixes information across tokens (not drawn in the figure)","advDlCh02VisualStep2":"③ FFN: wider hidden layer → nonlinear (bend once) → project back to output size","advDlCh02VisualStep3":"④ Add a skip (+), normalize (LayerNorm), then next layer or output","advDlCh02VisualConceptTitle":"① Build inputs → (middle steps omitted) → ② Same FFN per lane","advDlCh02VisualBridgeLead":"**①** first, then **②**—in order inside one block.","advDlCh02VisualBridgeBlock1":"**①** adds **meaning + order (PE)** to build the **input**. 
(Attention in between is skipped in the figure.)","advDlCh02VisualBridgeBlock2":"**②** then refines **each lane** with the **same FFN**. Lanes **don’t mix**.","advDlCh02VisualBridgeMicroCaption":"Order inside one block","advDlCh02VisualAnimHint":"The diagram slowly highlights each step (~7s each).","advDlCh02VisualAnimStepPe":"① Input","advDlCh02VisualAnimStepBridge":"Link","advDlCh02VisualAnimStepFfn":"② FFN","advDlCh02VisualFlowTitle":"Big picture: split → add order info → repeat layers → predict","advDlCh02VisualModelTitle":"In one line: meaning+order vectors pass through layers","advDlCh02VisualInputTokenLabel":"Input token + position","advDlCh02VisualTokenRelationLabel":"Token embedding + PE sum","advDlCh02VisualContextVectorOutputLabel":"Updated per-token representation","advDlCh02VisualContextVectorExplainLine1":"The FFN at each position","advDlCh02VisualContextVectorExplainLine2":"uses the same MLP (nonlinear)","advDlCh02VisualCoreFormulaLabel":"In math: **meaning+order (PE)** as $h{+}PE$, then **each lane** refines with $\\mathrm{FFN}(h)$","advDlCh02VisualLegendWeak":"Low intermediate activation","advDlCh02VisualLegendMedium":"Medium","advDlCh02VisualLegendStrong":"High intermediate activation","advDlCh02VisualCurrentSuffix":" (current)","advDlCh02VisualPanelPeTitle":"① Put meaning + order numbers (PE) together","advDlCh02VisualPanelFfnTitle":"② Same compute block polishes each lane (FFN)","advDlCh02VisualTrainCaption":"Like noting **which word in the sentence** this is, as numbers.","advDlCh02VisualSameMachineHint":"Four lanes, no cross-talk—same compute block each time","advDlCh02VisualMachineIn":"input","advDlCh02VisualMachineMid":"wider layer","advDlCh02VisualMachineOut":"output","advDlCh02VisualMachineAct":"nonlinear","advDlCh02VisualEmbShort":"meaning","advDlCh02VisualPosShort":"pos","advDlCh02VisualPosSlotShort":"index","advDlCh02VisualPeShort":"order","advDlCh02VisualSumPrimary":"{slot} fused","advDlCh02VisualSumSub":"meaning + 
order","advDlCh02VisualFfnSameNote":"All four lanes: **same compute block** (W₁, W₂ shared)","advDlCh02VisualFfnPerToken":"lane","advDlCh02VisualFfnInLabel":"width","advDlCh02VisualLegendExpand":"widen","advDlCh02VisualLegendNonlin":"nonlinear","advDlCh02VisualLegendProject":"narrow","advDlCh02VisualLegendFfnLabel":"compute block (FFN)","problems":{"concept_0":"Self-attention alone weakens explicit order; which module injects order as vectors? ① Positional encoding ② Dropout only ③ Batch norm only","concept_1":"In the original sinusoidal PE, even dimension index $2i$ usually uses? ① $\\sin$ ② $\\cos$ ③ ReLU","concept_2":"In a transformer block, the FFN does what to each token? ① mixes tokens ② applies the same MLP per token to deepen representations ③ shrinks sequence length","concept_3":"Often $d_{ff}=4d_{model}$. If $d_{model}=128$, a natural $d_{ff}$ is? ① 256 ② 512 ③ 64","concept_4":"Which matches learned positional embeddings? ① add a learned position vector per index ② only $\\sin$ ③ no position","concept_5":"When order of sentences matters for labels, what input must you keep with attention? ① token embedding + position ② pixels only ③ filename only","ox_0":"Additive PE is usually added to token embeddings. True = 1, False = 0.","ox_1":"The FFN applies one softmax over the whole sequence length. True = 1, False = 0.","ox_2":"The same FFN weights are typically shared across positions. True = 1, False = 0.","ox_3":"Sinusoidal PE is designed so periodic patterns can reflect relative distance. True = 1, False = 0.","ox_4":"Usually $d_{ff}$ is smaller than $d_{model}$. True = 1, False = 0.","ox_5":"FFN after attention is widely used in NLP stacks. True = 1, False = 0.","scenario_0":"In medical summaries, order of pre/post drug matters. What to strengthen first? ① order signal incl. PE ② image rotation ③ batch size only","scenario_1":"Spam: \"free\" and \"click now\" are far apart but related. With attention, to add order? 
① embedding + PE ② color space ③ audio sampling only","scenario_2":"Fraud text: amount and time order matter. Which layer widens expressivity? ① per-token FFN ② pooling only ③ regex only","scenario_3":"Legal docs: relative clause distance matters. Classical PE that handles periodic patterns? ① sinusoidal PE ② random drop ③ file extension","scenario_4":"Model confuses \"today\" vs \"tomorrow\". Check first? ① PE + embedding ② monitor DPI ③ font size","scenario_5":"Raising FFN width increases compute. Balance? ① $d_{ff}$ vs latency ② mouse DPI ③ theme color","scenario_6":"Cross-lingual translation with different word order. Preprocess? ① subword emb + PE ② pixel norm only ③ zip only","scenario_7":"Long logs: negation early changes later meaning. Keep order? ① input with PE ② word length only ③ UUID only","scenario_8":"Sentiment: after seeing \"not\" and \"good\", need per-token nonlinearity? ① FFN ② mean only ③ stop","scenario_9":"Removing FFN hurts a lot. Why? ① deep nonlinear token transform is lost ② batch becomes 1 ③ GPU vanishes","vote_0":"Layer votes [1,1,0,1,0] — how many 1s?","vote_1":"Layer votes [1,0,1,1,1,0] — how many 1s?","vote_2":"Layer votes [0,0,1,0,1,1,1,0] — how many 1s?","vote_3":"Layer votes [1,1,1,1,0,0,1,0,1,1] — how many 1s?","vote_4":"Layer votes [1,1,1,0,1,0,1,1] — how many 1s?","vote_5":"Layer votes [0,1,0,1,0,1,0,1,1,1] — how many 1s?","vote_6":"Layer votes [1,0,1,0,1,0,1,0,1,0,1,0] — how many 1s?","vote_7":"Layer votes [1,1,0,0,1,1,0,0,1,1,0,0] — how many 1s?","vote_8":"Layer votes [0,0,0,1,1,1,1,1,0,1] — how many 1s?","vote_9":"Layer votes [1,1,1,1,1,0,0,0,1,0,1,0] — how many 1s?","aggregate_0":"Three heads predict positives [2,1,2]. Sum?","aggregate_1":"Four blocks spam scores [3,2,1,2]. Sum?","aggregate_2":"Five FFN active counts [4,4,3,5,4]. Sum?","aggregate_3":"Four PE match counts [6,5,7,6]. Sum?","aggregate_4":"Six layer scores [5,4,6,5,4,6]. Sum?","aggregate_5":"Class-0 counts [7,8,6,9]. 
Sum?","aggregate_6":"Keyword matches [10,12,11,9,8]. Sum?","aggregate_7":"Batch positives [14,16,15]. Sum?","aggregate_8":"Eight head errors [1,2,1,2,1,2,1,2]. Sum?","aggregate_9":"Position interest counts [3,5,7,9,11]. Sum?","ensemble_0":"Stacking blocks mainly helps by? ① staged abstraction for complex patterns ② zero params ③ no training","ensemble_1":"Why can errors cancel across depth? ① different transforms per layer ② identical outputs ③ delete data","ensemble_2":"Why multi-layer FFN over one? ① repeated nonlinearity boosts expressivity ② force length 1 ③ remove softmax","ensemble_3":"When adding blocks, watch? ① accuracy vs compute vs overfit ② infinite depth always ③ no validation","ensemble_4":"If layers duplicate work? ① smaller gain from redundancy ② always better ③ cannot train","ensemble_5":"Goal of depth? ① staged abstraction ② identical copy ③ freeze","ensemble_6":"Production depth trade-off? ① accuracy vs latency ② refresh rate ③ icon size","ensemble_7":"If stalled, check? ① layers learn same pattern ② filename ③ theme","config_0":"4 heads, head dim 16 → $d_{model}$?","config_1":"8 heads, head dim 8 → $d_{model}$?","config_2":"10 tokens → attention score matrix has $10\\times10$ entries. Value?","config_3":"12 tokens → $12\\times12$ entries. Value?","config_4":"6 heads, head dim 12 → $d_{model}$?","config_5":"3 heads, head dim 24 → $d_{model}$?","config_6":"Length 14 → $14\\times14$ score cells. Value?","config_7":"Length 16 → $16\\times16$. Value?","config_8":"12 heads, head dim 10 → $d_{model}$?","config_9":"20 tokens → $20\\times20$. 
Value?"},"problemAnswers":{"concept_0":1,"concept_1":1,"concept_2":2,"concept_3":2,"concept_4":1,"concept_5":1,"ox_0":1,"ox_1":0,"ox_2":1,"ox_3":1,"ox_4":0,"ox_5":1,"scenario_0":1,"scenario_1":1,"scenario_2":1,"scenario_3":1,"vote_0":3,"vote_1":4,"vote_2":4,"vote_3":7,"scenario_4":1,"scenario_5":1,"scenario_6":1,"scenario_7":1,"scenario_8":1,"scenario_9":1,"vote_4":6,"vote_5":6,"vote_6":6,"vote_7":6,"vote_8":6,"vote_9":7,"aggregate_0":5,"aggregate_1":8,"aggregate_2":20,"aggregate_3":24,"ensemble_0":1,"ensemble_1":1,"ensemble_2":1,"ensemble_3":1,"aggregate_4":30,"aggregate_5":30,"aggregate_6":50,"aggregate_7":45,"aggregate_8":12,"aggregate_9":35,"config_0":64,"config_1":64,"config_2":100,"config_3":144,"config_4":72,"config_5":72,"config_6":196,"config_7":256,"config_8":120,"config_9":400,"ensemble_4":1,"ensemble_5":1,"ensemble_6":1,"ensemble_7":1},"problemSolutions":{"concept_0":"PE supplements order that pure self-attention under-emphasizes. Spam detection also depends on word order. Answer 1.","concept_1":"Even $2i$ uses $\\sin$ in the classic pairing. Clinical timelines tie to this. Answer 1.","concept_2":"FFN applies the same MLP per token (not mixing tokens). Answer 2.","concept_3":"$$4\\times128=512$. Answer 2.","concept_4":"Learned absolute embeddings add a trainable vector per position. Answer 1.","concept_5":"Embeddings plus position are needed. Answer 1.","ox_0":"Additive PE sums into embeddings. Answer 1.","ox_1":"FFN is per position, not one softmax over length. Answer 0.","ox_2":"Weights are usually shared. Answer 1.","ox_3":"Periodic design encodes relative cues. Answer 1.","ox_4":"Typically $d_{ff} \\ge d_{model}$. Answer 0.","ox_5":"Standard NLP blocks use FFN. Answer 1.","scenario_0":"Clinical ordering needs PE-bearing inputs. Answer 1.","scenario_1":"Use embeddings + PE with attention. Answer 1.","scenario_2":"Per-token FFN widens features. Answer 1.","scenario_3":"Sinusoidal PE is classical. Answer 1.","scenario_4":"Check PE+embedding wiring. 
Answer 1.","scenario_5":"Balance width vs latency. Answer 1.","scenario_6":"Subword embeddings + PE. Answer 1.","scenario_7":"Keep order via PE. Answer 1.","scenario_8":"Use FFN for nonlinearity. Answer 1.","scenario_9":"Removing FFN removes deep nonlinear transforms. Answer 1.","vote_0":"Sum is 3. Answer 3.","vote_1":"Sum is 4. Answer 4.","vote_2":"Sum is 4. Answer 4.","vote_3":"Sum is 7. Answer 7.","vote_4":"Sum is 6. Answer 6.","vote_5":"Sum is 6. Answer 6.","vote_6":"Sum is 6. Answer 6.","vote_7":"Sum is 6. Answer 6.","vote_8":"Sum is 6. Answer 6.","vote_9":"Sum is 7. Answer 7.","aggregate_0":"$$2+1+2=5$. Answer 5.","aggregate_1":"$$3+2+1+2=8$. Answer 8.","aggregate_2":"$$4+4+3+5+4=20$. Answer 20.","aggregate_3":"$$6+5+7+6=24$. Answer 24.","ensemble_0":"Depth stacks representations. Answer 1.","ensemble_1":"Different layers transform differently. Answer 1.","ensemble_2":"Repeated nonlinear layers add capacity. Answer 1.","ensemble_3":"Watch overfitting and compute. Answer 1.","aggregate_4":"Sum 30. Answer 30.","aggregate_5":"Sum 30. Answer 30.","aggregate_6":"Sum 50. Answer 50.","aggregate_7":"Sum 45. Answer 45.","aggregate_8":"Sum 12. Answer 12.","aggregate_9":"Sum 35. Answer 35.","config_0":"$$4\\times16=64$. Answer 64.","config_1":"$$8\\times8=64$. Answer 64.","config_2":"$$10\\times10=100$. Answer 100.","config_3":"$$12\\times12=144$. Answer 144.","config_4":"$$6\\times12=72$. Answer 72.","config_5":"$$3\\times24=72$. Answer 72.","config_6":"$$14\\times14=196$. Answer 196.","config_7":"$$16\\times16=256$. Answer 256.","config_8":"$$12\\times10=120$. Answer 120.","config_9":"$$20\\times20=400$. Answer 400.","ensemble_4":"Redundant layers yield smaller gains. Answer 1.","ensemble_5":"Depth enables abstraction stages. Answer 1.","ensemble_6":"Balance accuracy and latency. Answer 1.","ensemble_7":"Check representation diversity. 
Answer 1."},"problemTestCodes":{"concept_0":"answer = 1\nassert answer == 1","concept_1":"answer = 1\nassert answer == 1","concept_2":"answer = 2\nassert answer == 2","concept_3":"answer = 2\nassert answer == 2","concept_4":"answer = 1\nassert answer == 1","concept_5":"answer = 1\nassert answer == 1","ox_0":"answer = 1\nassert answer == 1","ox_1":"answer = 0\nassert answer == 0","ox_2":"answer = 1\nassert answer == 1","ox_3":"answer = 1\nassert answer == 1","ox_4":"answer = 0\nassert answer == 0","ox_5":"answer = 1\nassert answer == 1","scenario_0":"answer = 1\nassert answer == 1","scenario_1":"answer = 1\nassert answer == 1","scenario_2":"answer = 1\nassert answer == 1","scenario_3":"answer = 1\nassert answer == 1","vote_0":"votes = [1,1,0,1,0]\nassert sum(votes) == 3","vote_1":"votes = [1,0,1,1,1,0]\nassert sum(votes) == 4","vote_2":"votes = [0,0,1,0,1,1,1,0]\nassert sum(votes) == 4","vote_3":"votes = [1,1,1,1,0,0,1,0,1,1]\nassert sum(votes) == 7","scenario_4":"answer = 1\nassert answer == 1","scenario_5":"answer = 1\nassert answer == 1","scenario_6":"answer = 1\nassert answer == 1","scenario_7":"answer = 1\nassert answer == 1","scenario_8":"answer = 1\nassert answer == 1","scenario_9":"answer = 1\nassert answer == 1","vote_4":"votes = [1,1,1,0,1,0,1,1]\nassert sum(votes) == 6","vote_5":"votes = [0,1,0,1,0,1,0,1,1,1]\nassert sum(votes) == 6","vote_6":"votes = [1,0,1,0,1,0,1,0,1,0,1,0]\nassert sum(votes) == 6","vote_7":"votes = [1,1,0,0,1,1,0,0,1,1,0,0]\nassert sum(votes) == 6","vote_8":"votes = [0,0,0,1,1,1,1,1,0,1]\nassert sum(votes) == 6","vote_9":"votes = [1,1,1,1,1,0,0,0,1,0,1,0]\nassert sum(votes) == 7","aggregate_0":"values = [2,1,2]\nassert sum(values) == 5","aggregate_1":"values = [3,2,1,2]\nassert sum(values) == 8","aggregate_2":"values = [4,4,3,5,4]\nassert sum(values) == 20","aggregate_3":"values = [6,5,7,6]\nassert sum(values) == 24","ensemble_0":"answer = 1\nassert answer == 1","ensemble_1":"answer = 1\nassert answer == 1","ensemble_2":"answer = 
1\nassert answer == 1","ensemble_3":"answer = 1\nassert answer == 1","aggregate_4":"values = [5,4,6,5,4,6]\nassert sum(values) == 30","aggregate_5":"values = [7,8,6,9]\nassert sum(values) == 30","aggregate_6":"values = [10,12,11,9,8]\nassert sum(values) == 50","aggregate_7":"values = [14,16,15]\nassert sum(values) == 45","aggregate_8":"values = [1,2,1,2,1,2,1,2]\nassert sum(values) == 12","aggregate_9":"values = [3,5,7,9,11]\nassert sum(values) == 35","config_0":"assert 4 * 16 == 64","config_1":"assert 8 * 8 == 64","config_2":"assert 10 * 10 == 100","config_3":"assert 12 * 12 == 144","config_4":"assert 6 * 12 == 72","config_5":"assert 3 * 24 == 72","config_6":"assert 14 * 14 == 196","config_7":"assert 16 * 16 == 256","config_8":"assert 12 * 10 == 120","config_9":"assert 20 * 20 == 400","ensemble_4":"answer = 1\nassert answer == 1","ensemble_5":"answer = 1\nassert answer == 1","ensemble_6":"answer = 1\nassert answer == 1","ensemble_7":"answer = 1\nassert answer == 1"},"problemDifficulty":{"concept_0":"easy","concept_1":"easy","concept_2":"easy","concept_3":"easy","concept_4":"easy","concept_5":"easy","ox_0":"easy","ox_1":"easy","ox_2":"easy","ox_3":"easy","ox_4":"easy","ox_5":"easy","scenario_0":"easy","scenario_1":"easy","scenario_2":"easy","scenario_3":"easy","vote_0":"easy","vote_1":"easy","vote_2":"easy","vote_3":"easy","scenario_4":"medium","scenario_5":"medium","scenario_6":"medium","scenario_7":"medium","scenario_8":"medium","scenario_9":"medium","vote_4":"medium","vote_5":"medium","vote_6":"medium","vote_7":"medium","vote_8":"medium","vote_9":"medium","aggregate_0":"medium","aggregate_1":"medium","aggregate_2":"medium","aggregate_3":"medium","ensemble_0":"medium","ensemble_1":"medium","ensemble_2":"medium","ensemble_3":"medium","aggregate_4":"hard","aggregate_5":"hard","aggregate_6":"hard","aggregate_7":"hard","aggregate_8":"hard","aggregate_9":"hard","config_0":"hard","config_1":"hard","config_2":"hard","config_3":"hard","config_4":"hard","config_5":"hard","
config_6":"hard","config_7":"hard","config_8":"hard","config_9":"hard","ensemble_4":"hard","ensemble_5":"hard","ensemble_6":"hard","ensemble_7":"hard"},"problemOrder":["concept_0","concept_1","concept_2","concept_3","concept_4","concept_5","ox_0","ox_1","ox_2","ox_3","ox_4","ox_5","scenario_0","scenario_1","scenario_2","scenario_3","vote_0","vote_1","vote_2","vote_3","scenario_4","scenario_5","scenario_6","scenario_7","scenario_8","scenario_9","vote_4","vote_5","vote_6","vote_7","vote_8","vote_9","aggregate_0","aggregate_1","aggregate_2","aggregate_3","ensemble_0","ensemble_1","ensemble_2","ensemble_3","aggregate_4","aggregate_5","aggregate_6","aggregate_7","aggregate_8","aggregate_9","config_0","config_1","config_2","config_3","config_4","config_5","config_6","config_7","config_8","config_9","ensemble_4","ensemble_5","ensemble_6","ensemble_7"]},"advDlCh03":{"chapter":"Chapter 03","title":"Transformer lineage: BERT understands, GPT generates","description":"The Transformer evolved into two great lineages. **BERT, from the encoder clan (understanding models)**, reads a whole sentence at a glance; **GPT, from the decoder clan (generation models)**, keeps inventing the next token from what came before. If BERT is the ace of 'college-entrance cloze reading,' GPT is the prodigy of 'word chains and novel writing.' This chapter explains how the two models learn, and why their roles in industry differ completely—using analogies beginners can grasp.","sectionTitle":"Transformer lineage: BERT understands, GPT generates","whatIs":{"0":"**1. BERT: bidirectional reading for “understanding” (encoder)**\n\n**Concept:** BERT (Bidirectional Encoder Representations from Transformers) grows out of the Transformer **encoder** alone. 
The core is **bidirectional context**: words to the left and right are used together to build the most faithful **representation** of what the current word means.\n\n**Intuition:** like a master clinician who lays out past history (left) and today’s tests (right) **at once** and decides holistically; seeing the whole picture makes context understanding strong.\n\n**Math:** BERT’s flagship training objective is **MLM (Masked Language Modeling)**: punch a hole (`[MASK]`) in the sentence and train the model so that $p(w_t \\mid \\text{full context})$ puts high probability on the correct token $w_t$.\n\n**ML use case:** text classification (“positive or negative review?”), named-entity recognition (“find names and dates”), document search, and more.","1":"**2. GPT: endlessly “generating” the next token (decoder)**\n\n**Concept:** GPT (Generative Pre-trained Transformer) builds on the Transformer **decoder**. The model is not allowed to see the full sentence at once: a **mask** hides future words so that only **past tokens ($1\\ldots t-1$)** are used to predict the next token $t$. This is **autoregressive** behavior.\n\n**Intuition:** like a novelist at a typewriter: you **cannot see the next sentence in advance**; you imagine the next word from what you have already written.\n\n**Math:** to stop future information from leaking, **causal masking** sets the attention scores above the diagonal (the future positions) to $-\\infty$ before the softmax, so their attention weights become 0. 
Training minimizes $-\\log p(x_t \\mid x_{> 3\nassert answer == 32","ensemble_1":"answer = 96 // 4\nassert answer == 24","ensemble_2":"answer = 80 // 2\nassert answer == 40","ensemble_3":"answer = 512 // 4\nassert answer == 128","ensemble_4":"answer = 14 * 14\nassert answer == 196","ensemble_5":"answer = 10 * 10\nassert answer == 100","ensemble_6":"answer = 8 * 8\nassert answer == 64","ensemble_7":"answer = 32 // 2\nassert answer == 16","config_0":"assert 8 * 8 == 64","config_1":"assert 9 * 9 == 81","config_2":"assert 10 * 10 == 100","config_3":"assert 11 * 11 == 121","config_4":"assert 12 * 12 == 144","config_5":"assert 6 * 6 == 36","config_6":"assert 7 * 7 == 49","config_7":"assert 16 * 16 == 256","config_8":"assert 20 * 20 == 400","config_9":"assert 25 * 25 == 625"},"problemDifficulty":{"concept_0":"easy","concept_1":"easy","concept_2":"easy","concept_3":"easy","concept_4":"easy","concept_5":"easy","ox_0":"easy","ox_1":"easy","ox_2":"easy","ox_3":"easy","ox_4":"easy","ox_5":"easy","scenario_0":"easy","scenario_1":"easy","scenario_2":"easy","scenario_3":"easy","vote_0":"easy","vote_1":"easy","vote_2":"easy","vote_3":"easy","scenario_4":"medium","scenario_5":"medium","scenario_6":"medium","scenario_7":"medium","scenario_8":"medium","scenario_9":"medium","vote_4":"medium","vote_5":"medium","vote_6":"medium","vote_7":"medium","vote_8":"medium","vote_9":"medium","aggregate_0":"medium","aggregate_1":"medium","aggregate_2":"medium","aggregate_3":"medium","ensemble_0":"medium","ensemble_1":"medium","ensemble_2":"medium","ensemble_3":"medium","aggregate_4":"hard","aggregate_5":"hard","aggregate_6":"hard","aggregate_7":"hard","aggregate_8":"hard","aggregate_9":"hard","config_0":"hard","config_1":"hard","config_2":"hard","config_3":"hard","config_4":"hard","config_5":"hard","config_6":"hard","config_7":"hard","config_8":"hard","config_9":"hard","ensemble_4":"hard","ensemble_5":"medium","ensemble_6":"hard","ensemble_7":"hard"},"problemOrder":["concept_0","concept_1","concept
_2","concept_3","concept_4","concept_5","ox_0","ox_1","ox_2","ox_3","ox_4","ox_5","scenario_0","scenario_1","scenario_2","scenario_3","vote_0","vote_1","vote_2","vote_3","scenario_4","scenario_5","scenario_6","scenario_7","scenario_8","scenario_9","vote_4","vote_5","vote_6","vote_7","vote_8","vote_9","aggregate_0","aggregate_1","aggregate_2","aggregate_3","ensemble_0","ensemble_1","ensemble_2","ensemble_3","aggregate_4","aggregate_5","aggregate_6","aggregate_7","aggregate_8","aggregate_9","config_0","config_1","config_2","config_3","config_4","config_5","config_6","config_7","config_8","config_9","ensemble_4","ensemble_5","ensemble_6","ensemble_7"]},"advDlCh17":{"chapter":"Chapter 17","title":"Autoencoder: Compress and Reconstruct","description":"An **autoencoder** takes images or other high-dimensional data $x$, first **encodes** it into a compact **summary code $z$** (latent representation), then **decodes** $z$ into $\\hat{x}$ with the same shape as $x$. Training minimizes the **reconstruction loss** between $x$ and $\\hat{x}$. This is classic **unsupervised** learning: there are no class labels, and the data itself is the target.\n\nA narrow **bottleneck** enables **dimensionality reduction** and **anomaly detection**. Chapter 18 (**VAE**) adds a probabilistic latent model for **generation**; here we build the compress–reconstruct foundation.","sectionTitle":"Autoencoder: Compress and Reconstruct","whatIs":{"0":"**1. Symmetric encoder–decoder structure**\n\n**Concept:** The **encoder** $f_\\theta$ maps input $x$ to a latent vector $z=f_\\theta(x)$; the **decoder** $g_\\phi$ maps $z$ to $\\hat{x}=g_\\phi(z)$. The dimension of $z$ is forced to be **much smaller** than that of the original input, which creates the **bottleneck**.\n\n**Intuition:** Like a witness describing a face to a sketch artist with a few traits ($z$) instead of every pixel; the decoder redraws the face from that summary.","1":"**2. 
Loss: how close is the reconstruction?**\n\n**Concept:** For continuous real-valued features, **MSE** $\\frac{1}{d}\\sum_i (x_i-\\hat{x}_i)^2$ is typical; for $[0,1]$ grayscale images, **BCE** is also used.\n\n**Intuition:** Like overlaying the original and the copy and scoring per-pixel mismatch.","2":"**3. Why the bottleneck matters**\n\nIf $z$ were as large as $x$, the network could trivially **copy the input** (identity). A narrow bottleneck forces the model to keep only **real patterns** in $z$.\n\n**Practice (anomaly detection):** Train on **normal** images only; **high reconstruction error** on novel “abnormal” inputs flags defects.","3":"**4. Denoising autoencoder (DAE)**\n\n**Use:** Add noise or masking, then train to recover the **clean target**. The model learns more **robust** features that ignore superficial corruption.","4":"**5. What is the latent space?**\n\n**Concept:** The **latent space** is the **low-dimensional vector space where the encoder’s codes $z$ live**—not the raw pixel/input space. Each sample becomes **one point (a coordinate vector)** in this space; after training, similar inputs often land **nearby**, while different patterns map **farther apart**, so the space can acquire **geometric structure**.\n\n**In an autoencoder:** The bottleneck dimension $k$ **is** the latent space dimension. The decoder $g_\\phi$ **maps** points in this space back to high-dimensional $\\hat{x}$. (Chapter 18 **VAE** adds a **probability model** on this space for sampling and **generation**.)","5":"**6. What is PCA?**\n\n**Concept:** **PCA (Principal Component Analysis)** is a **linear** dimensionality-reduction method: it finds directions in which the data **variance** is largest, in order, and builds **orthogonal axes** called **principal components**. 
**Projecting** data onto the first few axes yields a **low-dimensional summary** that keeps as much **variance** as possible (along discarded axes you lose that variance).\n\n**Versus autoencoders:** PCA uses **linear maps only**; autoencoders with nonlinear activations can learn **richer, curved** structure. On complex data, AEs are often more flexible. (A **linear** AE trained with MSE connects to PCA intuition under certain conditions.)"},"whyImportant":{"0":"**Beyond PCA: powerful dimensionality reduction**\n\n**PCA**, as described above, is essentially **linear** dimensionality reduction. **Autoencoders**, by contrast, use **nonlinear** activations to compress and visualize high-dimensional data in 2–3D **more flexibly**.","1":"**Unsupervised feature learning**\n\nLabeling is expensive. An AE can extract features $z$ from raw data alone; a pretrained encoder is a strong starting point for **transfer learning** into classifiers.","2":"**Gateway to generative AI**\n\nBeyond compression, tweaking latent $z$ to synthesize new faces or images leads to **VAEs** and **GANs**."},"howUsed":{"0":"**Step 1: Normalize and scale**\n\nMap image pixels from $0$–$255$ to $[0,1]$ with **min–max**, or **standardize** per channel. Keep **RGB** channel order $(R,G,B)$ fixed and apply the same preprocessing every batch. Inconsistent scaling changes MSE gradients and can slow or destabilize training.","1":"**Step 2: Architecture, bottleneck $k$, and loss**\n\n**Images:** prefer a **convolutional AE (CAE)** to preserve locality. **Vectors or sequences:** use 1D convs or fully connected stacks. **$k$:** smaller $k$ → stronger compression but more detail loss; larger $k$ → easier reconstruction but weaker summarization—pick $k$ with **validation loss**. Outputs in $\\mathbb{R}$ → **MSE**; $[0,1]$-like grayscale → consider **BCE**.","2":"**Step 3: Training loop, output activation, stability**\n\nBackpropagate MSE or BCE each minibatch. 
For $[0,1]$ targets, put **sigmoid** on the decoder’s last layer. Use **Adam** (or similar), a **learning-rate schedule**, and **gradient clipping** if needed. Split **train/validation**; if validation loss worsens, try **early stopping**, **dropout/weight decay**, or **denoising AE**.","3":"**Step 4: Evaluation, plots, downstream**\n\nDo not rely on the loss curve alone—**inspect** $\\hat{x}$. **Project** latent $z$ to 2D (e.g., t-SNE) to see structure or outliers. For **anomaly detection**, train on normal data only and set a reconstruction-error **threshold** on a validation set. **Freeze or fine-tune** the encoder for **few-label classification** or **clustering**.","4":"**Uses at a glance**\n\n| Goal | Idea |\n| --- | --- |\n| **Anomaly detection** | Train on **normal** data only → flag **high reconstruction error** |\n| **Denoising** | **DAE:** corrupted input → clean target |\n| **Dim. reduction / viz** | Small $z$ or 2D projection of $z$ |\n| **Pretraining** | Reuse the encoder as a front end for **transfer** |"},"problemSolving":{"0":"Autoencoder items are easiest if you keep the one-liner **$z=f_\\theta(x)$, $\\hat{x}=g_\\phi(z)$** and the goal **reconstruction loss** matching $x$ to $\\hat{x}$. At the **bottleneck**, usually **$k \\ll d$**. For one **fully connected** layer $d \\to k$, count about **$d\\cdot k$ weights + $k$ biases**. **Flattened image length** is height×width (×3 for RGB); **patch count** (no CLS) is $(H/p)\\times(W/p)$—same line of reasoning as **ViT patch/grid** (Chapter 5 review).","1":"**Anomaly detection:** train reconstruction on **normal** data, then flag samples with **large reconstruction error**. **Denoising AE** maps corrupted inputs toward clean targets for **robust** features. Use **MSE** for real pixels; **BCE** is common for $[0,1]$ grayscale. 
When **$k/d$** or a **percent** appears, align numerator and denominator carefully.","2":"**Convolutional AE** stacks **CNN** encoders/decoders to keep **local structure** (Chapter 12). If **$k$ is too large**, the net can approach an **identity copy**; questions often test the **compression vs. expressivity** trade-off when **shrinking $k$**.","3":"Next chapter **VAE** puts a **probability model** on **latent $z$** for **generation**. If the stem says **probabilistic latent** or **sampling/generation**, think **VAE**."},"summary":"**One-liner:** The encoder squeezes data through a narrow bottleneck $z$; the decoder maps back to $\\hat{x}$; training minimizes reconstruction error so the network discovers salient structure.\n\n**Links:** Combine **Dense** and **CNN** blocks for encoder/decoder; CAEs help on complex spatial data.\n\n**Next (Chapter 18):** **VAE** places a **probability distribution** on $z$ for **generation**.","sectionLabels":{"whatIs":"What it is","whyImportant":"Why it matters","howUsed":"How it is used","summary":"Summary"},"formulaGuide":{"title":"Reading the formulas (autoencoder)","linear":"**1. Encoder and decoder in one line**\n\n$z=f_\\theta(x)$, $\\hat{x}=g_\\phi(z)$. Loss example: $\\mathcal{L}=\\|x-\\hat{x}\\|_2^2$.\n\n- **$z$:** **Latent code** at the bottleneck\n- **$\\hat{x}$:** **Reconstructed output**","xavierVariance":"**2. Bottleneck and compression**\n\nInput dim $d$, latent $k\\ll d$: compression ratio is about $k/d$.\n\n- **Smaller $k$:** stronger compression (more information loss possible)\n- **Larger $k$:** easier reconstruction, weaker summarization","heVariance":"**3. Linear AE and PCA**\n\nWith linear activations and MSE, intuition links to **principal directions** (depends on data and constraints).\n\n- **Nonlinear** activations allow richer representations","xavierUniform":"**4. 
Practical tips**\n\nMatch data scale; adjust bottleneck and depth; use **DAE** for robust features when needed."},"formulaGuideDiagramCaption":"**In one line:** $x$ is compressed to a narrow $z$, expanded to $\\hat{x}$, and compared to $x$.","formulaGuideDiagramAria":"Autoencoder diagram: input encoder bottleneck latent decoder reconstruction loss","formulaGuideDiagramFrozenHint":"Bottleneck","advDlCh17FormulaGuideLossHint":"Compare x and x̂ · reconstruction loss","advDlCh17VisualInputLabel":"Input","visual":"Animation: stages light up in order — input → encoder → bottleneck z → decoder → reconstruction x̂ → reconstruction loss.","problemSolvingLabel":"Problem-solving notes","practiceProblemsTitle":"Practice problems","practiceProblemsIntro":"All **60** bank items are **autoencoder**-themed (compress/reconstruct, bottleneck, loss, anomaly detection, CAE, image/patch geometry, linear-layer params, etc.). Each session draws **10** problems: **4 easy → 3 medium → 3 hard**, and **session types (prefix + difficulty) do not repeat** within one run.","practiceProblemsInstruction":"Choose the best option.","practiceProblemsInstructionCalc":"Choose the best option.","practiceProblemsInstructionConcept":"Choose the best option.","practiceProblemsInstructionOx":"Choose the best option.","practiceProblemsInstructionScenario":"Choose the best option.","practiceProblemsInstructionVote":"Choose the best option.","practiceProblemsInstructionAggregate":"Choose the best option.","practiceProblemsInstructionConfig":"Choose the best option.","practiceProblemsInstructionEnsemble":"Choose the best option.","advDlCh17VisualIntro":"The **encoder** compresses input **$x$** to latent **bottleneck $z$**; the **decoder** expands **$z$** to **$\\hat{x}$**. 
Lower **reconstruction loss** means outputs closer to the input.","advDlCh17VisualConceptTitle":"Concept: encoder → bottleneck → decoder","advDlCh17VisualSectionTitle":"Autoencoder: compress & reconstruct","advDlCh17VisualMetaphor":"Like summarizing text on a sticky note, then rewriting it in full.","advDlCh17VisualTopInputLabel":"Input image","advDlCh17VisualTopLatentLabel":"Latent representation","advDlCh17VisualTopReconLabel":"Reconstructed image","advDlCh17VisualEncoderLabel":"Encoder","advDlCh17VisualBottleneckLabel":"Bottleneck z","advDlCh17VisualBottleneckHint":"Where dimensionality shrinks the most","advDlCh17VisualDecoderLabel":"Decoder","advDlCh17VisualReconLabel":"Reconstruction x̂","advDlCh17VisualLossLabel":"Loss","advDlCh17VisualFlowTitle":"Training flow","advDlCh17VisualStep0":"**① Input:** feed $x$.","advDlCh17VisualStep1":"**② Encoder:** map $x$ to $z$.","advDlCh17VisualStep2":"**③ Bottleneck:** small $z$ summarizes information.","advDlCh17VisualStep3":"**④ Decoder:** map $z$ to $\\hat{x}$.","advDlCh17VisualStep4":"**⑤ Loss:** minimize mismatch between $x$ and $\\hat{x}$.","advDlCh17VisualStage0":"Input x","advDlCh17VisualStage1":"Encoder","advDlCh17VisualStage2":"Bottleneck z","advDlCh17VisualStage3":"Decoder","advDlCh17VisualStage4":"Loss","problems":{"concept_0":"Which is closest to an **autoencoder** training objective?\n① Maximize classification accuracy only\n② **Minimize reconstruction loss so inputs are reconstructed well**\n③ Maximize RL reward only\n④ Delete the dataset","concept_1":"Which best describes latent vector $z$?\n① Always same dimension as input\n② **A compressed summary representation**\n③ Only store class probabilities\n④ Store learning rate","concept_2":"Common reconstruction loss for grayscale image vectors?\n① **MSE**\n② BCE only (always)\n③ Accuracy\n④ F1","concept_3":"If bottleneck $k$ **decreases**, what is generally expected?\n① Reconstruction always becomes easier\n② More information is always kept\n③ **Stronger 
compression (tighter representation capacity)**\n④ Loss becomes meaningless","concept_4":"Which matches **Denoising AE**?\n① Set all labels to 0\n② **Train to reconstruct clean targets from corrupted inputs**\n③ Always learn identity\n④ Remove attention","concept_5":"Which application fits **large reconstruction error** after training on normal data only?\n① Always classification\n② **Anomaly detection**\n③ Only augmentation\n④ Quantization","ox_0":"Autoencoders are often composed of **encoder and decoder**.\n1 if true, 0 if false.","ox_1":"Bottleneck $z$ must always have higher dimension than input $x$.\n1 if true, 0 if false.","ox_2":"Reducing reconstruction loss is a typical training goal.\n1 if true, 0 if false.","ox_3":"A **linear** autoencoder with **linear activations** and MSE is **always** identical to a GAN.\n1 if true, 0 if false.","ox_4":"Convolutional layers can leverage spatial structure for reconstruction.\n1 if true, 0 if false.","ox_5":"Autoencoders can be trained **without class labels** using reconstruction only.\n1 if true, 0 if false.","scenario_0":"**While training an autoencoder**, GPU runs out of memory. 
What should you try **first**?\n① Reduce batch size, input size, or model width\n② Increase learning rate without bound\n③ Delete all data\n④ Remove the loss","scenario_1":"For anomaly detection?\n① **Train reconstruction on normal data and flag large errors**\n② Shuffle labels randomly\n③ Always full fine-tuning only\n④ Change optimizer only","scenario_2":"Want robust features under noisy images?\n① Fill data with zeros only\n② **Denoising AE: reconstruct clean targets from noisy inputs**\n③ Use zero layers\n④ Stop training","scenario_3":"Bottleneck too wide (near identity)?\n① **Shrink bottleneck or add regularization**\n② Quantization only\n③ Use half the data\n④ Freeze LR to 0","vote_0":"Flattened dimension $d$ of a $28\\times28$ grayscale image?","vote_1":"Flattened $d$ for $16\\times16$ grayscale?","vote_2":"Flattened $d$ for $32\\times32$ grayscale?","vote_3":"Patch count for $224\\times224$ with $16\\times16$ patches (no CLS)?","scenario_4":"Validation MSE is much higher than training MSE. Suspect?\n① **Overfitting**\n② Training too slow\n③ Batch size always 1\n④ Optimizer name","scenario_5":"Pixels are in [0,255]. 
What should you consider?\n① Always leave as-is\n② **Normalize (e.g., to [0,1])**\n③ Increase labels\n④ Remove channels","scenario_6":"For a **probabilistic** latent space and generation, what is natural next?\n① **VAE**\n② Identity only\n③ k-means only\n④ PCA only","scenario_7":"Strategy close to using $z$ as classifier input?\n① **Representation learning then linear classifier with few labels**\n② Always random guess\n③ Drop data\n④ Remove loss","scenario_8":"Why use a CNN encoder?\n① **Use local patterns and spatial structure**\n② Always zero parameters\n③ RNN only\n④ Forbid padding","scenario_9":"Main purpose of noise in DAE?\n① **Learn robust features**\n② Always zero accuracy\n③ Delete data\n④ Stop training","vote_4":"Flattened $d$ for $32\\times16$ grayscale?","vote_5":"Flattened $d$ for $32\\times32$ **RGB (3 channels)**?","vote_6":"Flattened $d$ for width 16, height 8 grayscale?","vote_7":"Weight count (no bias) for one FC layer $d_{in}=100$, $d_{out}=20$?","vote_8":"Flatten length of a $6\\times6\\times2$ tensor?","vote_9":"With $d=1000$, $k=500$, express **$k/d$ as an integer percent** (e.g., 50% → **50**).","aggregate_0":"In an **AE experiment**, three bottleneck $k$ candidates were logged as **[3,4,5]**. 
What is their **sum**?","aggregate_1":"Same setup: candidates **[2,6,7]** — **sum**?","aggregate_2":"Bottleneck candidate **6** chosen three times ($6+6+6$) — **sum**?","aggregate_3":"Candidates **[2,3,6]** — **sum**?","ensemble_0":"**Image input** $224\\times224$ split into $16\\times16$ patches — **without CLS**, how many patch tokens?","ensemble_1":"**Patch grid**: square grid with **8** patches per side — total patches?","ensemble_2":"One **linear encoder** layer with $d_{in}=20$, $d_{out}=20$ — **weight count** (no bias)?","ensemble_3":"$$96\\times96$ input, patches $8\\times8$, **no CLS** — how many patches?","aggregate_4":"Multiple trials logged bottleneck candidates **[7,7,7,7]** — **sum**?","aggregate_5":"Candidates **[11,11,11]** — **sum**?","aggregate_6":"Candidate **3** logged **7** times — **sum**? ($3\\times7$)","aggregate_7":"Candidates **[4,5,10]** — **sum**?","aggregate_8":"Log **[3,4,5,6,6]** — **sum**?","aggregate_9":"Same candidate **5** added **6** times — **sum**? ($5\\times6$)","config_0":"**Image→patch grid**: square grid with **8** patches along each side — total cells?","config_1":"**9** patches per side — total?","config_2":"**10** patches per side — total?","config_3":"**11** patches per side — total?","config_4":"**12** patches per side — total?","config_5":"**6** patches per side — total?","config_6":"**7** patches per side — total?","config_7":"**16** patches per side — total?","config_8":"**20** patches per side — total?","config_9":"**25** patches per side — total?","ensemble_4":"Flattened $d$ for $30\\times30$ grayscale?","ensemble_5":"**196** patch tokens + **1** CLS — sequence length?","ensemble_6":"Linear encoder weights only (no bias): $d=16$, $k=2$ — how many weights?","ensemble_7":"Flatten $32\\times32$ grayscale to one vector — length?"},"problemSolutions":{"concept_0":"**Example:** MNIST reconstruction: minimize MSE.\n\n**Steps:** The goal is to shrink the gap between $x$ and $\\hat{x}$ → **2**.","concept_1":"**Example:** 
$z$ is a low-dimensional summary.\n\n**Steps:** **2**.","concept_2":"**Example:** MSE for real-valued pixels.\n\n**Steps:** **1**.","concept_3":"**Example:** Smaller $k$ → stronger compression.\n\n**Steps:** **3**.","concept_4":"**Example:** Noisy input → clean target.\n\n**Steps:** **2**.","concept_5":"**Example:** Train on normal data only → large error flags anomalies.\n\n**Steps:** **2**.","ox_0":"**Example:** Usually encoder–decoder.\n\n**Steps:** True **1**.","ox_1":"**Example:** Bottleneck is usually smaller than input.\n\n**Steps:** False **0**.","ox_2":"**Example:** Typical objective.\n\n**Steps:** True **1**.","ox_3":"**Example:** GAN has a different objective and architecture.\n\n**Steps:** False **0**.","ox_4":"**Example:** Conv AE uses spatial structure.\n\n**Steps:** True **1**.","ox_5":"**Example:** Unsupervised reconstruction is possible.\n\n**Steps:** True **1**.","scenario_0":"**Steps:** OOM → shrink batch/model first → **1**.","scenario_1":"**Steps:** Normal training + error threshold → **1**.","scenario_2":"**Steps:** DAE fits noisy→clean robust features → **2**.","scenario_3":"**Steps:** Narrow bottleneck / stronger regularization → **1**.","vote_0":"**Calc:** $28\\times28=784$. **Answer: 784**.","vote_1":"**Calc:** $16\\times16=256$. **Answer: 256**.","vote_2":"**Calc:** $32\\times32=1024$. **Answer: 1024**.","vote_3":"**Calc:** $(224/16)^2=14^2=196$. **Answer: 196**.","scenario_4":"**Steps:** Much higher val MSE → suspect overfitting → **1**.","scenario_5":"**Steps:** Scale normalization → **2**.","scenario_6":"**Steps:** Probabilistic latent → **VAE** → **1**.","scenario_7":"**Steps:** Representation + few labels → **1**.","scenario_8":"**Steps:** CNN uses spatial structure → **1**.","scenario_9":"**Steps:** DAE aims for robust features → **1**.","vote_4":"**Calc:** $32\\times16=512$. **Answer: 512**.","vote_5":"**Calc:** $32\\times32\\times3=3072$. **Answer: 3072**.","vote_6":"**Calc:** $16\\times8=128$. 
**Answer: 128**.","vote_7":"**Calc:** $100\\times20=2000$. **Answer: 2000**.","vote_8":"**Calc:** $6\\times6\\times2=72$. **Answer: 72**.","vote_9":"**Calc:** $k/d=500/1000=0.5$ → percent **50**.","aggregate_0":"**Example:** $3+4+5=12$. **Answer: 12**.","aggregate_1":"**Example:** $2+6+7=15$. **Answer: 15**.","aggregate_2":"**Example:** $6+6+6=18$. **Answer: 18**.","aggregate_3":"**Example:** $2+3+6=11$. **Answer: 11**.","ensemble_0":"**Calc:** $(224/16)^2=196$. **Answer: 196**.","ensemble_1":"**Calc:** $8\\times8=64$. **Answer: 64**.","ensemble_2":"**Calc:** Weights only $20\\times20=400$. **Answer: 400**.","ensemble_3":"**Calc:** $(96/8)^2=144$. **Answer: 144**.","aggregate_4":"**Example:** $7\\times4=28$. **Answer: 28**.","aggregate_5":"**Example:** $11\\times3=33$. **Answer: 33**.","aggregate_6":"**Example:** $3\\times7=21$. **Answer: 21**.","aggregate_7":"**Example:** $4+5+10=19$. **Answer: 19**.","aggregate_8":"**Example:** $3+4+5+6+6=24$. **Answer: 24**.","aggregate_9":"**Example:** $5\\times6=30$. **Answer: 30**.","config_0":"**Calc:** $8\\times8=64$. **Answer: 64**.","config_1":"**Calc:** $9\\times9=81$. **Answer: 81**.","config_2":"**Calc:** $10\\times10=100$. **Answer: 100**.","config_3":"**Calc:** $11\\times11=121$. **Answer: 121**.","config_4":"**Calc:** $12\\times12=144$. **Answer: 144**.","config_5":"**Calc:** $6\\times6=36$. **Answer: 36**.","config_6":"**Calc:** $7\\times7=49$. **Answer: 49**.","config_7":"**Calc:** $16\\times16=256$. **Answer: 256**.","config_8":"**Calc:** $20\\times20=400$. **Answer: 400**.","config_9":"**Calc:** $25\\times25=625$. **Answer: 625**.","ensemble_4":"**Calc:** $30\\times30=900$. **Answer: 900**.","ensemble_5":"**Calc:** $196+1=197$. **Answer: 197**.","ensemble_6":"**Calc:** Weights only $16\\times2=32$. **Answer: 32**.","ensemble_7":"**Calc:** $32\\times32=1024$. 
**Answer: 1024**."},"problemAnswers":{"concept_0":2,"concept_1":2,"concept_2":1,"concept_3":3,"concept_4":2,"concept_5":2,"ox_0":1,"ox_1":0,"ox_2":1,"ox_3":0,"ox_4":1,"ox_5":1,"scenario_0":1,"scenario_1":1,"scenario_2":2,"scenario_3":1,"vote_0":784,"vote_1":256,"vote_2":1024,"vote_3":196,"scenario_4":1,"scenario_5":2,"scenario_6":1,"scenario_7":1,"scenario_8":1,"scenario_9":1,"vote_4":512,"vote_5":3072,"vote_6":128,"vote_7":2000,"vote_8":72,"vote_9":50,"aggregate_0":12,"aggregate_1":15,"aggregate_2":18,"aggregate_3":11,"ensemble_0":196,"ensemble_1":64,"ensemble_2":400,"ensemble_3":144,"aggregate_4":28,"aggregate_5":33,"aggregate_6":21,"aggregate_7":19,"aggregate_8":24,"aggregate_9":30,"config_0":64,"config_1":81,"config_2":100,"config_3":121,"config_4":144,"config_5":36,"config_6":49,"config_7":256,"config_8":400,"config_9":625,"ensemble_4":900,"ensemble_5":197,"ensemble_6":32,"ensemble_7":1024},"problemTestCodes":{"concept_0":"answer = 2\nassert answer == 2","concept_1":"answer = 2\nassert answer == 2","concept_2":"answer = 1\nassert answer == 1","concept_3":"answer = 3\nassert answer == 3","concept_4":"answer = 2\nassert answer == 2","concept_5":"answer = 2\nassert answer == 2","ox_0":"answer = 1\nassert answer == 1","ox_1":"answer = 0\nassert answer == 0","ox_2":"answer = 1\nassert answer == 1","ox_3":"answer = 0\nassert answer == 0","ox_4":"answer = 1\nassert answer == 1","ox_5":"answer = 1\nassert answer == 1","scenario_0":"answer = 1\nassert answer == 1","scenario_1":"answer = 1\nassert answer == 1","scenario_2":"answer = 2\nassert answer == 2","scenario_3":"answer = 1\nassert answer == 1","vote_0":"answer = 784\nassert answer == 784","vote_1":"answer = 256\nassert answer == 256","vote_2":"answer = 1024\nassert answer == 1024","vote_3":"answer = 196\nassert answer == 196","scenario_4":"answer = 1\nassert answer == 1","scenario_5":"answer = 2\nassert answer == 2","scenario_6":"answer = 1\nassert answer == 1","scenario_7":"answer = 1\nassert answer == 
1","scenario_8":"answer = 1\nassert answer == 1","scenario_9":"answer = 1\nassert answer == 1","vote_4":"answer = 512\nassert answer == 512","vote_5":"answer = 3072\nassert answer == 3072","vote_6":"answer = 128\nassert answer == 128","vote_7":"answer = 2000\nassert answer == 2000","vote_8":"answer = 72\nassert answer == 72","vote_9":"answer = 50\nassert answer == 50","aggregate_0":"values = [3, 4, 5]\nassert sum(values) == 12","aggregate_1":"values = [2, 6, 7]\nassert sum(values) == 15","aggregate_2":"values = [6, 6, 6]\nassert sum(values) == 18","aggregate_3":"values = [2, 3, 6]\nassert sum(values) == 11","ensemble_0":"answer = 196\nassert answer == 196","ensemble_1":"answer = 64\nassert answer == 64","ensemble_2":"answer = 400\nassert answer == 400","ensemble_3":"answer = 144\nassert answer == 144","aggregate_4":"values = [7, 7, 7, 7]\nassert sum(values) == 28","aggregate_5":"values = [11, 11, 11]\nassert sum(values) == 33","aggregate_6":"values = [3, 3, 3, 3, 3, 3, 3]\nassert sum(values) == 21","aggregate_7":"values = [4, 5, 10]\nassert sum(values) == 19","aggregate_8":"values = [3, 4, 5, 6, 6]\nassert sum(values) == 24","aggregate_9":"values = [5, 5, 5, 5, 5, 5]\nassert sum(values) == 30","config_0":"assert 8 * 8 == 64","config_1":"assert 9 * 9 == 81","config_2":"assert 10 * 10 == 100","config_3":"assert 11 * 11 == 121","config_4":"assert 12 * 12 == 144","config_5":"assert 6 * 6 == 36","config_6":"assert 7 * 7 == 49","config_7":"assert 16 * 16 == 256","config_8":"assert 20 * 20 == 400","config_9":"assert 25 * 25 == 625","ensemble_4":"answer = 900\nassert answer == 900","ensemble_5":"answer = 197\nassert answer == 197","ensemble_6":"answer = 32\nassert answer == 32","ensemble_7":"answer = 1024\nassert answer == 
1024"},"problemDifficulty":{"concept_0":"easy","concept_1":"easy","concept_2":"easy","concept_3":"easy","concept_4":"easy","concept_5":"easy","ox_0":"easy","ox_1":"easy","ox_2":"easy","ox_3":"easy","ox_4":"easy","ox_5":"easy","scenario_0":"easy","scenario_1":"easy","scenario_2":"easy","scenario_3":"easy","vote_0":"easy","vote_1":"easy","vote_2":"easy","vote_3":"easy","scenario_4":"medium","scenario_5":"medium","scenario_6":"medium","scenario_7":"medium","scenario_8":"medium","scenario_9":"medium","vote_4":"medium","vote_5":"medium","vote_6":"medium","vote_7":"medium","vote_8":"medium","vote_9":"medium","aggregate_0":"medium","aggregate_1":"medium","aggregate_2":"medium","aggregate_3":"medium","ensemble_0":"medium","ensemble_1":"medium","ensemble_2":"medium","ensemble_3":"medium","aggregate_4":"hard","aggregate_5":"hard","aggregate_6":"hard","aggregate_7":"hard","aggregate_8":"hard","aggregate_9":"hard","config_0":"hard","config_1":"hard","config_2":"hard","config_3":"hard","config_4":"hard","config_5":"hard","config_6":"hard","config_7":"hard","config_8":"hard","config_9":"hard","ensemble_4":"hard","ensemble_5":"medium","ensemble_6":"hard","ensemble_7":"hard"},"problemOrder":["concept_0","concept_1","concept_2","concept_3","concept_4","concept_5","ox_0","ox_1","ox_2","ox_3","ox_4","ox_5","scenario_0","scenario_1","scenario_2","scenario_3","vote_0","vote_1","vote_2","vote_3","scenario_4","scenario_5","scenario_6","scenario_7","scenario_8","scenario_9","vote_4","vote_5","vote_6","vote_7","vote_8","vote_9","aggregate_0","aggregate_1","aggregate_2","aggregate_3","ensemble_0","ensemble_1","ensemble_2","ensemble_3","aggregate_4","aggregate_5","aggregate_6","aggregate_7","aggregate_8","aggregate_9","config_0","config_1","config_2","config_3","config_4","config_5","config_6","config_7","config_8","config_9","ensemble_4","ensemble_5","ensemble_6","ensemble_7"]},"paperReviewInfluenceKernelVonMises":{"chapter":"Chapter PR-01","title":"Kernel von Mises Formula of the Influence 
Function","description":"This paper replaces the old bottleneck of deriving influence functions (IF) by hand for every model with a data-driven procedure built on kernels and spectral expansions. In particular, it eases the numerical ill-conditioning that often arises with point-mass perturbations, and through a regularized estimator it aims for both **practical computability** and **theoretical consistency**.","sectionTitle":"Learn / Paper Review / Theory & Math / CPAL2026","viewOriginalPdf":"View original paper (PDF)","coreFlow":{"0":"**[Abstract & Intro] Three-sentence summary + problem**\n\n① Classical influence-function computation forces a fresh derivation whenever the model changes, so automation is difficult.\n② The traditional approach of poking the distribution with a point mass makes the response sharp and prone to numerical instability.\n③ This paper splits the perturbation into several smooth patterns (modes), computes the influence along each, and recombines them, so that a computer rather than a hand derivation can estimate the IF more stably.\n\n**Everyday analogy:** Imagine a complex hot-pot recipe where you want to know how one piece of firm tofu changes the broth. The old style jabs the pot like a needle, so readings swing wildly. This paper nudges gently in several directions like soft ripples and aggregates the responses, which works more like a stable taste meter.","1":"$33","2":"**[Proposed method: core idea]**\n\nThe paper avoids direct point-mass perturbation; instead, it perturbs the distribution along eigenfunction directions via paths $P_t^j$ and uses the pathwise derivatives of $\\theta$ to reconstruct the IF. The centerpiece is **Theorem 3.3 (Spectral von Mises formula)**, which expresses the IF as a sum of per-mode contributions. 
A regularization strength $\\lambda$ suppresses blow-up of small-eigenvalue modes and improves computational stability.","3":"$34","4":"$35","5":"$36"},"mainMethodFiveSteps":{"0":"**1) Core proposal (concept)**\n\nRather than jabbing with a point mass, perturb the distribution smoothly along kernel eigenfunction axes and compose pathwise derivatives to obtain the IF.","1":"**2) Everyday analogy (intuition)**\n\nPlucking one guitar string hard adds harsh noise; blending several strings yields a steadier chord. Likewise, one sharp stimulus is noisier than synthesizing multiple modes for the IF.","2":"**3) Formula deep dive (math)**\n\nThe weight $\\frac{1}{1+2\\lambda/\\sigma_j}$ damps blow-up of small-$\\sigma_j$ modes, lowering variance; a very large $\\lambda$ can inflate bias.","3":"**4) Math to code**\n\nThe code below evaluates $\\psi_{P,\\lambda}(x)$ using $\\sigma_j$, pathwise-derivative approximations, and $e_j(x)$. It is a compact replay of the paper pipeline (mode decomposition → per-mode sensitivity → shrunk weighted sum) with symbols mapped 1:1 to variable names.","4":"**5) Applied AI uses**\n\n- Detect training samples with outsized impact on predictions\n- Prioritize review of outliers and mislabeled data\n- Compare sensitivity before and after model updates"},"mathToCodeTitle":"Paper algorithm in code (NumPy)","mathToCodeCode":"import numpy as np\n\n# Eigenvalues sigma_j from the paper\nsigma = np.array([8, 4, 2, 1], dtype=float)\n\n# Pathwise derivative approximations [d/dt theta(P_t^j)]_{t=0}\ndtheta = np.array([6, 4, 2, 2], dtype=float)\n\n# e_j(x): eigenfunction values at a fixed x\ne_x = np.array([3, 2, 1, 1], dtype=float)\n\n# Regularization hyperparameter lambda\nlambda_reg = 2.0\n\n# Shrinkage 1 / (1 + 2*lambda/sigma_j)\nshrink = 1.0 / (1.0 + 2.0 * lambda_reg / sigma)\n\n# Per-mode contribution = shrink_j * dtheta_j * e_j(x)\nterm = shrink * dtheta * e_x\n\n# Low-rank IF with r=4\npsi_hat = int(np.round(np.sum(term)))\n\nprint('shrink =', 
shrink.astype(int))\nprint('term =', term.astype(int))\nprint('psi_hat =', psi_hat)","mathToCodeOutput":"shrink = [0 0 0 0]\nterm = [12  4  0  0]\npsi_hat = 17","visualPlanTitle":"Diagram: a stark contrast between limitations and proposal","visualPlan":"The **left block** highlights the **classical failure mode**: point-mass spikes make sensitivity swing wildly. The **right pipeline** shows the **paper’s fix**: spectral modes plus **regularized weighting** rebuild a **smooth, noise-suppressed** influence curve, so the gap is hard to miss.","visualLimitBannerTitle":"Classical limitation","visualLimitBannerDetail":"Point-mass · spikes → volatile, ill-conditioned sensitivity","visualProposalBannerTitle":"Paper’s proposal","visualProposalBannerDetail":"Spectral split → regularized reconstruction → stable IF","visualStep1Heading":"1) Point-mass perturbation","visualStep1Body":"Large sensitivity swings from spikes","visualStep2Heading":"2) Spectral decomposition","visualStep2Body1":"Per-mode $(\\sigma_j, e_j)$","visualStep2Body2":"Small $\\sigma_j$ modes are down-weighted","visualStep3Heading":"3) Regularized reconstruction","visualStep3Body1":"Weighted sum restores a smooth IF","visualStep3Body2":"$\\frac{1}{1+2\\lambda/\\sigma_j}$ suppresses noisy modes","visualVsLabel":"VS","visualVsAria":"Separator between the classical limitation block and the proposed pipeline","summary":"The paper redefines IF estimation away from model-by-model hand derivation toward a kernel–spectral, data-driven procedure, which is an important shift. In practice you can **more stably** track which samples move predictions, tying directly to data quality checks, outlier analysis, and model debugging. 
Remaining work includes balancing bias and variance via regularization strength, sharper **convergence-rate** theory, and **fully automating** pathwise derivatives; theory and systems still have room to grow.","problemSolvingLabel":"Problem-Solving Guide","problemSolving":{"0":"| Type | How to solve (paper symbols → answer) |\n| :--- | :--- |\n| Symbols | $\\lambda$ regularization strength, $\\sigma_j$ eigenvalue, $e_j(x)$ eigenfunction value |\n| Count | Five eigenfunctions ⇒ five terms in the sum |\n| Shrinkage | $\\sigma_j=4$, $\\lambda=2$ gives denominator $1+2\\lambda/\\sigma_j=2$ |\n| Toy sum | Contributions [8,4,2,2] sum to 16 |\n| Trend | Larger $\\lambda$ usually shrinks small-$\\sigma_j$ contributions |\n| Code map | $\\lambda \\leftrightarrow$ lambda_reg, $\\sigma_j \\leftrightarrow$ sigma |","1":"**Example A**\n\nPrompt: With $\\sigma_j=4$ and $\\lambda=2$, compute the shrinkage denominator $1+2\\lambda/\\sigma_j$.\n\nWork: $1+2\\times2/4=2$\n\nAnswer: 2","2":"**Example B**\n\nPrompt: Per-mode contributions are [6, 4, 2, 4]. Total IF approximation?\n\nWork: $6+4+2+4=16$\n\nAnswer: 16"},"practiceProblemsTitle":"Practice Problems","practiceProblemsIntro":"Below are 10 problems drawn at random from a pool of 60. Difficulty order is easy 4 · medium 3 · hard 3; enter integers only.","practiceProblemsInstruction":"There is a blank line between the instruction and the actual question. Answers must be integers.","problems":{"q00":"Instruction: choose the main contribution.\n\nQuestion: What is the key contribution? ① stronger point-mass perturbation ② kernel spectral IF estimator ③ CNN classifier","q01":"Instruction: choose symbol meaning.\n\nQuestion: In the formula, $\\lambda$ means? ① regularization strength ② sample count ③ class count","q02":"Instruction: choose symbol meaning.\n\nQuestion: In the formula, $\\sigma_j$ is? ① eigenvalue ② batch size ③ layer count","q03":"Instruction: choose symbol meaning.\n\nQuestion: $e_j(x)$ is? 
① eigenfunction value at x ② loss function ③ optimizer","q04":"Instruction: true/false.\n\nQuestion: Point-mass perturbation can be numerically unstable. true=1 false=0","q05":"Instruction: true/false.\n\nQuestion: Larger $\\lambda$ usually decreases small-$\\sigma_j$ mode contributions. true=1 false=0","q06":"Instruction: true/false.\n\nQuestion: The method reconstructs IF by summing mode-wise contributions. true=1 false=0","q07":"Instruction: count terms.\n\nQuestion: If $r=6$, how many terms are in the sum?","q08":"Instruction: compute denominator.\n\nQuestion: with $\\lambda=2$, $\\sigma_j=4$, compute $1+2\\lambda/\\sigma_j$.","q09":"Instruction: compute denominator.\n\nQuestion: with $\\lambda=3$, $\\sigma_j=3$, compute $1+2\\lambda/\\sigma_j$.","q10":"Instruction: compute denominator.\n\nQuestion: with $\\lambda=1$, $\\sigma_j=2$, compute $1+2\\lambda/\\sigma_j$.","q11":"Instruction: compute denominator.\n\nQuestion: with $\\lambda=4$, $\\sigma_j=8$, compute $1+2\\lambda/\\sigma_j$.","q12":"Instruction: sum contributions.\n\nQuestion: Sum [5,4,3].","q13":"Instruction: sum contributions.\n\nQuestion: Sum [6,2,2,2].","q14":"Instruction: sum contributions.\n\nQuestion: Sum [9,1,3,3].","q15":"Instruction: compute decrease.\n\nQuestion: before=20, after=16. decrease?","q16":"Instruction: compute session size.\n\nQuestion: easy/medium/hard ratio is 4/3/3. total per session?","q17":"Instruction: count all.\n\nQuestion: easy=20, medium=20, hard=20. total?","q18":"Instruction: count terms.\n\nQuestion: 4 eigenfunctions with 1 contribution each. total?","q19":"Instruction: compute decrease.\n\nQuestion: value dropped from 5 to 2. 
decrease?","q20":"Instruction: toy sum.\n\nQuestion: Sum [8,4,2,2].","q21":"Instruction: toy sum.\n\nQuestion: Sum [10,3,1,2].","q22":"Instruction: toy sum.\n\nQuestion: Sum [7,5,4].","q23":"Instruction: toy sum.\n\nQuestion: Sum [12,6,2].","q24":"Instruction: toy sum.\n\nQuestion: Sum [4,4,4,4].","q25":"Instruction: toy sum.\n\nQuestion: Sum [3,3,5,5].","q26":"Instruction: toy sum.\n\nQuestion: Sum [15,1].","q27":"Instruction: toy sum.\n\nQuestion: Sum [11,2,3].","q28":"Instruction: toy sum.\n\nQuestion: Sum [6,6,2,2].","q29":"Instruction: toy sum.\n\nQuestion: Sum [14,2].","q30":"Instruction: count terms.\n\nQuestion: if $r=10$, number of terms?","q31":"Instruction: count terms.\n\nQuestion: if $r=12$, number of terms?","q32":"Instruction: count terms.\n\nQuestion: if $r=15$, number of terms?","q33":"Instruction: count terms.\n\nQuestion: if $r=18$, number of terms?","q34":"Instruction: denominator.\n\nQuestion: $\\lambda=6,\\sigma_j=6$, compute $1+2\\lambda/\\sigma_j$.","q35":"Instruction: denominator.\n\nQuestion: $\\lambda=8,\\sigma_j=4$, compute $1+2\\lambda/\\sigma_j$.","q36":"Instruction: denominator.\n\nQuestion: $\\lambda=5,\\sigma_j=10$, compute $1+2\\lambda/\\sigma_j$.","q37":"Instruction: denominator.\n\nQuestion: $\\lambda=9,\\sigma_j=9$, compute $1+2\\lambda/\\sigma_j$.","q38":"Instruction: compare estimates.\n\nQuestion: before=28, after=20. decrease?","q39":"Instruction: compare estimates.\n\nQuestion: before=35, after=27. 
decrease?","q40":"Instruction: hard sum.\n\nQuestion: Sum [20,10,6,4].","q41":"Instruction: hard sum.\n\nQuestion: Sum [18,12,8,2].","q42":"Instruction: hard sum.\n\nQuestion: Sum [16,9,7,4].","q43":"Instruction: hard sum.\n\nQuestion: Sum [22,8,5,1].","q44":"Instruction: hard sum.\n\nQuestion: Sum [14,14,6,2].","q45":"Instruction: hard sum.\n\nQuestion: Sum [25,5,4,2].","q46":"Instruction: hard sum.\n\nQuestion: Sum [30,4,1,1].","q47":"Instruction: hard sum.\n\nQuestion: Sum [19,9,5,3].","q48":"Instruction: hard sum.\n\nQuestion: Sum [17,11,6,2].","q49":"Instruction: hard sum.\n\nQuestion: Sum [24,7,3,2].","q50":"Instruction: hard denominator.\n\nQuestion: $\\lambda=10,\\sigma_j=5$, compute $1+2\\lambda/\\sigma_j$.","q51":"Instruction: hard denominator.\n\nQuestion: $\\lambda=12,\\sigma_j=6$, compute $1+2\\lambda/\\sigma_j$.","q52":"Instruction: hard denominator.\n\nQuestion: $\\lambda=14,\\sigma_j=7$, compute $1+2\\lambda/\\sigma_j$.","q53":"Instruction: hard denominator.\n\nQuestion: $\\lambda=16,\\sigma_j=8$, compute $1+2\\lambda/\\sigma_j$.","q54":"Instruction: hard denominator.\n\nQuestion: $\\lambda=18,\\sigma_j=9$, compute $1+2\\lambda/\\sigma_j$.","q55":"Instruction: hard denominator.\n\nQuestion: $\\lambda=20,\\sigma_j=10$, compute $1+2\\lambda/\\sigma_j$.","q56":"Instruction: set size.\n\nQuestion: 60 total, 10 used in one session. remaining?","q57":"Instruction: set size.\n\nQuestion: easy 20, use 4 in one session. remaining easy?","q58":"Instruction: set size.\n\nQuestion: medium 20, use 3 in one session. remaining medium?","q59":"Instruction: set size.\n\nQuestion: hard 20, use 3 in one session. 
remaining hard?"},"problemAnswers":{"q00":2,"q01":1,"q02":1,"q03":1,"q04":1,"q05":1,"q06":1,"q07":6,"q08":2,"q09":3,"q10":2,"q11":2,"q12":12,"q13":12,"q14":16,"q15":4,"q16":10,"q17":60,"q18":4,"q19":3,"q20":16,"q21":16,"q22":16,"q23":20,"q24":16,"q25":16,"q26":16,"q27":16,"q28":16,"q29":16,"q30":10,"q31":12,"q32":15,"q33":18,"q34":3,"q35":5,"q36":2,"q37":3,"q38":8,"q39":8,"q40":40,"q41":40,"q42":36,"q43":36,"q44":36,"q45":36,"q46":36,"q47":36,"q48":36,"q49":36,"q50":5,"q51":5,"q52":5,"q53":5,"q54":5,"q55":5,"q56":50,"q57":16,"q58":17,"q59":17},"problemSolutions":{"q00":"Correct choice is 2.","q01":"Correct choice is 1.","q02":"Correct choice is 1.","q03":"Correct choice is 1.","q04":"True -> 1.","q05":"True -> 1.","q06":"True -> 1.","q07":"Term count is 6.","q08":"$1+2\\times2/4=2$.","q09":"$1+2\\times3/3=3$.","q10":"$1+2\\times1/2=2$.","q11":"$1+2\\times4/8=2$.","q12":"$5+4+3=12$.","q13":"$6+2+2+2=12$.","q14":"$9+1+3+3=16$.","q15":"$20-16=4$.","q16":"$4+3+3=10$.","q17":"$20+20+20=60$.","q18":"$4\\times1=4$.","q19":"$5-2=3$.","q20":"$8+4+2+2=16$.","q21":"$10+3+1+2=16$.","q22":"$7+5+4=16$.","q23":"$12+6+2=20$.","q24":"$4+4+4+4=16$.","q25":"$3+3+5+5=16$.","q26":"$15+1=16$.","q27":"$11+2+3=16$.","q28":"$6+6+2+2=16$.","q29":"$14+2=16$.","q30":"Answer 10.","q31":"Answer 12.","q32":"Answer 15.","q33":"Answer 18.","q34":"$1+2\\times6/6=3$.","q35":"$1+2\\times8/4=5$.","q36":"$1+2\\times5/10=2$.","q37":"$1+2\\times9/9=3$.","q38":"$28-20=8$.","q39":"$35-27=8$.","q40":"Sum is 40.","q41":"Sum is 40.","q42":"Sum is 36.","q43":"Sum is 36.","q44":"Sum is 36.","q45":"Sum is 36.","q46":"Sum is 36.","q47":"Sum is 36.","q48":"Sum is 36.","q49":"Sum is 36.","q50":"Value is 5.","q51":"Value is 5.","q52":"Value is 5.","q53":"Value is 5.","q54":"Value is 5.","q55":"Value is 5.","q56":"$60-10=50$.","q57":"$20-4=16$.","q58":"$20-3=17$.","q59":"$20-3=17$."},"problemTestCodes":{"q00":"answer = 2\nassert answer == 2","q01":"answer = 1\nassert answer == 
1","q02":"answer = 1\nassert answer == 1","q03":"answer = 1\nassert answer == 1","q04":"answer = 1\nassert answer == 1","q05":"answer = 1\nassert answer == 1","q06":"answer = 1\nassert answer == 1","q07":"assert 6 == 6","q08":"assert 1 + 2 * 2 // 4 == 2","q09":"assert 1 + 2 * 3 // 3 == 3","q10":"assert 1 + 2 * 1 // 2 == 2","q11":"assert 1 + 2 * 4 // 8 == 2","q12":"values = [5,4,3]\nassert sum(values) == 12","q13":"values = [6,2,2,2]\nassert sum(values) == 12","q14":"values = [9,1,3,3]\nassert sum(values) == 16","q15":"before = 20\nafter = 16\nassert before - after == 4","q16":"assert 4 + 3 + 3 == 10","q17":"assert 20 + 20 + 20 == 60","q18":"assert 4 * 1 == 4","q19":"assert 5 - 2 == 3","q20":"values = [8,4,2,2]\nassert sum(values) == 16","q21":"values = [10,3,1,2]\nassert sum(values) == 16","q22":"values = [7,5,4]\nassert sum(values) == 16","q23":"values = [12,6,2]\nassert sum(values) == 20","q24":"values = [4,4,4,4]\nassert sum(values) == 16","q25":"values = [3,3,5,5]\nassert sum(values) == 16","q26":"values = [15,1]\nassert sum(values) == 16","q27":"values = [11,2,3]\nassert sum(values) == 16","q28":"values = [6,6,2,2]\nassert sum(values) == 16","q29":"values = [14,2]\nassert sum(values) == 16","q30":"answer = 10\nassert answer == 10","q31":"answer = 12\nassert answer == 12","q32":"answer = 15\nassert answer == 15","q33":"answer = 18\nassert answer == 18","q34":"assert 1 + 2 * 6 // 6 == 3","q35":"assert 1 + 2 * 8 // 4 == 5","q36":"assert 1 + 2 * 5 // 10 == 2","q37":"assert 1 + 2 * 9 // 9 == 3","q38":"assert 28 - 20 == 8","q39":"assert 35 - 27 == 8","q40":"values = [20,10,6,4]\nassert sum(values) == 40","q41":"values = [18,12,8,2]\nassert sum(values) == 40","q42":"values = [16,9,7,4]\nassert sum(values) == 36","q43":"values = [22,8,5,1]\nassert sum(values) == 36","q44":"values = [14,14,6,2]\nassert sum(values) == 36","q45":"values = [25,5,4,2]\nassert sum(values) == 36","q46":"values = [30,4,1,1]\nassert sum(values) == 36","q47":"values = [19,9,5,3]\nassert 
sum(values) == 36","q48":"values = [17,11,6,2]\nassert sum(values) == 36","q49":"values = [24,7,3,2]\nassert sum(values) == 36","q50":"assert 1 + 2 * 10 // 5 == 5","q51":"assert 1 + 2 * 12 // 6 == 5","q52":"assert 1 + 2 * 14 // 7 == 5","q53":"assert 1 + 2 * 16 // 8 == 5","q54":"assert 1 + 2 * 18 // 9 == 5","q55":"assert 1 + 2 * 20 // 10 == 5","q56":"assert 60 - 10 == 50","q57":"assert 20 - 4 == 16","q58":"assert 20 - 3 == 17","q59":"assert 20 - 3 == 17"},"problemDifficulty":{"q00":"easy","q01":"easy","q02":"easy","q03":"easy","q04":"easy","q05":"easy","q06":"easy","q07":"easy","q08":"easy","q09":"easy","q10":"easy","q11":"easy","q12":"easy","q13":"easy","q14":"easy","q15":"easy","q16":"easy","q17":"easy","q18":"easy","q19":"easy","q20":"medium","q21":"medium","q22":"medium","q23":"medium","q24":"medium","q25":"medium","q26":"medium","q27":"medium","q28":"medium","q29":"medium","q30":"medium","q31":"medium","q32":"medium","q33":"medium","q34":"medium","q35":"medium","q36":"medium","q37":"medium","q38":"medium","q39":"medium","q40":"hard","q41":"hard","q42":"hard","q43":"hard","q44":"hard","q45":"hard","q46":"hard","q47":"hard","q48":"hard","q49":"hard","q50":"hard","q51":"hard","q52":"hard","q53":"hard","q54":"hard","q55":"hard","q56":"hard","q57":"hard","q58":"hard","q59":"hard"},"problemOrder":["q00","q01","q02","q03","q04","q05","q06","q07","q08","q09","q10","q11","q12","q13","q14","q15","q16","q17","q18","q19","q20","q21","q22","q23","q24","q25","q26","q27","q28","q29","q30","q31","q32","q33","q34","q35","q36","q37","q38","q39","q40","q41","q42","q43","q44","q45","q46","q47","q48","q49","q50","q51","q52","q53","q54","q55","q56","q57","q58","q59"]},"paperReviewCurseDepthLlm":{"chapter":"Chapter PR-02","title":"The Curse of Depth in Large Language Models","description":"This review explains why simply adding more layers does not always buy more representation power in large language models. 
The paper analyzes variance accumulation in Pre-LN transformers and shows that a single depth-aware rule, LayerNorm Scaling (LNS), can keep deep layers useful instead of letting them collapse into identity-like behavior.","viewOriginalPdf":"View original paper (PDF)","coreFlow":{"0":"### [Abstract & Introduction] 3-line summary + problem setup\n\n- **Core problem:** Many deep layers in large LLMs contribute less than expected and can drift toward identity-like behavior.\n- **Classical limitation:** Pre-LN improves optimization stability, but variance can still accumulate with depth.\n- **Key fix:** LNS multiplies the normalized signal by $\\frac{1}{\\sqrt{l}}$, suppressing deep-layer variance growth and restoring useful layer participation.\n\n**Analogy:** Imagine a stadium audio chain with 100 amplifiers in series. Without careful control, tiny noise added at each stage eventually drowns out the original voice. LNS acts like a smart limiter that lowers the volume more aggressively in later amplifiers so the original signal survives all the way to the end.","1":"$37","2":"$38","3":"### [Toy walkthrough] How the formula behaves in motion\n\nConsider a 6-layer transformer where residual additions gradually increase activation amplitude.\n\n1. At $l=1$, the scale is $1.0$, so almost the full signal is passed through.\n2. At $l=2$, the scale becomes about $0.707$, slightly damping the rising amplitude.\n3. At $l=3$, the scale is about $0.577$, further suppressing accumulated noise.\n4. At $l=4$, the scale reaches $0.5$, making later-layer growth much more controlled.\n5. 
At $l=5$ and $l=6$, the scale becomes even smaller, preventing deep-layer blow-up while preserving meaningful transformations.\n\nThe intuition is simple: early layers keep enough freedom to build features, while later layers are prevented from turning into unstable amplifiers.","4":"### [Experiments and results]\n\nThe paper reports that LNS improves convergence behavior from smaller models up to multi-billion-parameter scale.\n\n- It is **hyperparameter-free** in the sense that the rule is fixed by depth.\n- It lowers final loss in large-scale experiments compared with vanilla Pre-LN.\n- It preserves more angular diversity across deep-layer representations, suggesting that late layers remain meaningfully distinct instead of collapsing toward similar states.\n\nFrom an engineering perspective, this is attractive because the implementation cost is tiny while the potential payoff on depth efficiency is large.","5":"### [Conclusion and limitations]\n\n- **Practical value 1:** Better depth utilization creates a stronger starting point for pruning, quantization, and efficiency work.\n- **Practical value 2:** More useful deep features can help downstream fine-tuning and task adaptation.\n- **Practical value 3:** The method is easy to insert into existing Pre-LN pipelines without architectural surgery.\n\n**Limitations:** The analysis mainly targets Pre-LN transformers. Generalization to Post-LN, normalization-free models, and multimodal branches remains an open direction."},"visualPlanTitle":"Visualization plan: uncontrolled amplification vs controlled depth scaling","visualPlan":"The left panel shows variance growing with depth in legacy Pre-LN, while the right panel shows how LNS keeps amplitude under control as layers get deeper. 
For responsive UI, keep `minHeight: 320px` and use an SVG `viewBox`-based layout.","visualLegacyTitle":"Legacy Pre-LN","visualLegacyBody":"Variance keeps building up, so late layers drift toward identity-like behavior.","visualProposedTitle":"Proposed LNS","visualProposedBody":"Depth-aware damping stabilizes amplitude and keeps deep layers useful.","visualAxisStart":"Layer 1","visualAxisEnd":"Layer L","visualLegacyCurveLabel":"Variance growth","visualProposedCurveLabel":"Controlled amplitude","visualContributionLabel":"Layer contribution","visualLegacyBadgeLabel":"Late layers become near-identity","visualProposedBadgeLabel":"Deep layers stay useful","summary":"The appeal of LNS is that it attacks the curse of depth with almost no architectural overhead. The paper turns \"more depth\" from a fragile scaling strategy into something much closer to usable learning capacity."},"paperReviewAlphaFormer":{"sectionTitle":"Learn / Paper review / Core architecture & algorithms / CPAL2026 / AlphaFormer: End-to-End Symbolic Regression of Alpha Factors with Transformers","title":"AlphaFormer: End-to-End Symbolic Regression of Alpha Factors with Transformers","description":"In quant practice, alpha factors still sit awkwardly between **hand-crafted formulas** and **black-box models**. AlphaFormer **pre-trains a Transformer on synthetic time series**, then—given new market data—**emits interpretable symbolic formulas** end-to-end. 
This article dissects the linear alpha pool, IC-based metrics, and PPO-style stabilization line by line.","viewOriginalPdf":"Open original PDF","coreFlow":{"0":"$39","1":"$3a","2":"$3b","3":"$3c","4":"**[Experiments & results]**\n\n- **Search efficiency:** Strong baselines need **far more candidate factors**; AlphaFormer reaches **top-tier IC / Rank IC** on CSI300 & CSI500 with **~one-third the generation budget** in the paper’s story—not a wider needle, but a steadier hand.\n- **Inference efficiency:** **No massive online parameter re-fit** during inference—important for near-real-time stacks.\n- **Generalization:** **Ensembling multiple generative architectures** for synthetics boosts IC; **China-pretrained models zero-shot to US S&P 500** still compete—suggesting partial transfer of **time-series / operator grammar**, not only venue noise.\n\n**Practical read:** If you want **interpretable factors** under **GPU-hour budgets**, “synthetic pre-train + bounded RL fine-tune” is an attractive MLOps compromise.","5":"**[Conclusion & limitations]**\n\n**Takeaways for practitioners (≤3)**\n\n1. **White-box signals:** RPN / operator trees are easy to share with risk as **literal formulas**.\n2. **Lower search tax:** Grammar compression means **less cold-start symbolic search** on every new tape.\n3. **End-to-end story:** generate → pool → IC → (optional) PPO keeps **pipelines short and reproducible**.\n\n**Limitations / future work**\n\n- **Hardware:** GPU-centric training & inference may **exclude CPU-only** legacy stacks.\n- **Regimes:** Impressive zero-shot transfer still may need **retrain or domain adaptation** after structural breaks.\n- **Labels:** IC is only as honest as your **forward-return definition and leakage controls**."},"visualPlanTitle":"Visualization plan: chaotic search vs. controlled generation","visualPlan":"Left: a **search-space scatter** of trials plus a **jagged path** that barely approaches the **IC goal**—cold-start symbolic mining. 
Right: a single **pipeline**—synthetic series → pre-training → tokenized formula generation → IC/pool—for AlphaFormer’s end-to-end story.","visualLegacyTitle":"Legacy: GP / RL symbolic search","visualLegacyBody":"Each new dataset restarts wide exploration; many candidates still yield noisy IC paths.","visualProposedTitle":"Proposed: AlphaFormer","visualProposedBody":"Grammar from synthetics; fewer generations lift IC steadily and zero-shot transfer is plausible.","visualAxisStart":"Trial 1","visualAxisEnd":"Trial N","visualLegacyCurveLabel":"Random search","visualProposedCurveLabel":"Pre-trained gen.","visualContributionLabel":"Cumulative gain","visualLegacyBadgeLabel":"Over-exploration","visualProposedBadgeLabel":"Few factors, high IC","summary":"AlphaFormer reframes “restart symbolic search every market” as **grammar pre-training + safely clipped RL fine-tuning**. Pool, L1, IC, and PPO play roles like **mixer, scissors, judges, seat belt**. Respect **GPU dependence** and **label hygiene** when you pilot."},"paperReviewPolarQuant":{"sectionTitle":"Learn / Paper Review / Model Optimization & Efficient AI / PolarQuant: Quantizing KV Caches with Polar Transformation","title":"Chapter 1: PolarQuant: Quantizing KV Caches with Polar Transformation","description":"In long-context LLM serving, the bottleneck is often not the model weights but the **KV cache memory**. PolarQuant attacks that bottleneck directly: after random preconditioning, it rewrites a KV vector in polar form and **stores angles compactly**, cutting the usual burden of **extra “how to reconstruct the numbers” side information**. This review unpacks the main formulas, why the angle distribution concentrates near $\\pi/4$, and what that means for real systems.","viewOriginalPdf":"View original PDF","coreFlow":{"0":"$3d","1":"$3e","2":"$3f","3":"$40","4":"$41","5":"**[Conclusion & Limitations]**\n\n**Practical significance**\n\n1. 
PolarQuant shows that quantization does not have to carry normalization metadata forever.\n2. It directly targets the memory hotspot of long-context serving.\n3. It changes the cache representation without requiring a new attention mechanism.\n\n**Limitations**\n\n- Codebook construction still leaves room for better analytic designs.\n- The paper is strongest on KV-cache quantization; extending the idea to weights or activations needs more evidence.\n- Real deployment still depends on efficient kernels, packing layouts, and careful implementation.","6":"**[Visualization Plan]**\n\nThe left panel should depict traditional block quantization: many blocks, each carrying **extra helper numbers to reconstruct the stored values**. The right panel should depict PolarQuant: random preconditioning, polar conversion, one radius, and highly concentrated angles near $45^\\circ$."},"visualPlanTitle":"KV storage at a glance","visualPlan":"Legacy stacks FP16 metadata per block; PolarQuant keeps r and angles.","visualLegacyTitle":"Block quant","visualLegacyBody":"Each block still needs extra numbers to decode the short codes, so overhead remains even when values look compressed.","visualProposedTitle":"PolarQuant","visualProposedBody":"After random preconditioning, the method moves to polar coordinates and quantizes concentrated angles instead of storing normalization metadata.","visualAxisStart":"Baseline","visualAxisEnd":"PolarQuant","visualLegacyCurveLabel":"metadata overhead ↑","visualProposedCurveLabel":"footprint ↓","visualContributionLabel":"memory efficiency","visualLegacyBadgeLabel":"+FP16 meta / block","visualProposedBadgeLabel":"r + θ codebook","visualGlossary":{"title":"How to read the diagram labels","items":[{"term":"FP16","hint":"**Half-precision** floating point (16 bits per number). 
About **half the footprint of FP32** for the same count of values; slightly coarser grid of representable numbers."},{"term":"Quantization","hint":"Rounding real values onto a **small set of integer codes** to save bits. At use time you **dequantize**; you often need **per-block helper numbers** to map codes back to the right range."},{"term":"KV","hint":"A chunk of cached Key/Value vectors for past tokens (attention memory)."},{"term":"INT4","hint":"Values packed into 4-bit integers—small, but not usable without extra info."},{"term":"+meta / FP16","hint":"High-precision helper numbers (scale, zero-point, etc.) needed to dequantize; stored separately."},{"term":"× N","hint":"Roughly: that metadata repeats for every block, so cost grows with N."},{"term":"S","hint":"Random matrix that mixes coordinates (preconditioning) before the polar transform."},{"term":"r","hint":"Radius: overall magnitude in polar form."},{"term":"θ","hint":"Angle (direction). Often stored as a codebook index instead of a full float."},{"term":"codebook","hint":"A small table of typical angles—like a palette—so you only store an index."}]},"summary":"PolarQuant is elegant because it changes the coordinate system of the problem. Instead of forcing raw coordinates into low bits and paying normalization overhead, it stores one radius and a set of structured angles. That makes it especially attractive when KV-cache memory, not model size, is the true serving bottleneck."},"paperReviewAutomlAgent":{"sectionTitle":"Learn / Paper Review / Automated ML & ML Pipelines / ICML 2025 / AutoML-Agent: A Multi-Agent LLM Framework for Full-Pipeline AutoML","title":"AutoML-Agent: A Multi-Agent LLM Framework for Full-Pipeline AutoML","description":"AutoML-Agent goes beyond “helping AutoML”: it automates the whole loop from data retrieval, preprocessing, model design, HPO, code generation, to deployment—using a multi-agent LLM framework. 
This article dissects the paper’s core math (input → planning → decomposition → execution → verification) line by line.","viewOriginalPdf":"Open original PDF","coreFlow":{"0":"$42","1":"$43","2":"$44","3":"$45","4":"$46","5":"**[Conclusion & Limitations]**\n\n**Final meaning & practical value (≤3):**\n\n1. **Full-pipeline mindset:** defines AutoML as a continuous pipeline, not a single step.\n2. **RAP + multi-agent:** turns plan search from single-shot generation into guided candidate exploration.\n3. **Verification-first reliability:** reduces the typical LLM failure mode—“looks right, but breaks.”\n\n**Limitations / Future work:**\n\n- **Skeleton/template reliance:** genuinely novel tasks may still require stronger base templates.\n- **Backbone LLM dependency:** stronger LLMs usually produce better plans and code.\n- **Metric sensitivity:** success/performance depends on how SR/NPS (and the verification criteria) are defined.\n\nFinally, a single orchestration diagram summarizes the full pipeline."},"visualPlanTitle":"[Diagram] Full-pipeline orchestration board","visualPlan":"One flowchart panel: standardize the user instruction into $R$, strengthen planning with **RAP**, run **data → model → code** stages on decomposed sub-tasks in parallel, then advance only verified artifacts to **deployment**.","visualLegacyTitle":"Legacy: single-shot plan bottleneck","visualLegacyBody":"Planning and execution are partially automated; failures require manual debugging and iteration.","visualProposedTitle":"AutoML-Agent: RAP + multi-agent + multi-stage verification","visualProposedBody":"Standardize request $R$, generate candidate plans with RAP, decompose into data/model tasks, run in parallel, and verify until deployment-ready.","visualAxisStart":"Natural language","visualAxisEnd":"Deploy","visualDiagramUserNode":"User task","visualDiagramStdNode":"Parsed request","visualDiagramStdCaption":"Structured for tools","visualLegacyCurveLabel":"Cost↑, 
success↓","visualProposedCurveLabel":"Success↑","visualContributionLabel":"Full-Pipeline Control","visualLegacyBadgeLabel":"Uncontrolled","visualProposedBadgeLabel":"Precision control","visualDiagramData":"Data","visualDiagramModel":"Model","visualDiagramOps":"Code","visualDiagramVerify":"Checks","visualDiagramShip":"Ship","visualAnimPhases":["**User task** — the natural-language instruction (paper’s $I$).","**Parsed request** — **structured** so tools and retrieval can use it (paper’s $R$).","**RAP** — **retrieve** papers/code/examples to strengthen planning.","**Data** stage — prepares splits, cleaning, and features.","**Model** stage — architecture, training, and tuning.","**Code** stage — runnable scripts and packaging toward deploy.","**Multi-stage checks** — run, metrics, and deploy readiness gates.","Only outputs that **pass every gate** move to release."],"datasetSectionTitle":"Datasets and Evaluation Setup","datasetSectionContent":"Experiments cover image, text, tabular, time-series, and graph benchmarks, evaluating both success rate and normalized performance.","summary":"AutoML-Agent treats automation as an end-to-end system: RAP accelerates planning, decomposition enables parallel execution, and multi-stage verification locks reliability. 
So even with long math, the whole story compresses into one flow: input standardization → candidate plans → parallel execution → deployable final code."},"paperReviewSela":{"sectionTitle":"Learn / Paper review / AutoML & ML pipelines / ICLR 2025 / SELA: Tree-Search Enhanced LLM Agents for Automated Machine Learning","title":"SELA: Tree-Search Enhanced LLM Agents for Automated Machine Learning","description":"LLM agents often produce **low-diversity, suboptimal** ML code even after many tries, while classical AutoML is tied to **fixed pipelines** and lacks flexibility.\n\n**MCTS (Monte Carlo Tree Search)** builds a **tree** of decisions/experiments and uses rollouts plus **validation scores** to choose **which branch to try next**. **UCT-DP** tweaks the usual **UCT score** for picking the next node so **deep, expensive training steps** are not always behind **shallow, cheap exploration**.\n\n**SELA** represents pipelines as such a **tree**, schedules experiments with **MCTS**, and uses **UCT-DP** to prioritize **deeper training-heavy** branches. 
**Formulas stay as in the paper; each block starts with plain-language notes.**","viewOriginalPdf":"View PDF (arXiv)","chapter1Lead":"# Chapter 1: SELA and tree-search AutoML\n\nSame as above in plain words: MCTS walks the tree using rollouts and validation scores to choose which branch to try next; UCT-DP changes the UCT score used when picking the next node so that deep, expensive training steps are less often pushed aside by shallow exploration.","mctsIntroTitle":"What is Monte Carlo Tree Search (MCTS)?","mctsIntroDescription":"**In short:** future experiments are arranged as a **tree**, and the same four steps repeat.\n\n- **① Pick (selection):** rules like UCT choose **which node to visit next**.\n\n- **② Grow (expansion):** attach a **new child node** (a new try) if it did not exist.\n\n- **③ Roll (rollout):** run code or a simulation on that branch to get a **validation score**.\n\n- **④ Push up (backpropagation):** send that score **up to parents** to update visit counts and averages.\n\nSELA explores LLM-proposed pipeline branches with **these four steps** and validation scores.\n\n**What is UCT?** (Upper Confidence Bound applied to trees) It is the scoring rule for choosing **which sibling child to visit next**. It mixes **average reward so far** (exploit) with **how under‑visited a branch is** (explore) in **one formula**, so you pick the next node by comparing numbers. The paper’s **UCT-DP** tweaks this UCT so **deep training-heavy** steps are not always behind **shallow** search.","mctsPhaseRowTitle":"Four steps (one cycle)","mctsPhase1":"① Pick","mctsPhase2":"② Grow","mctsPhase3":"③ Roll","mctsPhase4":"④ Push up","mctsSvgRoot":"Root","mctsSvgLeft":"Branch A","mctsSvgRight":"Branch B","mctsSvgLeaf":"Rollout","mctsSvgScore":"val. score s","mctsCaption":"Purple dashed line: one example path. 
Repeated runs accumulate scores on each branch.","coreFlow":{"0":"### [Abstract & intro] Three-line summary\n\n**Summary**\n\n- **LLM limits:** Code is often **low-diversity** and **fails to converge** to a good solution.\n- **Classical AutoML:** **Fixed pipelines** (e.g., Auto-sklearn-style) resist **dynamic reconfiguration** when tasks change.\n- **SELA:** Represent pipelines as a **tree**, schedule experiments with **MCTS**, use **validation scores** as feedback. **UCT-DP** biases search toward **deep, training-heavy** nodes.\n\n**Analogy:** Following only the **factory service playbook** ≈ classical AutoML. **Changing suspension, engine map, and tire pressure all at once and doing a single lap** ≈ one-shot LLM codegen. SELA is like a **race engineer who reads sector times and telemetry** (validation scores) and **branches on what to tune next**.","1":"# Chapter 2: Background — five concepts\n\n### [Background]\n\n- **AutoML:** Automate preprocessing, models, hyperparameters; often **try–measure–iterate**.\n\n- **LLM agent:** From natural-language task + data summary, **generate and run code**. SELA splits **plan** vs **code/execute**.\n\n- **Search space:** All feasible **preprocessing × model × hyperparameter** tuples—too large to exhaust.\n\n- **MCTS:** Rollouts + statistics on a tree; balances **exploration** vs **exploitation**.\n\n- **Exploration vs exploitation:** Visit rare children vs deepen high-reward paths. **UCT-DP** adds **prefer deep training**.","2":"$47","3":"$48","4":"# Chapter 5: Experiments\n\n### [Results]\n\nOn **20 ML datasets** (arXiv abstract), SELA reports roughly **65%–80% win rate** vs each baseline—**consistent advantage**. **MCTS beats random search**; more **rollouts** generally **improve** scores—useful for budgeting API/time.","5":"# Chapter 6: Conclusion & visualization guide\n\n### [Conclusion]\n\n**Takeaways (≤3)**\n\n1. **Strong AutoML baselines** without hand-picking every step.\n2. 
**Cache rollouts** to cut API/GPU cost.\n3. **Tree logs** explain **which branch** was taken.\n\n**Limits:** Generalizing to robotics/SWE; larger search spaces need better sample efficiency; clearer **interpretability** needs UI work.\n\n### [Diagram] Summary\n\n- **Classic:** Linear / one-shot flows—weak feedback may not reach target quality.\n- **SELA:** **MCTS + UCT-DP** on a tree, updated with **validation scores**—the **left/right panels** below only sketch this contrast."},"visualPlanTitle":"At-a-glance comparison","visualPlan":"**Left:** fixed order and one-shot generation—feedback can be weak. **Right:** tree search that uses validation scores to choose branches. **Simplified figure** below.","visualLegacyTitle":"Baseline: fixed pipeline / one-shot generation","visualLegacyBody":"One-shot full pipelines or rule-only flows have weak feedback loops; points may not converge to a good score.","visualProposedTitle":"SELA: tree search + UCT-DP","visualProposedBody":"Branch per stage, update average reward from validation scores, and bias toward deep training when promising.","visualAxisStart":"Start","visualAxisEnd":"Target quality","visualLegacyCurveLabel":"Scattered trial path","visualProposedCurveLabel":"Convergence on the tree","visualContributionLabel":"Experiment difficulty axis","visualLegacyBadgeLabel":"Hard to control","visualProposedBadgeLabel":"Controlled experiments","visualLegacyTemplateLabel":"Fixed AutoML template (locked order)","visualLegacyStageFe":"FE / prep","visualLegacyStageModel":"Model","visualLegacyStageTrain":"Train · val","visualLegacyDeadEndHint":"Mismatch → dead end","visualLegacyOneshotLabel":"One-shot LLM: full pipeline code σ in one pass","visualLegacyOpenLoopLabel":"Validation score s rarely feeds back to revise Λ","visualProposedInsightLabel":"Insight pool Λ (LLM)","visualProposedPrunedLabel":"Low UCT · pruned","visualProposedFeedbackLabel":"Validation s → update v(x), n","visualProposedCacheLabel":"Cache σ & 
intermediates","visualProposedUctDpLabel":"UCT-DP: favor deep training","visualProposedRolloutLabel":"MCTS rollouts · simulation","visualProposedBestScoreLabel":"Close to target score","visualSvgLabelPrep":"Data prep","visualSvgLabelModel":"Choose model","visualSvgLabelTrain":"Train & validate","visualSvgLabelStuck":"Stops here","visualSvgLabelOneShot":"All-in-one code","visualSvgLabelLowVal":"Low validation score","visualSvgLabelStart":"Start","visualSvgLabelSkip":"Weaker branch","visualSvgLabelAvg":"Running average","visualSvgLabelDone":"Near the goal","visualSvgFeedbackLine":"Scores feed back upward","summary":"SELA **places LLM ideas on a tree and uses MCTS**; **UCT-DP** avoids **wasting time** on shallow branches before **real training**. **NS** = “make every dataset’s score point the **same way** (higher = better)”; **Rescaled NS** = “set SELA to **1** and see others as **multiples**.” Caching and logs help **cost and explainability**. One line: **search is experiment with feedback**."},"mlChapters":{"mlSectionLabels":{"whatIs":"What the concept is","whyImportant":"Why it matters","howUsed":"How it is used","problemSolving":"Summary"},"mlKnnProblemSolvingLabel":"Explanation for solving the problems","mlKnnVisualIntro":"Pick the K=3 nearest neighbors to the new point (?), then predict by majority vote of their labels.","mlKnnVisualCaption":"Dashed circles: distance order. K=3 neighbors (purple) labels: 1, 2, 2 → majority 2","mlKnnVisualStep0":"① Training data — points in feature space (labels 1 or 2)","mlKnnVisualStep1":"② New point (?) 
appears — we predict its label","mlKnnVisualStep2":"③ Find distance to the K=3 nearest (dashed circles)","mlKnnVisualStep3":"④ Connect to K=3 neighbors — in order of distance","mlKnnVisualStep4":"⑤ Majority vote: labels 1, 2, 2 → predict 2","mlLinearRegressionVisualIntro":"Find the line $\\hat y = w x + b$ that best fits the data points.","mlLinearRegressionVisualStep0":"① Training data — (x, y) scatter plot","mlLinearRegressionVisualStep1":"② Wrong initial line — before gradient descent","mlLinearRegressionVisualStep2":"③ Line learns and moves to optimal position","mlLinearRegressionVisualStep3":"④ Learning complete — predict $\\hat y$ from new $x$","mlLinearRegressionVisualCaption":"$y \\approx 0.7x + 1.1$ — $w$, $b$ learned by gradient descent","mlLinearRegressionVisualLearningBadge":"Learning...","mlLinearRegressionVisualPlay":"Watch line learning process","mlLinearRegressionVisualReplay":"Replay","mlLinearRegressionProblemSolvingLabel":"Explanation for solving the problems","mlMseVisualIntro":"**Regression loss example:** MSE is the average of squared errors between prediction $\\hat y$ and actual $y$. (For classification we use cross-entropy.)","mlMseVisualStep0":"① Data points and prediction line $\\hat y = w x + b$","mlMseVisualStep1":"② Error (residual) bars from each point to the line","mlMseVisualStep2":"③ Squared errors $(y_i - \\hat y_i)^2$ visualized","mlMseVisualStep3":"④ MSE $= \\frac{1}{n}\\sum_i (y_i - \\hat y_i)^2$","mlMseVisualCaption":"MSE $= \\frac{1}{n}\\sum_i (y_i - \\hat y_i)^2$ — the smaller the loss, the better the line fits the data.","mlMseVisualSquaresLabel":"Squared error = area (side length = |residual|)","mlMseProblemSolvingLabel":"Explanation for solving the problems","mlLogisticProblemSolvingLabel":"Explanation for solving the problems","mlLogisticVisualIntro":"The larger the linear score $z$, the closer $\\sigma(z)$ is to 1, so we classify as class 1. 
$z=0$ is the decision boundary.","mlLogisticVisualCaption":"Sigmoid: $\\sigma(z) = \\frac{1}{1+e^{-z}}$. When $z>0$, $\\hat y=1$; when $z \\le 0$, $\\hat y=0$.","mlLogisticVisualFormulaExplain":"**How to read the formula** — When $z$ is large and negative, $e^{-z}$ is large so $\\sigma(z) \\approx 0$. When $z=0$, $\\sigma(0)=0.5$. When $z$ is large and positive, $e^{-z} \\approx 0$ so $\\sigma(z) \\approx 1$. So the formula squeezes any $z$ into a probability between 0 and 1.","mlLogisticVisualXAxisLabel":"z (linear score)","mlLinearRegressionProblemSolvingTable":"$49","mlKnnProblemSolvingTable":"**Algorithm steps**\n\n- **Input** — New feature vector $\\mathbf{x}$\n- **Stored** — Labeled examples $(\\mathbf{x}_i, y_i)$\n- **1** — Compute distance $d(\\mathbf{x}, \\mathbf{x}_i)$ to each $\\mathbf{x}_i$\n- **2** — Select K smallest distances\n- **3 (classification)** — Predict $\\hat y$ by **majority vote** of the K labels\n- **3 (regression)** — Predict $\\hat y$ as **average** of the K $y_i$ values","mlDataFeature":{"chapter":"Chapter 00","title":"Data and Features: The Start of Machine Learning","description":"Machine learning starts with data. We turn images, text, and numbers into **features**—numeric representations that let the model learn patterns. The world of numbers and functions from Basic Math Ch00 becomes reality here.","sectionTitle":"What are Data and Features?","whatIs":{"0":"**Data is the raw material of machine learning** — As we learned in Basic Math Ch00, deep learning and machine learning turn images, text, and sound into **numbers**. These **numeric inputs** paired with **labels** (correct answers) form **data**. For example, 'cat image + cat' is one data point, and thousands of such pairs become the material for the model to learn from.","1":"**Features are the numeric essence of data** — A photo we see is just a pile of tens of thousands of pixel numbers to a computer. 
**Features** are the useful information—like ear shape, eye size, fur color—extracted and expressed as numbers. Mathematically they are **vectors**, extracted from raw data through **functions**. The 'functions that define input-output rules' from Ch00 handle this transformation.","2":"**In short** — Data is a collection of (input, label) pairs; features are the result of turning that input into **numeric vectors** the model can understand. Good features lead to better learning; bad features hurt performance even with lots of data. The start of machine learning is deciding what data to use and what features to extract."},"whyImportant":{"0":"**Without data, learning is impossible** — Every decision a model makes is the result of **numbers and functions**. As in Ch00, to follow the AI computation we need data expressed as **numbers**. If data is scarce or labels are wrong, the model learns the wrong patterns.","1":"**Feature design sets the model's limits** — Deciding which information to turn into numbers is called **feature engineering**. Using only 'yesterday's closing price' vs. adding 'moving average, volume, volatility' for stock prediction leads to very different results. **Vectors and matrices** bundle many features for batch computation—a core part of the Ch00 roadmap—and the quality of features drives model performance.","2":"**Bridge to the next chapters** — Ch02 KNN, Ch03 Linear Regression, Ch05 Logistic Regression, and all ML algorithms take **feature vectors** as input. Understanding data and features is needed to interpret why a model made a given prediction, and the later chapters on **differentiation** and **probability** build on this foundation."},"howUsed":{"0":"**Input → feature extraction → model → prediction** — The ML pipeline matches the **input → numeric conversion → repeated functions → output** structure from Ch00. Feature extraction is the 'numeric conversion' step; models (linear regression, KNN, etc.) are sets of **functions**. 
**Differentiation** is used to reduce error during training; **probability** expresses uncertainty in predictions like '90% chance this image is a cat'."},"problemSolving":{"0":"**Data and features** — **Data** are (input $\\mathbf{x}$, target $y$) pairs; **features** encode observations as numbers that form the **feature vector** $\\mathbf{x}$. The **target** is the $y$ you want to predict; the **model** learns $y \\approx f(\\mathbf{x})$; **evaluation** uses loss and metrics.","1":"**Example (concept)**\n\nWhich best describes a feature vector? ① labels only ② numeric encoding of inputs ③ the loss function\n\nFeatures are the numeric vector built from inputs. → **Answer ②**\n\n---\n\n**Example (analogy table)**\n\n| Concept | Real-estate analogy | ML |\n| --- | --- | --- |\n| **Data** | Past transactions | Pairs $(x,y)$ |\n| **Feature** | Size, location | **Input vector $\\mathbf{x}$** |\n| **Target** | Sale price | **Label $y$** |\n| **Model** | \"Price per unit area\" rule | **Function $y=f(x)$** |\n| **Evaluation** | Compare estimate vs actual | **Loss** |"}},"mlMissingValueImputation":{"chapter":"Chapter 01","title":"Missing Value Handling: Strategies to Fill Data Gaps","description":"This chapter introduces practical missing-value handling from concept to deployment: single vs multiple imputation, outlier detection (Box Plot, Mahalanobis Distance, Isolation Forest, SVDD), and imbalance-aware resampling (Tomek Links, SMOTE, ADASYN, hybrid resampling)."},"mlSupervisedUnsupervisedSelf":{"chapter":"Chapter 02","title":"Supervised, Unsupervised, and Self-Supervised Learning","description":"Machine learning is often divided into **supervised**, **unsupervised**, and **self-supervised** learning depending on how data is used. 
**Supervised learning** is like studying with an answer key; **unsupervised learning** is like finding patterns and grouping similar items without labels; **self-supervised learning** is like masking part of the data and learning by predicting the missing part. This chapter summarizes the core ideas, math, and real-world use of these three paradigms so you can build a solid base for the algorithms covered later.","sectionTitle":"Three Ways of Learning: Supervised, Unsupervised, Self-Supervised","whatIs":{"0":"**Supervised Learning: Learning from input–label pairs**\nThe model is given **input $\\mathbf{x}$** and the corresponding **label (target) $y$** as pairs. The goal is to approximate a function $y = f(\\mathbf{x})$. Formally we have a training set $\\mathcal{D} = \\{(\\mathbf{x}_1, y_1), (\\mathbf{x}_2, y_2), \\ldots\\}$ and find $f$ by **minimizing a loss** (e.g. MSE, cross-entropy). Ch02 KNN, Ch03 Linear Regression, Ch04 Logistic Regression are all supervised.\n* **Example 1 (classification)**: Spam filter—email content ($\\mathbf{x}$) → spam or not ($y$).\n* **Example 2 (regression)**: House price—area, location ($\\mathbf{x}$) → price ($y$).\n* **Example 3 (medical)**: Patient test values ($\\mathbf{x}$) and diagnosis ($y$) for decision support.","1":"**Unsupervised Learning: Discovering hidden structure**\nOnly **input $\\mathbf{x}$** is given; there is **no label $y$**. 
Think of it as \"only questions, no answer key.\" The goal is to find **structure, patterns, or clusters** using **distance and similarity** between $\\mathbf{x}$s: group similar points (clustering), compress to fewer dimensions (dimensionality reduction), or flag **anomalies** that fall outside the normal pattern.\n* **Example 1 (clustering)**: Customer age and purchase history ($\\mathbf{x}$) → segment similar customers.\n* **Example 2 (anomaly detection)**: Learn normal payment patterns ($\\mathbf{x}$), then flag unusual transactions.\n* **Example 3 (dimension reduction)**: Reduce many features to 2–3 numbers for visualization or denoising. (You’ll learn concrete methods later.)","2":"**Self-Supervised Learning: Creating targets from data**\nInstead of human labels, the model creates **pseudo-labels** from the data. Typical flow: (1) **Mask** part of the input (e.g. a word, an image patch). (2) **Predict** the masked part from the rest. (3) **Use** the learned representation for downstream tasks with a small amount of supervised data. This is how BERT, GPT, and many vision models are pre-trained on large unlabeled corpora.\n* **Example 1 (language)**: \"I ate [ MASK ]\" → predict the masked word from context (LLMs).\n* **Example 2 (vision)**: Mask a region of an image and reconstruct it from the rest.\n* **Example 3 (contrastive)**: Treat two augmented views of the same image as \"same\" and different images as \"different\" to learn representations."},"whyImportant":{"0":"**Data nature and cost** — Building labels for all data is expensive. When labels are sufficient, **supervised** is effective; when they are scarce, **unsupervised** or **self-supervised** use unlabeled data, then a small supervised fine-tuning step. **Interpretability** also differs: supervised allows some explanation via loss and decision path; unsupervised/self-supervised require separate interpretation (e.g. 
cluster names, visualization).","1":"**Pre-training and fine-tuning** — Modern pipelines often use **self-supervised** pre-training on large unlabeled data, then **supervised** fine-tuning on a small labeled set. **Unsupervised** is common in preprocessing and exploration—e.g. cluster customers with K-Means, assign human meanings to clusters (e.g. \"loyal\", \"churn risk\"), then build a supervised churn model. Choosing the right paradigm makes the pipeline clear and realistic given data size and label cost."},"howUsed":{"0":"**Supervised** — Ch02 KNN, Ch03 Linear Regression, Ch04 Logistic Regression learn from (input, label) pairs. **Classification**: spam filter, disease prediction, image classification. **Regression**: house price, sales, temperature—Ch03/Ch04 cover the math and optimization.","1":"**Unsupervised** — Ch08 K-Means clusters data without labels; **dimension reduction** (reducing many features to 2–3 numbers) is another key tool. **Clustering**: customer segmentation, topic grouping. **Anomaly detection**: learn a \"normal\" region, flag points outside it.","2":"**Self-supervised** — BERT (masked word prediction), GPT (next-token prediction), and **contrastive learning** in vision are widely used. After pre-training, a small amount of labeled data is used for QA, summarization, or classification."},"problemSolving":{"0":"For supervised vs unsupervised vs self-supervised, ask: are labels **human-given**, **absent**, or **derived from the data**? **Supervised** learning fits $y=f(\\mathbf{x})$ from $(\\mathbf{x},y)$ pairs; **unsupervised** learning finds clusters/structure from $\\mathbf{x}$ only; **self-supervised** learning builds targets (e.g. masked words, next token), learns representations, then often fine-tunes with a little labeled data.","1":"**Example (concept understanding)**\n\nLearning spam vs not-spam with **human-provided labels** is closest to? 
① Supervised ② Unsupervised ③ Self-supervised\n\nTraining on human-annotated answers is the hallmark of supervised learning. → **Answer ①**\n\n---\n\n**Example (T/F)**\n\n\"Learning that clusters customers without any labels is unsupervised.\" Answer 1 if true, 0 if false.\n\nClustering without labels is exactly what unsupervised learning does. → **Answer 1**\n\n---\n\n**Example (application)**\n\nPredicting **masked tokens** to learn representations is closest to? ① Supervised only ② Clustering only ③ Masked LM / contrastive pre-training\n\nSelf-created targets from the input match self-supervised pre-training. → **Answer ③**"},"mlSupervisedUnsupervisedSelfVisualIntro":"Three learning paradigms: supervised (input–label pairs), unsupervised (no label), self-supervised (self-created target).","mlSupervisedUnsupervisedSelfVisualStep0":"Supervised: learn a prediction function from (input, label) pairs","mlSupervisedUnsupervisedSelfVisualStep1":"Unsupervised: discover structure and clusters without labels","mlSupervisedUnsupervisedSelfVisualStep2":"Self-supervised: learn representations from self-created targets","mlSupervisedUnsupervisedSelfProblemSolvingLabel":"Problem-solving guide","mlSupervisedUnsupervisedSelfVisualPhase0Title":"Supervised: input x and label y come in pairs","mlSupervisedUnsupervisedSelfVisualPhase0Caption":"When (x, y) pairs are given in order, the model learns the rule","mlSupervisedUnsupervisedSelfVisualPhase1Title":"Unsupervised: only input x (no label y)","mlSupervisedUnsupervisedSelfVisualPhase1Caption":"There is no y (label), only x. 
Some x blink on and off → the model still finds structure and clusters","mlSupervisedUnsupervisedSelfVisualPhase1NoLabelBadge":"No label","mlSupervisedUnsupervisedSelfVisualPhase2Title":"Self-supervised: mask part of the data and predict the gap","mlSupervisedUnsupervisedSelfVisualPhase2Caption1":"Mask part of the input","mlSupervisedUnsupervisedSelfVisualPhase2Caption2":"Model predicts the masked part","mlSupervisedUnsupervisedSelfVisualPhase2Caption3":"The gap is filled with the predicted word","mlSupervisedUnsupervisedSelfVisualPhase2Prefix":"I ate ","mlSupervisedUnsupervisedSelfVisualPhase2Suffix":"","mlSupervisedUnsupervisedSelfVisualPhase2Filled":"rice","mlSupervisedUnsupervisedSelfVisualPhase2Example":"e.g. fill in the blank → representation learning (BERT, etc.)","mlSupervisedUnsupervisedSelfVisualPhase2Step1":"Mask","mlSupervisedUnsupervisedSelfVisualPhase2Step2":"Predict","mlSupervisedUnsupervisedSelfVisualPhase2Step3":"Fill","mlSupervisedUnsupervisedSelfVisualAutoCycle":"All three types animate at the same time","problemAnswerHint":"Choose the matching learning paradigm below.","mcAnswerSupervised":"Supervised","mcAnswerUnsupervised":"Unsupervised","mcAnswerSelfSupervised":"Self-supervised","mcAnswerDistractor":"Reinforcement learning","problems":{"definition_1_0":"Learning from data where input and label (answer) are paired is which type? ①Supervised ②Unsupervised ③Self-supervised","definition_1_1":"Learning $y=f(\\mathbf{x})$ from (input $\\mathbf{x}$, label $y$) pairs is which type? ①Supervised ②Unsupervised ③Self-supervised","definition_1_2":"The learning type likened to a teacher marking answers with a red pen is? ①Supervised ②Unsupervised ③Self-supervised","definition_1_3":"Learning that uses human-provided labels for classification or regression is? ①Supervised ②Unsupervised ③Self-supervised","definition_1_4":"The main learning type that learns classification or regression from (input, label) pairs is? 
①Supervised ②Unsupervised ③Self-supervised","definition_1_5":"Learning where the data comes with a target and the model is trained to match it is? ①Supervised ②Unsupervised ③Self-supervised","definition_2_0":"Learning that finds structure, patterns, or clusters from input only, without labels, is? ①Supervised ②Unsupervised ③Self-supervised","definition_2_1":"When there is no label $y$, only $\\mathbf{x}$, finding groups in the data is which type? ①Supervised ②Unsupervised ③Self-supervised","definition_2_2":"Clustering similar data without labels corresponds to which learning type? ①Supervised ②Unsupervised ③Self-supervised","definition_2_3":"The learning type likened to finding and grouping types by yourself is? ①Supervised ②Unsupervised ③Self-supervised","definition_2_4":"Label-free learning often used for dimensionality reduction or anomaly detection is? ①Supervised ②Unsupervised ③Self-supervised","definition_2_5":"Discovering structure in data without human-provided answers is which type? ①Supervised ②Unsupervised ③Self-supervised","definition_3_0":"Learning from a 'pseudo-label' created from the data itself is? ①Supervised ②Unsupervised ③Self-supervised","definition_3_1":"Learning that creates its own target (e.g. masked word, next sentence) is? ①Supervised ②Unsupervised ③Self-supervised","definition_3_2":"Learning by masking part of a sentence and predicting that part is? ①Supervised ②Unsupervised ③Self-supervised","definition_3_3":"The paradigm used to learn representations from large unlabeled data is? ①Supervised ②Unsupervised ③Self-supervised","definition_3_4":"The learning type likened to making your own practice test and solving it is? ①Supervised ②Unsupervised ③Self-supervised","definition_3_5":"Learning that creates 'same vs. different' pairs by itself to learn representations is? ①Supervised ②Unsupervised ③Self-supervised","taskClassify_0":"Spam vs. non-spam classification (with labels) is which learning type? 
①Supervised ②Unsupervised ③Self-supervised","taskClassify_1":"Grouping similar customers from purchase data only, with no labels, is? ①Supervised ②Unsupervised ③Self-supervised","taskClassify_2":"Predicting masked words in sentences to learn word representations is? ①Supervised ②Unsupervised ③Self-supervised","taskClassify_3":"Predicting apartment price from size and location is? ①Supervised ②Unsupervised ③Self-supervised","taskClassify_4":"Grouping similar images with no labels (clustering) is? ①Supervised ②Unsupervised ③Self-supervised","taskClassify_5":"Pre-training on large text then fine-tuning with few labels—the pre-training stage is? ①Supervised ②Unsupervised ③Self-supervised","taskClassify_6":"Building a disease-prediction model from medical images and disease labels is? ①Supervised ②Unsupervised ③Self-supervised","taskClassify_7":"Customer segmentation by grouping similar customers only, with no labels, is? ①Supervised ②Unsupervised ③Self-supervised","taskClassify_8":"Learning context representations by predicting the next sentence is? ①Supervised ②Unsupervised ③Self-supervised","taskClassify_9":"Predicting exam score from study time is? ①Supervised ②Unsupervised ③Self-supervised","taskClassify_10":"Anomaly detection when only normal data exists and almost no anomaly labels is closest to? ①Supervised ②Unsupervised ③Self-supervised","taskClassify_11":"Learning representations by predicting a masked part of an image from the rest is? ①Supervised ②Unsupervised ③Self-supervised","scenario_0":"A hospital trains a model on past patient data (symptoms, tests) and diagnosis (label) to predict 'Does this patient have disease A?' This is? ①Supervised ②Unsupervised ③Self-supervised","scenario_1":"A store splits customers into groups using only purchase history, with no extra labels. This is? ①Supervised ②Unsupervised ③Self-supervised","scenario_2":"A model is trained by masking 15% of words in Wikipedia and predicting them. This is? 
①Supervised ②Unsupervised ③Self-supervised","scenario_3":"A model predicts tomorrow's sales from weather, date, and past sales (label). This is? ①Supervised ②Unsupervised ③Self-supervised","scenario_4":"Video data is indexed by grouping similar scenes with no labels. This is? ①Supervised ②Unsupervised ③Self-supervised","scenario_5":"Context is learned by predicting 'next sentence' on large documents, then fine-tuned with few QA labels. The first stage is? ①Supervised ②Unsupervised ③Self-supervised","scenario_6":"A classifier is trained on dog/cat images with species labels. This is? ①Supervised ②Unsupervised ③Self-supervised","scenario_7":"Stock price series only, no labels; patterns are split into segments. This is? ①Supervised ②Unsupervised ③Self-supervised","scenario_8":"Same sentence in different wording; 'same meaning' is used as target to learn representations. This is? ①Supervised ②Unsupervised ③Self-supervised","scenario_9":"An application (experience, education) and pass/fail (label) are used to build a pass-prediction model. This is? ①Supervised ②Unsupervised ③Self-supervised","scenario_10":"News articles only, no topic labels; articles are grouped by topic. This is? ①Supervised ②Unsupervised ③Self-supervised","scenario_11":"Speech representations are learned by masking and reconstructing parts of audio. This is? ①Supervised ②Unsupervised ③Self-supervised","trueFalse_0":"\"Learning from data where input and label are paired\" describes supervised learning. Which type is this? ①Supervised ②Unsupervised ③Self-supervised","trueFalse_1":"\"Finding only structure in data without labels\" describes unsupervised learning. Which type is this? ①Supervised ②Unsupervised ③Self-supervised","trueFalse_2":"\"Learning from a target created from the data (e.g. masked word)\" describes self-supervised learning. Which type is this? ①Supervised ②Unsupervised ③Self-supervised","trueFalse_3":"Fitting a function to predict a value from (input, label) pairs. 
Which learning type? ①Supervised ②Unsupervised ③Self-supervised","trueFalse_4":"Splitting data into K groups using only the data, no labels. Which learning type? ①Supervised ②Unsupervised ③Self-supervised","trueFalse_5":"Learning by predicting masked words in a sentence. Which learning type? ①Supervised ②Unsupervised ③Self-supervised","trueFalse_6":"Learning from human-provided pass/fail labels. Which learning type? ①Supervised ②Unsupervised ③Self-supervised","trueFalse_7":"\"Grouping similar items from data only, with no answers\" describes unsupervised learning. Which type is this? ①Supervised ②Unsupervised ③Self-supervised","trueFalse_8":"Learning representations from self-created 'same/different' pairs. Which learning type? ①Supervised ②Unsupervised ③Self-supervised","trueFalse_9":"Using (input, label) pairs at training time and predicting the label for new input. Which learning type? ①Supervised ②Unsupervised ③Self-supervised","trueFalse_10":"In anomaly detection, learning a 'normal region' from normal data only is closest to unsupervised. Which type is this? ①Supervised ②Unsupervised ③Self-supervised","trueFalse_11":"\"Learning context by predicting the next sentence\" is self-supervised. Which type is this? ①Supervised ②Unsupervised ③Self-supervised"}},"mlKnn":{"chapter":"Chapter 03","title":"K-Nearest Neighbors (KNN): Birds of a Feather","description":"**Birds of a feather flock together** — KNN finds the **K nearest** stored examples and uses their labels (majority vote) to predict the new one. No fancy training; just **distance** and neighbors.","sectionTitle":"K-Nearest Neighbors (KNN): Birds of a Feather","whatIs":{"0":"**What is KNN?** — For a new data point, we pick the **K closest** points among labeled data and assign the **majority label**. 
Example: if 4 of the 5 nearest emails are spam, the new email is classified as spam.","1":"**'Closest' means distance in feature space** — Usually **Euclidean distance**: $d(\\mathbf{x}, \\mathbf{y}) = \\sqrt{\\sum_{i}(x_i - y_i)^2}$. With two features, this is the straight-line distance on the plane.","2":"**K is a hyperparameter** — K=1 uses only the single nearest neighbor; larger K smooths the decision but can blur boundaries. In binary classification, an odd K is often used to avoid tied votes."},"whyImportant":{"0":"**No explicit training (lazy learning)** — KNN does not learn a compact model; at prediction time it computes distances to all stored points. Training cost is low; prediction cost can be high.","1":"**Interpretable** — We can explain a prediction by showing the K neighbors (e.g. \"spam because 4 of 5 similar emails were spam\"), which supports explainable AI.","2":"**Useful as a baseline** — Before trying complex models, KNN gives a quick sense of how well the data can be classified."},"howUsed":{"0":"**Classification** — Majority vote among the K neighbors' labels. Used in image classification, spam detection, risk bands, etc.","1":"**Regression** — Predict the **average** of the K neighbors' target values (e.g. house price from nearby sales).","2":"**Distance and scale** — If features have different scales, distance is dominated by the feature with the largest numeric range. **Normalization** or **standardization** is recommended before computing distances."},"problemSolving":{"0":"**KNN** — Compute **distances** from the new point $\\mathbf{x}$ to stored points, pick the **K nearest** neighbors, then **majority vote** (classification) or **average** (regression). **Lazy learning**: no weights are learned at train time; the training data itself is stored.
**Normalize or standardize** features when scales differ so distance is fair.","1":"$4a"},"problemSolvingTable":"**Algorithm steps**\n\n- **Input** — New feature vector $\\mathbf{x}$\n- **Stored data** — Pairs $(\\mathbf{x}_i, y_i)$\n- **Step 1** — Compute $d(\\mathbf{x}, \\mathbf{x}_i)$ for every $\\mathbf{x}_i$\n- **Step 2** — Take the K smallest distances\n- **Step 3 (classification)** — **Majority vote** among the K labels → $\\hat y$\n- **Step 3 (regression)** — **Average** of the K target values → $\\hat y$"},"mlLinearRegression":{"chapter":"Chapter 04","title":"Linear Regression: A Line Through the Data","description":"When data points are scattered, **linear regression** finds the **line that best fits** their trend and predicts values for new inputs. It is the first regression model where you can see how **functions**, **derivatives**, and **partial derivatives** from Basic Math lead directly to machine learning 'training'.","sectionTitle":"Linear Regression: A Line Through the Data","whatIs":{"0":"**What is linear regression?** — We assume a **linear relationship** $y = w x + b$ (or $y = \\mathbf{w}^\\top \\mathbf{x} + b$ for multiple variables) between input $x$ and output $y$, and find the **weights $w$ and intercept $b$** that best fit the data. The **function** $y = f(x)$ from Basic Math Ch01 is here a concrete **linear function**.","1":"**What does 'best fit' mean?** — We minimize the **error** between predictions $\\hat y_i = w x_i + b$ and actual values $y_i$. The function that measures this error is the **loss function**; **MSE (Mean Squared Error)**, covered in Ch05, is the most common.","2":"**Difference from KNN** — KNN predicted by the 'average of neighbors'; linear regression learns and stores **one formula (a line)**.
At prediction time, we only compute $\\hat y = w x + b$ without searching for neighbors."},"whyImportant":{"0":"**First application of differentiation and optimization** — To minimize error, we use **differentiation** (Basic Math Ch06). Stepping in the direction opposite to the **gradient** of the loss with respect to $w$ and $b$ moves the parameters toward the minimum. This is **gradient descent**, the same principle behind deep learning training.","1":"**Interpretability** — The learned $w$ tells us 'how much $y$ changes when $x$ increases by 1'. For example, with house area ($x$) and price ($y$), $w > 0$ means 'larger area, higher price'—matching intuition. This **interpretability** matters when trusting and improving models in practice.","2":"**Foundation for other models** — Logistic regression (Ch06) and a single neuron in a neural network both use 'linear transformation + nonlinear function'. Understanding linear regression clarifies how their **linear part** works."},"howUsed":{"0":"**Regression** — Used to predict **continuous numbers**: house prices, sales, temperature, scores. With multiple features, $y = w_1 x_1 + w_2 x_2 + \\cdots + w_n x_n + b$ becomes **multiple linear regression**.","1":"**Feature importance** — When features are on comparable scales, features with larger $|w_i|$ have more influence on predictions. When doing feature engineering (Ch01), we use these values to decide which features to keep or drop.","2":"**Normal equation vs gradient descent** — With few features, the **normal equation** gives the optimal solution in one step. With many features or large data, **gradient descent** updates $w$ iteratively. **Partial derivatives and gradients** from Basic Math Ch08 are the key tools here."},"visual":"Visualization of line fitting during linear regression training.","problemSolving":{"0":"**Linear regression** — Model $\\hat y = wx + b$. **Training** updates $w$, $b$ to minimize the loss (e.g. MSE). **Predict** by substitution.
**Slope** from two points, **intercept** $b=y-wx$, **residual** $y-\\hat y$.","1":"**Example (key terms)**\n\n- **Line** — $\\hat y = wx + b$\n- **Predict** — $\\hat y = wx + b$ for given $x$\n- **Slope** — $(y_2-y_1)/(x_2-x_1)$\n- **Intercept** — $b = y - wx$\n- **Residual** — $y - \\hat y$\n\n---\n\n**Example (predict)**\n\nLine $\\hat y = 2x + 1$, $x=3$: $\\hat y$?\n\n$7$. → **Answer 7**\n\n---\n\n**Example (slope)**\n\nThrough (1,3) and (4,9), slope $w$?\n\n$2$. → **Answer 2**\n\n---\n\n**Example (intercept)**\n\nSlope 2, through (3,7): $b$?\n\n$1$. → **Answer 1**\n\n---\n\n**Example (two-point predict)**\n\nThrough (0,1) and (2,5), $\\hat y$ at $x=1$?\n\n$3$. → **Answer 3**\n\n---\n\n**Example (residual)**\n\nLine $\\hat y=2x+1$, point (3,8): residual?\n\n$1$. → **Answer 1**\n\n---\n\n**Example (sum of residuals)**\n\nPoints (0,2), (1,4), line $\\hat y=2x+1$: sum of residuals?\n\n$2$. → **Answer 2**"}},"mlMse":{"chapter":"Chapter 05","title":"Loss Function (MSE · Cross-Entropy · R²): Measuring Prediction Error","sectionTitle":"Loss Function (MSE · Cross-Entropy · R²): Measuring Prediction Error","description":"A **loss function** turns how wrong the model is into **one number**. For **regression**, we often use **mean squared error (MSE)** from the gap between $\\hat y$ and $y$, and we also look at **$R^2$** (coefficient of determination) to understand how much variation the model explains. For **classification**, we measure how far predicted class probabilities are from the truth with **cross-entropy**. 
The diagram below shows **MSE as a regression example** of how loss decreases.","whatIs":{"0":"**Regression: MSE**\n\nWe need a **loss** that summarizes error in one number.\n\n- **Residual** — difference between actual $y$ and prediction $\\hat y$.\n- **SSE** — sum of $(y_i - \\hat y_i)^2$ over all points (sum of squared errors).\n- **MSE** — SSE divided by the number of points $n$ (mean squared error).\n\n$\\text{MSE} = \\frac{1}{n}\\sum_i (y_i - \\hat y_i)^2 = \\text{SSE}/n$. Smaller MSE means a better fit.","1":"**Why square?**\n\n- Residuals $+2$ and $-2$ both mean \"off by 2\"; raw sums can cancel.\n- **Squaring** makes every term non-negative, so only magnitude is compared.\n- Large errors get a bigger **penalty**, so the model avoids large mistakes.","2":"**Linear regression**\n\nThe line $\\hat y = w x + b$ from Ch.04 is \"best\" when **MSE** (or **SSE**) is minimized—we choose $w$ and $b$ that minimize average squared error.\n\n**Gradient descent** updates $w$ and $b$ step by step in the direction that lowers MSE.","3":"**Regression: MSE is the average of squared residuals**\n\nMSE is an error score made by squaring residuals $y_i-\\hat y_i$ and taking the average.
As predictions get closer to the true values, residuals shrink and MSE becomes smaller.","4":"**Unpacking MSE**\n\n$\\text{MSE} = \\frac{1}{n}\\sum_i (y_i - \\hat y_i)^2$\n\n- **$i$** — sample index.\n- **$y_i$** — actual value at that point.\n- **$\\hat y_i$** — predicted value.\n- **$y_i - \\hat y_i$** — **residual**.\n- **$(y_i - \\hat y_i)^2$** — **squared error** at that point.\n- **$\\sum_i$** — sum over points = **SSE**.\n- **$\\frac{1}{n}$** — average = **MSE**.\n\nCloser predictions → smaller residuals and smaller MSE.","5":"$4b","6":"$4c"},"whyImportant":{"0":"**Learning direction** — In regression with MSE loss, the model updates in directions that **reduce MSE**—a clear objective.","1":"**MSE: smooth and easy to optimize** — Squared error is smooth and easy to differentiate, so gradient descent works well.","2":"**RMSE** — MSE uses squared units; $\\sqrt{\\text{MSE}}$ (**RMSE**) restores the same units as $y$ for interpretation.","3":"**Match loss to task** — Continuous targets fit **MSE**; class probabilities fit **cross-entropy**, which aligns with **maximum likelihood**. **Ch.06 logistic regression** connects sigmoid outputs $\\hat p$ to this loss."},"howUsed":{"0":"**Regression training** — Train with **MSE** for prices, temperatures, etc.","1":"**Model comparison (regression)** — On the same evaluation data, smaller **MSE** means better fit.","2":"**Deep learning regression** — Neural nets predicting numbers often use MSE at the output.","3":"**Classification** — Logistic regression, softmax classifiers, and neural classifiers typically minimize **cross-entropy**."},"visual":"...","problemSolving":{"0":"**Loss functions** — Regression uses **MSE**; classification uses **cross-entropy** for probabilities; **$R^2$** compares model error to the mean baseline. **RMSE** $=\\sqrt{\\text{MSE}}$ restores units.","1":"**Example (SSE)**\n\n$n=2$, residuals 3 and −3: sum of squared errors?\n\n$18$.
→ **Answer 18**\n\n---\n\n**Example (cross-entropy idea)**\n\nIf $y=1$ and $\\hat p=0.9$, loss behaves like? ① very large ② about $-\\log 0.9$ (small) ③ always 0\n\n②. → **Answer ②**\n\n---\n\n**Example ($R^2$)**\n\nIf SSE < SST, $R^2$ can be? ① always negative ② between 0 and 1 ③ always 1\n\n②. → **Answer ②**"}},"mlLogistic":{"chapter":"Chapter 06","title":"Logistic Regression: Pass or Fail?","description":"Where linear regression predicts a 'score', **logistic regression** is the specialist for **yes/no** classification—e.g. \"Will this score mean **pass (1)** or **fail (0)**?\" It uses the **sigmoid function** to turn a score into a probability between 0 and 1.","sectionTitle":"Logistic Regression: Pass or Fail?","whatIs":{"0":"**The S-curve: sigmoid** — The score $z$ from a linear model can be large or negative. Probabilities must lie between 0 and 1. The **sigmoid** $\\sigma(z) = \\frac{1}{1+e^{-z}}$ maps any real $z$ into (0, 1).","1":"**Decision boundary** — When the sigmoid outputs e.g. \"probability of pass = 0.7\", we need a rule. Usually we use **0.5**: if probability ≥ 0.5 we predict **1 (yes)**, otherwise **0 (no)**.","2":"**Same core as linear regression** — Logistic regression still computes a score $z = wx + b$ first; the only difference is passing that score through the **sigmoid** to get a probability.","3":"**How to read $\\sigma(z) = \\frac{1}{1+e^{-z}}$** — When $z$ is large and negative, $e^{-z}$ is large so $\\sigma(z) \\approx 0$. When $z=0$, $\\sigma(0)=0.5$. When $z$ is large and positive, $e^{-z} \\approx 0$ so $\\sigma(z) \\approx 1$. So any $z$ is squeezed into a probability in (0, 1)."},"whyImportant":{"0":"**Many real problems are yes/no** — Spam or not? Disease or not? Will the user buy? **Binary classification** is everywhere; logistic regression is the standard baseline.","1":"**Confidence as a number** — Saying \"pass with 98% probability\" is more useful than just \"pass\".
Logistic regression gives a **probability**, which supports better decisions.","2":"**Bridge to deep learning** — A single neuron in a neural network behaves much like logistic regression. Mastering this makes deep learning easier later."},"howUsed":{"0":"**Spam filter** — Compute \"probability this email is spam\" from features; if above a threshold, send to spam.","1":"**Medical AI** — From X-rays or lab values, predict \"probability of disease\" to support diagnosis.","2":"**Marketing and recommendations** — Predict \"will this user churn?\" or \"will they click?\" for targeting and ads."},"visual":"Visualization of sigmoid output and decision boundary.","problemSolving":{"0":"**Logistic regression** — Linear score $z$, then **sigmoid** $\\sigma(z)=1/(1+e^{-z})$ for probability. Default: predict 1 if $\\sigma(z)\\ge 0.5$; $z=0$ is the **decision boundary**.","1":"**Example (T/F)**\n\n\"When $z=0$, $\\sigma(z)=0.5$.\" Answer 1 if true, 0 if false.\n\nTrue. → **Answer 1**\n\n---\n\n**Example (decision)**\n\nIf $\\sigma(z)=0.7$, default class at threshold 0.5? ① 0 ② 1\n\n②. → **Answer ②**\n\n---\n\n**Example (sign)**\n\nIf $z>0$, usual $\\hat y$? ① 0 ② 1\n\n②. → **Answer ②**"}},"mlDecisionTree":{"chapter":"Chapter 07","title":"Decision Tree: Twenty Questions to the Answer","description":"A decision tree works like the game of **Twenty Questions**: ask yes/no questions, follow branches, and reach a prediction at a leaf. It is easy to interpret (you can see exactly why it made each decision) and is the building block for random forests and other ensemble methods.","sectionTitle":"Decision Tree: Twenty Questions to the Answer","whatIs":{"0":"**Basic structure** — Picture an upside-down tree. At the top is the **root node** (first question). From there you ask a condition (e.g. “Is feature $x_1 \\le 3$?”); **yes** and **no** lead to **internal nodes**. 
When you can’t split further, you reach a **leaf node** and output the **prediction** (class or value).","1":"**Same as Twenty Questions** — Just like guessing an animal by asking “Does it have four legs?” → “Is it a herbivore?” → “Tiger!”, the tree narrows down the answer step by step. Each question splits the data into two groups.","2":"**Good questions: reducing impurity** — **Impurity** measures how mixed the classes are at a node. We want splits that make nodes purer. Two common formulas: **Gini** $G = 1 - \\sum p_i^2$ and **Entropy** $H = -\\sum p_i \\log_2 p_i$. When one class has 100% ($p=1$), both are 0 (pure). When two classes are half-and-half, impurity is at its maximum ($G = 0.5$, $H = 1$).","3":"**Information gain** — **Information gain** = impurity before the split minus (weighted) impurity after. It measures how much a question “cleans up” the data. The tree chooses the question with the highest information gain at each step.","4":"**Prediction at the leaf** — At a **leaf**, we output: for **classification**, the **majority class** of the samples there; for **regression**, the **average** of their target values. For new data, we just follow the path and read off the leaf’s prediction.","5":"**Pruning** — A tree that is too deep **overfits** (memorizes the training set). **Pruning** cuts branches to limit depth and improve generalization. Decision trees are also the base models used in **random forest** and other ensembles."},"whyImportant":{"0":"**Explainable AI** — Unlike many black-box models, a decision tree shows the exact path of questions that led to each prediction (e.g. “age < 30 and income ≥ 30M → approve loan”). This is valued in finance and healthcare.","1":"**Nonlinear boundaries** — Linear models cut the space with a single line; a tree can approximate **step-like** boundaries by repeated splits, capturing more complex patterns.","2":"**Foundation for ensembles** — A single tree can be unstable, but hundreds of trees (e.g. **random forest**) form a strong, robust model.
This chapter (Ch07) is the basis for Ch09 Ensemble."},"howUsed":{"0":"**Credit and loans** — Questions like “Income ≥ 50M?” and “Any default in the last year?” form a path to approve or deny.","1":"**Medical decision support** — Patient data (blood pressure, cholesterol, etc.) is used in a sequence of questions to predict disease risk and support diagnosis.","2":"**Marketing (churn, purchase)** — “Registered > 6 months?”, “Logins in the last month ≤ 3?” help find at-risk customers for targeted campaigns."},"problemSolving":{"0":"**Decision trees** — Follow questions from the **root** to a **leaf**; classification leaves use **majority vote**, regression leaves use **mean**. **Gini** $G=1-\\sum p_i^2$, **entropy** $H=-\\sum p_i\\log_2 p_i$ measure impurity.","1":"**Example (path)**\n\nFollow the answer bits (1 = yes, 0 = no) from the root to a leaf and read that leaf’s prediction.\n\n---\n\n**Example (Gini)**\n\nIf one class is 100% ($p=1$), Gini $G=1-\\sum p_i^2$ equals?\n\n$0$. → **Answer 0**\n\n---\n\n**Example (leaf majority)**\n\nLeaf has two class-0 and five class-1 points: predicted class?\n\n$1$. → **Answer 1**"},"visual":"Visualization of decision-tree branching and prediction path."},"mlDecisionTreeProblemSolvingLabel":"Explanation for solving the problems","mlDecisionTreeVisualIntro":"From the root, follow branches by answering yes/no to each question; the leaf gives the prediction.","mlDecisionTreeVisualStep0":"① Root node — first question (e.g. is feature $x_1 \\le 3$?)","mlDecisionTreeVisualStep1":"② Move to left (0 = no) or right (1 = yes) child","mlDecisionTreeVisualStep2":"③ Repeat questions at internal nodes","mlDecisionTreeVisualStep3":"④ Leaf node — output prediction (class or value) with no further split","mlDecisionTreeVisualPathCaption0":"① Root node — ask the first question.
Follow branches by yes/no.","mlDecisionTreeVisualPathCaption1":"④ Follow path: Yes(1) → Leaf 0","mlDecisionTreeVisualPathCaption2":"⑤ Follow path: No(0) → Leaf 1","mlDecisionTreeVisualStep0Description":"① Root node — at the first question, branch by yes/no and go down the left or right branch.","mlDecisionTreeVisualLabelRoot":"Root","mlDecisionTreeVisualLabelYes":"Yes(1)","mlDecisionTreeVisualLabelNo":"No(0)","mlDecisionTreeVisualLabelQuestion":"Question","mlDecisionTreeVisualLabelLeaf0":"Leaf 0","mlDecisionTreeVisualLabelLeaf1":"Leaf 1","mlDecisionTreeVisualDiagramAriaLabel":"Decision tree structure: root — question — leaf","mlBoostingTrees":{"chapter":"Chapter 08","title":"XGBoost, LightGBM, CatBoost","description":"Compare the boosting trio and learn practical model-selection criteria.","sectionTitle":"XGBoost, LightGBM, CatBoost","whatIs":{"0":"**XGBoost** is a strong default booster that emphasizes accuracy with regularization and second-order (Hessian) information.","1":"**LightGBM** is optimized for speed on large data with leaf-wise growth and histogram-based split finding.","2":"**CatBoost** handles categorical features robustly with ordered encoding, reducing preprocessing burden."},"whyImportant":{"0":"All three are gradient boosting trees, but their trade-offs differ in **speed, stability, and categorical handling**.","1":"In practice, picking the model that matches **data size, feature types, and time budget** matters more than chasing one universal winner."},"howUsed":{"0":"For tabular classification/regression, teams often start with XGBoost, try LightGBM for large-scale speed, and prioritize CatBoost when categorical columns are heavy.","1":"Final selection is based on validation score, training time, and overfitting behavior together."},"problemSolving":{"0":"Solve **model-selection** items by matching data traits to each algorithm, solve **true/false** by checking core properties, and solve **basic numeric** items by reading given rounds/trees 
directly.","1":"**Example 1 (model selection)**\n\nIf categorical features are dominant and you want to avoid heavy one-hot encoding, which model is preferred first? ① XGBoost ② LightGBM ③ CatBoost\n\nCatBoost is usually the first choice here. → **Answer ③**\n\n---\n\n**Example 2 (model selection)**\n\nFor very large tabular data where training speed is critical, which model is often tried first? ① XGBoost ② LightGBM ③ CatBoost\n\nLightGBM is commonly prioritized for speed. → **Answer ②**\n\n---\n\n**Example 3 (true/false)**\n\n\"XGBoost is a boosting method that uses regularization and second-order information.\" Enter 1 if true, 0 if false.\n\nThis statement is true. → **Answer 1**"},"visual":"A comparison view for choosing among XGBoost, LightGBM, and CatBoost by accuracy, speed, and categorical-feature handling."},"mlEnsemble":{"chapter":"Chapter 09","title":"Ensemble and Random Forest: The Wisdom of the Crowd","description":"Ensemble methods combine predictions from multiple models to produce a single, often better prediction. This chapter explains bagging, boosting, stacking, and random forest—where many decision trees vote or average—so beginners can follow the idea of collective intelligence.","sectionTitle":"Ensemble and Random Forest: The Wisdom of the Crowd","whatIs":{"0":"**The core idea of ensemble: many hands make light work** — An ensemble builds a **team of multiple models** and combines their predictions to reach a final answer. Like a jury voting on a verdict, using many models instead of one sharply reduces the chance of wrong answers (variance) and makes predictions **more stable**. For classification we use **majority vote**; for regression we use the **average** of predictions.","1":"**Why are many better than one? (Wisdom of the crowd)** — If you ask 100 people to guess a cow's weight, individual guesses may be off, but the **average** of 100 guesses is often surprisingly close to the true weight. 
When models **independently** judge and we combine results, their random errors tend to cancel out and the **shared signal** remains.","2":"**Three main ensemble methods: Bagging, Boosting, Stacking** — (1) **Bagging**: Each model gets a different random (bootstrap) sample of the data (like different practice tests); then they vote. (2) **Boosting**: The next model focuses on what the previous one got wrong, learning **sequentially** from mistakes. (3) **Stacking**: A meta-model takes the reports of base models and makes the final decision.","3":"**Random Forest: a forest of diverse trees** — Bagging with **decision trees**: grow hundreds of trees. To keep them diverse, each tree is trained on its own bootstrap sample, and each split considers only a **random subset of features**. Some trees rely on \"age\", others on \"income\", increasing **diversity**.","4":"**Voting and averaging in a formula** — For classification, majority vote means \"the class that most trees chose\". For regression (e.g. house price), average all tree predictions: **$\\hat y = \\frac{1}{B}\\sum_{b=1}^B \\hat y_b$** where $B$ is the number of trees and $\\hat y_b$ is the $b$-th tree's prediction. (e.g. three trees predict 100, 150, 200 → final prediction 150)","5":"**OOB (Out-of-Bag) evaluation** — In bagging/random forest, each tree is trained on a random sample of the data. The **left-out samples (Out-of-Bag)** can be used to evaluate those trees that did not see them—like a built-in validation set without holding out separate test data."},"whyImportant":{"0":"**A stable forest that doesn't sway** — A single decision tree can change a lot when data changes slightly. A **forest** of hundreds of trees stays stable; a few wrong trees don't change the overall vote. This leads to strong, reliable performance in practice.","1":"**Natural extension of Ch07 Decision Tree** — The same tree structure (impurity, information gain) is reused.
You're not learning new rules—just how to **combine** many trees with voting, so the previous chapter's knowledge is fully used.","2":"**The go-to model in industry and competitions** — Random forest often works very well with little tuning, so it's many practitioners' first choice. It also provides **feature importance**, which helps explain which variables matter most."},"howUsed":{"0":"**General-purpose for business (classification and regression)** — From \"Is this email spam?\" to \"What will tomorrow's stock price be?\", ensembles are used across almost every business problem.","1":"**Finding what matters (feature importance)** — If trees in a loan model rely most on \"income\", that variable is the most important for the decision. This helps filter out unnecessary data.","2":"**Wide real-world use** — Fraud detection, recommendation systems (e.g. Netflix, YouTube), equipment failure prediction—wherever accuracy and stability matter."},"problemSolving":{"0":"**Ensembles / Random Forest** — Classification: **majority vote** across trees. Regression: **average** $\\hat y = \\frac{1}{B}\\sum_b \\hat y_b$. **OOB** samples estimate error without a separate validation set.","1":"**Example (majority vote)**\n\n3 votes for class 0, 5 for class 1: final class? ① 0 ② 1\n\n②. → **Answer ②**\n\n---\n\n**Example (regression mean)**\n\nThree trees predict 6, 9, 12: mean?\n\n$9$. → **Answer 9**\n\n---\n\n**Example (OOB)**\n\n10 trees total; a point appears in 6 bootstraps: OOB tree count?\n\n$4$. → **Answer 4**\n\n---\n\n**Example (formula mean)**\n\n$B=4$, sum of predictions = 20: mean?\n\n$5$. 
→ **Answer 5**"},"visual":"Visualization of ensemble voting/averaging for final prediction."},"mlEnsembleVisualIntro":"Combine predictions from multiple models (trees) by voting or averaging to get the final prediction.","mlEnsembleVisualStep0":"① Draw bootstrap samples from training data and train multiple trees","mlEnsembleVisualStep1":"② Each tree predicts independently","mlEnsembleVisualStep2":"③ Classification: majority vote; Regression: average → final prediction","mlEnsembleVisualStep3":"④ The final prediction is determined","mlEnsembleVisualLabelData":"Data","mlEnsembleVisualLabelVote":"Vote/Average","mlEnsembleVisualLabelPrediction":"Prediction","mlEnsembleVisualLabelTree1":"Tree 1","mlEnsembleVisualLabelTree2":"Tree 2","mlEnsembleVisualLabelTree3":"Tree 3","mlEnsembleVisualAriaLabel":"Ensemble flow: Data → Trees → Vote/Average → Prediction","mlKmeansProblemSolvingLabel":"Explanation for solving problems","mlKmeansVisualIntro":"Assign each point to the nearest center, then move centers to the mean of assigned points; repeat.","mlKmeansVisualStep0":"① Data — unlabeled points in feature space","mlKmeansVisualStep1":"② Initialize K centers — place K centroids","mlKmeansVisualStep2":"③ Assign — assign each point to the nearest center (by color)","mlKmeansVisualStep3":"④ Update centers — set each center to the mean of its assigned points","mlKmeansVisualStep4":"⑤ Repeat until assignment and centers no longer change","mlKmeansVisualCaption":"K-Means: repeat assign → update to minimize SSE (distortion).","mlKmeansVisualAriaLabel":"K-Means flow: Data → Initial centers → Assign → Update → Converge","mlKmeansVisualMeanLabel":"mean","mlKmeansVisualPointDataLabel":"Point: Data","mlKmeansVisualLineCaption":"Line: from each point to its assigned center (μ)","mlKmeansVisualCenterMoveCaption":"Centers move to cluster mean","mlCrossValidationProblemSolvingLabel":"Explanation for solving the problems","mlCrossValidationVisualIntro":"Split data into train/validation/test; in 
K-Fold, take turns validating and estimate performance by the mean score.","mlCrossValidationVisualTitle":"① 5-Fold","mlCrossValidationVisualFoldLabel":"Fold{n}","mlCrossValidationVisualTrainLabel":"Train","mlCrossValidationVisualValLabel":"Validation","mlCrossValidationVisualScoreLabel":"Validation score","mlCrossValidationVisualMeanLabel":"Mean μ","mlCrossValidationVisualStep0":"① Full data — sample set for training and validation","mlCrossValidationVisualStep1":"② Train/Val/Test split — train to learn, validate to tune, test for final evaluation","mlCrossValidationVisualStep2":"③ K-Fold — split into K parts, use one part as validation and the rest for training each time","mlCrossValidationVisualStep3":"④ Per-fold validation scores — get $S_1, S_2, \\ldots, S_K$ from each fold","mlCrossValidationVisualStep4":"⑤ Mean $\\bar{S} = \\frac{1}{K}\\sum_{k=1}^K S_k$ — final performance estimate","mlCrossValidationVisualCaption":"Cross validation: practice tests (validation) to estimate skill, final exam (test) to confirm.","mlCrossValidationVisualAriaLabel":"Cross validation flow: data → split → K-Fold → per-fold scores → mean","mlCrossValidationProblemPrompt":"Read the instruction below and enter your answer in the blank (?).","mlCrossValidationProblemPromptDefinition":"If the following statement is **true**, choose **True**; otherwise choose **False**.\n\n{statement}","mlCrossValidationProblemPromptDefinitionChoice":"Choose the option that best matches the question.\n\n{question}","mlCrossValidationProblemPromptHoldoutTrain":"With {n} samples and training ratio {trainRatio}, how many training samples? (integer)","mlCrossValidationProblemPromptHoldoutTest":"With {n} samples and training ratio {trainRatio}, how many test samples? (integer)","mlCrossValidationProblemPromptKfoldSize":"With {n} samples and {K}-Fold, what is the size of one fold (validation set)? (integer quotient)","mlCrossValidationProblemPromptKfoldScoreMean":"K-Fold validation scores (%) are {scores}. 
Find the mean (integer).","mlCrossValidationProblemPromptScenario":"Choose the most suitable method for the scenario.\n\n{scenario}","mlCrossValidationProblemPromptStratified":"Choose the option that best matches the question.\n\n{question}","mlCrossValidationStatement_0":"Cross validation estimates performance by splitting data into train/validation/test instead of scoring only on training data.","mlCrossValidationStatement_1":"The validation set is used like a practice test for hyperparameter selection or model comparison.","mlCrossValidationStatement_2":"In K-Fold, data is split into K parts and each part is used once as validation; the mean of validation scores is the final estimate.","mlCrossValidationStatement_3":"The test set is used only once for final performance reporting.","mlCrossValidationStatement_4":"Hold-out splits data once into train and validation (or train and test).","mlCrossValidationStatement_5":"Overfitting is suspected when training score is high but validation/test score is low.","mlCrossValidationStatement_6":"The training set is the data used to learn model weights and parameters.","mlCrossValidationStatement_7":"In K-Fold, one fold size is usually the integer quotient of n/K.","mlCrossValidationStatement_10":"It is fine to report final performance on the validation set after training on it.","mlCrossValidationStatement_11":"Hold-out always gives more stable estimates than K-Fold.","mlCrossValidationStatement_12":"The test set can be used multiple times to choose models.","mlCrossValidationStatement_13":"Performance measured only on training data gives an accurate picture of generalization.","mlCrossValidationStatement_14":"In K-Fold, a larger K means fewer validation runs.","mlCrossValidationQuestionChoice_0":"The main purpose of cross validation is? ① Estimate generalization ② Speed up training ③ Data augmentation","mlCrossValidationQuestionChoice_1":"When data is limited, which is more advantageous? 
① Hold-out ② K-Fold ③ Stratified only","mlCrossValidationQuestionChoice_2":"What corresponds to a practice test? ① Train ② Validation ③ Test","mlCrossValidationQuestionChoice_3":"Which keeps class proportions in each fold? ① Hold-out ② Plain K-Fold ③ Stratified K-Fold","mlCrossValidationQuestionChoice_4":"What corresponds to the final exam? ① Train ② Validation ③ Test","mlCrossValidationQuestionChoice_5":"Which set is used to choose hyperparameters? ① Train ② Validation ③ Test","mlCrossValidationQuestionChoice_6":"Which validates by using different splits multiple times? ① Hold-out ② K-Fold ③ Test only","mlCrossValidationQuestionChoice_7":"When might we suspect overfitting? ① High train and high validation ② High train and low validation ③ Low train and high validation","mlCrossValidationScenario_0":"We have 10,000 samples and want to evaluate once with a single split.","mlCrossValidationScenario_1":"We have only 500 samples and want a stable validation estimate by splitting multiple ways.","mlCrossValidationScenario_2":"We split once 80% train, 20% test and use the test set only once at the end.","mlCrossValidationScenario_3":"Classification with 90:10 class imbalance; we want each fold to preserve that ratio.","mlCrossValidationScenario_4":"We want to run validation 5 times and report the average accuracy.","mlCrossValidationScenario_5":"We split once 70:30 and use that split.","mlCrossValidationScenario_6":"We run K validation runs to reduce the variance of the estimate.","mlCrossValidationScenario_7":"Binary classification; we want to keep the positive rate in each fold.","mlCrossValidationStratified_0":"What is an advantage of Stratified K-Fold? ① Preserve class ratio ② Faster ③ Less memory","mlCrossValidationStratified_1":"For imbalanced classes in classification, what is recommended? ① Hold-out only ② Stratified K-Fold ③ Skip validation","mlCrossValidationStratified_2":"Stratified is mainly used for? 
① Regression only ② Classification (preserve class ratio) ③ Clustering","mlEvaluationProblemPrompt":"Read the instruction below and enter your answer in the blank (?).","mlEvaluationProblemSolvingLabel":"Explanation for solving the problems","mlEvaluationVisualIntro":"Fill the 2×2 confusion matrix with actual (rows) and predicted (columns), then compute accuracy, precision, recall, and F1.","mlEvaluationVisualStep0":"① Actual vs predicted — rows: actual pos/neg, columns: predicted pos/neg","mlEvaluationVisualStep1":"② Confusion matrix — fill the four cells TP, TN, FP, FN","mlEvaluationVisualStep2":"③ Accuracy — (TP+TN)/total, fraction correct","mlEvaluationVisualStep3":"④ Precision & recall — precision: TP/(TP+FP), recall: TP/(TP+FN)","mlEvaluationVisualStep4":"⑤ F1 — harmonic mean of precision and recall","mlEvaluationVisualCaption":"Read the model's report card via the confusion matrix and choose metrics that match your goal.","mlEvaluationVisualAriaLabel":"Classification evaluation: confusion matrix → accuracy, precision, recall, F1","mlEvaluationVisualMatrixTitle":"Confusion Matrix (2×2)","mlEvaluationVisualStepLineTP":"Actual positive · Predicted positive → TP","mlEvaluationVisualStepLineFN":"Actual positive · Predicted negative → FN","mlEvaluationVisualStepLineFP":"Actual negative · Predicted positive → FP","mlEvaluationVisualStepLineTN":"Actual negative · Predicted negative → TN","mlEvaluationVisualPredPos":"Predicted positive","mlEvaluationVisualPredNeg":"Predicted negative","mlEvaluationVisualActualPos":"Actual positive","mlEvaluationVisualActualNeg":"Actual negative","mlEvaluationVisualBadgeTP":"True positive ✓","mlEvaluationVisualBadgeFN":"False negative (actual pos → predicted neg)","mlEvaluationVisualBadgeFP":"False positive (actual neg → predicted pos)","mlEvaluationVisualBadgeTN":"True negative ✓","mlEvaluationVisualBadgeFixed":"After distinguishing TP, FN, FP, TN, compute accuracy, precision, recall, and 
F1.","mlEvaluationProblemPromptDefinition":"If the following statement is **true**, choose **True**; otherwise choose **False**.\n\n{statement}","mlEvaluationProblemPromptDefinitionChoice":"Choose the option that best matches the question.\n\n{question}","mlEvaluationProblemPromptScenario":"Choose the most suitable option for the scenario.\n\n{scenario}","mlEvaluationProblemPromptConfusionCount":"With TP={tp}, TN={tn}, FP={fp}, FN={fn} in the confusion matrix, what is the value (integer) of {cell}?","mlEvaluationProblemPromptTotalCount":"With TP={tp}, TN={tn}, FP={fp}, FN={fn}, what is the total count n (integer)?","mlEvaluationProblemPromptAccuracy":"With TP={tp}, TN={tn}, FP={fp}, FN={fn}, what is accuracy (%) (integer)?","mlEvaluationProblemPromptPrecision":"With TP={tp}, TN={tn}, FP={fp}, FN={fn}, what is precision (%) (integer)?","mlEvaluationProblemPromptRecall":"With TP={tp}, TN={tn}, FP={fp}, FN={fn}, what is recall (%) (integer)?","mlEvaluationProblemPromptF1":"With TP={tp}, TN={tn}, FP={fp}, FN={fn}, what is F1 score (%) (integer)?","mlEvaluationStatement_0":"The confusion matrix is a 2×2 table of actual class (rows) and predicted class (columns).","mlEvaluationStatement_1":"Accuracy is (TP+TN) divided by total count.","mlEvaluationStatement_2":"The denominator of precision is TP+FP.","mlEvaluationStatement_3":"The denominator of recall is TP+FN.","mlEvaluationStatement_4":"F1 is the harmonic mean of precision and recall.","mlEvaluationStatement_5":"TP is the count of actual positive and predicted positive.","mlEvaluationStatement_6":"FN is actual positive but predicted negative (miss).","mlEvaluationStatement_7":"With imbalanced data, accuracy alone can be misleading.","mlEvaluationStatement_10":"Precision and recall are always equal.","mlEvaluationStatement_11":"High accuracy always means the model is suitable for production.","mlEvaluationStatement_12":"FP is actual positive but predicted negative.","mlEvaluationStatement_13":"The denominator of recall 
is TP+FP.","mlEvaluationStatement_14":"TN is actual positive and predicted positive.","mlEvaluationQuestionChoice_0":"The numerator of accuracy is? ① TP+TN ② TP+FP ③ TP+FN","mlEvaluationQuestionChoice_1":"The denominator of precision is? ① TP+FN ② TP+FP ③ TN+FN","mlEvaluationQuestionChoice_2":"When is recall important? ① Allowing spam as normal ② When we must not miss disease ③ Minimizing false alarms","mlEvaluationQuestionChoice_3":"F1 is the harmonic mean of? ① Accuracy and precision ② Precision and recall ③ Recall and accuracy","mlEvaluationQuestionChoice_4":"TP means? ① Actual pos, predicted pos ② Actual neg, predicted pos ③ Actual pos, predicted neg","mlEvaluationQuestionChoice_5":"False positive is? ① FP ② FN ③ TN","mlEvaluationQuestionChoice_6":"False negative is? ① FP ② FN ③ Precision","mlEvaluationQuestionChoice_7":"Total count n is? ① TP+TN ② TP+TN+FP+FN ③ TP+FP+FN","mlEvaluationScenario_0":"We must not miss spam (some false positives acceptable). Important metric? ① Recall ② Precision ③ Accuracy","mlEvaluationScenario_1":"In medical diagnosis, we must not say 'no disease' when there is. Important metric? ① Accuracy ② Recall ③ Precision","mlEvaluationScenario_2":"In ad click prediction, we want to raise 'fraction of predicted clicks that are real'. Important metric? ① Recall ② Precision ③ F1","mlEvaluationScenario_3":"In fraud detection we must not miss fraud. Important metric? ① Precision ② Recall ③ Accuracy","mlEvaluationScenario_4":"To balance precision and recall we use? ① Accuracy ② F1 ③ TP","mlEvaluationScenario_5":"When classes are 99:1 imbalanced, accuracy alone? ① Is reliable ② Can be misleading ③ Equals F1","mlEvaluationScenario_6":"The metric closest to 'fraction of relevant docs in top 10' is? ① Recall ② Precision ③ FN","mlEvaluationScenario_7":"The metric for 'fraction of actual positives that the model got right' is? 
① Precision ② Recall ③ Accuracy","mlKmeans":{"chapter":"Chapter 10","title":"K-Means Clustering: Grouping Without Labels","description":"K-Means is a classic **unsupervised learning** algorithm that groups data into K clusters using **distance**—no labels. You will see how the 'unsupervised' idea from Ch01 works in practice: concept → intuition → math → application. It reuses the distance formula from Ch02 (KNN) and shows how repeating 'assign to nearest center' and 'update centers' yields clear clusters.","sectionTitle":"K-Means Clustering: Grouping Without Labels","whatIs":{"0":"**What is K-Means?** — With no labels $y$, only data $\\mathbf{x}_1, \\mathbf{x}_2, \\ldots$, K-Means partitions points into **K groups** by **nearest centroid**. Distance is **Euclidean** $d(\\mathbf{x}, \\boldsymbol{\\mu}) = \\sqrt{\\sum_j (x_j - \\mu_j)^2}$ (as in Ch02). Each group has one **centroid** $\\boldsymbol{\\mu}_k$. The algorithm alternates: assign each point to the nearest center → set each center to the mean of its assigned points, until convergence.","1":"**K is the number of clusters** — The user chooses **K** (e.g. K=2 → two groups). There are no 'correct' labels, only a partition. In practice, K is chosen by domain knowledge, the elbow method, or silhouette scores.","2":"**Objective: minimize SSE (distortion)** — K-Means minimizes $J = \\sum_{k=1}^K \\sum_{i \\in C_k} \\|\\mathbf{x}_i - \\boldsymbol{\\mu}_k\\|^2$. The update $\\boldsymbol{\\mu}_k = \\frac{1}{|C_k|}\\sum_{i \\in C_k} \\mathbf{x}_i$ (mean of assigned points) reduces each cluster's SSE.","3":"**If the formulas feel heavy** — The distance formula is just 'length between a point and a center.' SSE $J$ is a single number for 'how tightly points sit around their center'; the algorithm moves centers to make $J$ smaller. The centroid update is literally 'average of the coordinates of points in that cluster.' 
The **Formula guide** below spells out each symbol step by step."},"whyImportant":{"0":"**Ch01 unsupervised learning in action** — K-Means is the go-to when you have no labels and want structure (e.g. customer segmentation, clustering documents or images, preprocessing for anomaly detection).","1":"**Customer segmentation** — With only purchase history and no segment labels, K-Means groups similar customers; people then attach meaning (e.g. VIP, churn risk) to each cluster and use it for downstream tasks (Ch09, Ch12).","2":"**Simple and interpretable** — Assign (nearest center) and update (mean) are easy to implement and visualize in 2D."},"howUsed":{"0":"**Clustering** — Customer segmentation, topic/document grouping, image color compression, gene expression groups.","1":"**Preprocessing** — Use cluster index as a new feature for supervised models, or keep only centroids to reduce data size.","2":"**Choosing K** — The user sets K; compare SSE or silhouette across K to pick a value (e.g. elbow)."},"problemSolving":{"0":"**K-Means** — With no labels, place **K centroids**, **assign** each point to the nearest center, **update** centers as cluster means, repeat. Minimize SSE $J = \\sum_{k}\\sum_{i \\in C_k} \\|\\mathbf{x}_i - \\boldsymbol{\\mu}_k\\|^2$; update $\\boldsymbol{\\mu}_k = \\frac{1}{|C_k|}\\sum_{i \\in C_k} \\mathbf{x}_i$.","1":"**Example (key terms)**\n\n- **Distance²** — $(x_2-x_1)^2+(y_2-y_1)^2$; compare without sqrt if needed\n- **Assign** — Smallest distance → cluster index\n- **Center update** — Mean of coordinates in the cluster\n- **SSE** — Sum of squared distances to centers; smaller is tighter\n\n---\n\n**Example (assign)**\n\nCenters $(0,0)$ and $(4,0)$; point $(2,0)$. Tie → cluster 1. → **Answer 1**\n\n---\n\n**Example (center update)**\n\nPoints $(1,2)$ and $(3,4)$ only: new $\\bar{x}$?\n\n$2$. → **Answer 2**\n\n---\n\n**Example (distance²)**\n\nPoint $(1,2)$, center $(4,6)$: squared distance?\n\n$25$. 
→ **Answer 25**\n\n---\n\n**Example (SSE idea)**\n\nAs $J$ gets smaller, clusters are? ① more spread ② tighter\n\n②. → **Answer ②**"},"visual":"Visualization of iterative assignment and centroid updates in K-Means."},"mlCrossValidation":{"chapter":"Chapter 11","title":"Cross Validation: Practice Tests and the Real Exam","description":"Cross validation is essential so that models do not become \"frogs in a well\"—only good at the exercises they memorized. Just as students use **practice tests** to check their real level and the **final exam** to confirm it, we do not score machine learning models only on **training data**; we evaluate them on **validation** and **test** data they have not seen. This chapter covers **cross validation** (Hold-out, K-Fold, etc.) and how to make performance estimates reliable.","sectionTitle":"Cross Validation: Practice Tests and the Real Exam","whatIs":{"0":"**What is cross validation? \"Don’t score with the same problems they practiced\"** — If a math exam contained only problems from the workbook, we could not tell whether students understood the ideas or had **overfit** by memorizing answers. The same holds for ML: testing on training data always looks good. So we split data into **train**, **validation**, and **test**, and evaluate the model strictly and fairly on data it has never seen. That process is cross validation.","1":"**Three roles when splitting data** — The ideal split and role of each part are as follows.\n\n- **Training (Train)** — Metaphor: textbook / practice set. Main data used to learn patterns and update weights. Typical ratio: ~70–80%.\n- **Validation** — Metaphor: practice exam. Used mid-learning to check performance and tune hyperparameters. Typical ratio: ~10–15%.\n- **Test** — Metaphor: final exam. Used **only once** after all learning to report final performance. Typical ratio: ~10–15%.","2":"**How to split? Hold-out and K-Fold** — There are two main approaches. 
**Hold-out** is like cutting a pizza once: you split the data once into train and test. It is simple and fast, but if by chance the \"easy\" part ends up in the test set, the estimate can be overly optimistic. **K-Fold cross validation** divides data into K segments and uses each in turn as the \"practice exam\" (validation) and the rest for training, so every sample is validated once and the estimate is more stable and objective.","3":"**K-Fold final score in a formula** — After K-Fold you have K \"exam\" scores. The model’s final performance is the average of these K scores.\n\n* **Mean score formula:** $\\bar{S} = \\frac{1}{K}\\sum_{k=1}^K S_k$\n\n* **Symbols:** $K$ = number of folds (number of validation runs), $S_k$ = score when the $k$-th fold was used for validation (e.g. accuracy or MSE). $\\sum_{k=1}^K S_k$ means $S_1 + S_2 + \\cdots + S_K$, so $\\bar{S}$ is the **mean of the K validation scores** and is used as the final performance estimate.\n\n* **Numeric example:** With 5-Fold, if the five scores are 80, 85, 90, 80, 85, then $\\bar{S} = (80+85+90+80+85)/5 = 84$."},"whyImportant":{"0":"**Escaping the \"frog in a well\" (detecting overfitting)** — If the model scores 99 on training data but 50 on unseen validation data, it is almost certainly **overfitting** (memorizing rather than understanding). Cross validation acts as a filter to catch such models before they fail in production.","1":"**Proving real-world performance (generalization)** — Companies adopt AI to predict the future, not to replay the past. Models validated with K-Fold and a held-out test set are more likely to perform well on truly new data.","2":"**Finding the best setup (hyperparameters and model choice)** — When choosing tree depth, K in K-NN, learning rate, etc., we run multiple settings on the validation set and pick the best. 
Because the test set is kept separate, we can compare models fairly."},"howUsed":{"0":"**Data scientist routine (production pipeline)** — In practice, the first step is to set aside about 10% of the data as the **test set** and lock it away. The rest is used for training and K-Fold validation until the best model is ready; then the test set is used once to report: \"Our model’s final accuracy is 92%.\"","1":"**Fair algorithm comparison** — When asking \"Is logistic regression or random forest better for our churn prediction?\", the same K-Fold setup is applied to both; the algorithm with the higher mean validation score ($\\bar{S}$) is chosen for deployment."},"problemSolving":{"0":"**Summary** — Cross validation starts from the premise that we must not measure performance only on the data used for training. Just as students take practice tests before the real exam, in machine learning we cannot tell if the model has \"memorized the exercises\" if we score only on **training data**. So we split data into **train**, **validation**, and **test**. The **training** set is used for the model to learn patterns; the **validation** set is used to check performance during learning or to choose hyperparameters; the **test** set is used **only once** after all learning to report final performance before deployment. The main split strategies are **Hold-out** and **K-Fold**. Hold-out splits the data once into train and test (or validation). K-Fold divides data into K segments, uses one segment at a time for validation and the rest for training. 
With K-Fold every sample is used for validation once, so the performance estimate is more stable than with a single split.","1":"**Example (terms & formulas)**\n\n- **Train size** — $n \\times (\\text{ratio}/100)$ etc.\n- **Test size** — $n - \\text{train}$\n- **Fold size** — $\\lfloor n/K \\rfloor$\n- **K-Fold mean** — $(S_1+\\cdots+S_K)/K$\n- **Stratified** — Class ratio per fold\n\n---\n\n**Example (T/F)**\n\n\"You may reuse the test set many times.\" 1 true, 0 false.\n\nUsually test once for final report. → **Answer 0**\n\n---\n\n**Example (Hold-out train)**\n\n100 samples, 80% train → train count?\n\n$80$. → **Answer 80**\n\n---\n\n**Example (Hold-out test)**\n\nSame setup → test count?\n\n$20$. → **Answer 20**\n\n---\n\n**Example (K-Fold size)**\n\n100 samples, 5-Fold → one fold size?\n\n$20$. → **Answer 20**\n\n---\n\n**Example (K-Fold mean)**\n\nScores 80,80,90,80,90 → mean?\n\n$84$. → **Answer 84**\n\n---\n\n**Example (Stratified)**\n\nStratified K-Fold keeps class ratio per fold? ① yes ② no only random\n\n①. → **Answer 1**"},"visual":"Visualization of data splitting and K-Fold evaluation flow."},"mlEvaluation":{"chapter":"Chapter 12","title":"Classification Metrics: The Model's Detailed Report Card","description":"Learn the **'detailed report card'** that a classification AI model receives after its test. Beyond \"how many did you get right?\" (accuracy), we look at **confusion matrix** concepts that ask \"which questions did you get wrong, and how?\" In business settings where *how* the model is wrong can be critical—spam filters, cancer diagnosis AI—we explain how **precision, recall, and F1** prove the model's real capability, with intuitive analogies.","sectionTitle":"Classification metrics: confusion matrix and the model's report card","whatIs":{"0":"**What is the confusion matrix? 
The AI's detailed report card** — Just as knowing only \"how many correct\" on an exam doesn't tell you whether a student is good at math or English, we need more for a classifier. The **confusion matrix** is a 2×2 table that compares the model's **predictions (columns)** with **actual answers (rows)**. By reading the four cells, you can see what the model gets right and where it gets confused and stumbles.","1":"**The four cells: TP, TN, FP, FN** — Think of the famous \"boy who cried wolf.\" Here 'positive' means the boy cries wolf; 'negative' means peace.\n* **TP (True Positive):** Wolf really came (1), boy cried wolf (1). Best outcome—village saved.\n* **TN (True Negative):** No wolf (0), boy stayed quiet (0). Peace.\n* **FP (False Positive):** No wolf (0), boy cried wolf (1). Villagers run out with pitchforks for nothing (false alarm).\n* **FN (False Negative):** Wolf came (1), boy was asleep (0). Sheep get eaten—worst outcome (miss).\n* Total count $n = \\mathrm{TP} + \\mathrm{TN} + \\mathrm{FP} + \\mathrm{FN}$.","2":"**Accuracy's dangerous trap** — It is the fraction of correct answers: $\\text{Accuracy} = \\frac{\\mathrm{TP}+\\mathrm{TN}}{n}$. Intuitive but treacherous. Suppose 99 out of 100 days are peaceful and the wolf comes only once. A robot that closes its eyes and always says \"No wolf!\" still gets 99% accuracy. When positive cases are rare (imbalanced data), you must not trust accuracy alone.","3":"**Precision and recall: two rabbits to chase** —\n* **Precision (caution):** \"When I cried wolf, how often was it really the wolf?\" The share of **predicted positives** that are **truly positive**. $\\text{Precision} = \\frac{\\mathrm{TP}}{\\mathrm{TP}+\\mathrm{FP}}$. It goes up when you avoid false alarms (FP).\n* **Recall (sensitivity):** \"Of all the times the wolf actually came, how often did I notice and warn?\" The share of **actual positives** that the model **got right**. $\\text{Recall} = \\frac{\\mathrm{TP}}{\\mathrm{TP}+\\mathrm{FN}}$. 
It goes up when you miss fewer true wolves (FN).","4":"**F1 score: the golden balance of precision and recall** — Precision and recall are like a seesaw: pushing one up often pushes the other down. **F1** summarizes both in one number using the **harmonic mean**: $\\text{F1} = \\frac{2 \\cdot \\mathrm{TP}}{2\\cdot\\mathrm{TP}+\\mathrm{FP}+\\mathrm{FN}}$. If either precision or recall is poor, F1 tanks. Use F1 when you want a model with good balance.","5":"**AUC (Area Under the ROC Curve): the model's ranker** — When the model outputs a probability (e.g. \"90% chance of wolf\") rather than a bare yes/no, **AUC** measures how well **actual positives** are scored above **actual negatives** (discriminative power), on a 0–1 scale: it is the probability that a randomly chosen positive gets a higher score than a randomly chosen negative. 1 = perfect ranking; 0.5 = coin flip. Very useful to compare models before choosing a threshold."},"whyImportant":{"0":"**Don't fall for 99% accuracy** — Imagine a credit-card fraud detector: 1 fraudulent transaction in 100,000. A model that does nothing and always says \"all normal\" still has 99.999% accuracy—but 0% recall (catches no fraud). You must open the **confusion matrix** and inspect **precision** and **recall** to see if the model is doing its job or gaming the numbers.","1":"**In practice, it's a fierce trade-off: which mistake can you live with?** — The metric you bet on depends on the business.\n* **Recall (don't miss) is life:** Cancer screening. Better to have healthy people get extra tests (FP) than to miss a real case (FN) and delay treatment.\n* **Precision (fewer false alarms) is life:** Spam filter. Missing a few spams (FN) is fine—delete and move on. 
Misclassifying the boss's email as spam (FP) can be career-threatening."},"howUsed":{"0":"**Final pass/fail for AI services (binary classification)** — COVID-19 positive/negative, YouTube harmful-video block/allow, bank loan approve/reject: before deployment, real-world projects draw the confusion matrix and review precision, recall, and F1.","1":"**Tuning alarm sensitivity (threshold tuning)** — Models usually output a probability. \"At what % do we sound the alarm?\" Adjusting this threshold tailors the model to the business: e.g. lower threshold for maximum recall (security-critical), higher for maximum precision (when too many false alarms annoy users)."},"problemSolving":{"0":"**Confusion matrix & metrics** — Count **TP/TN/FP/FN**; $n=\\mathrm{TP}+\\mathrm{TN}+\\mathrm{FP}+\\mathrm{FN}$. **Accuracy** $(\\mathrm{TP}+\\mathrm{TN})/n$, **precision** $\\mathrm{TP}/(\\mathrm{TP}+\\mathrm{FP})$, **recall** $\\mathrm{TP}/(\\mathrm{TP}+\\mathrm{FN})$, **F1** harmonic mean. On imbalance, use precision/recall, not accuracy alone.","1":"**Example (metrics)**\n\n- **Accuracy (%)** — $100(\\mathrm{TP}+\\mathrm{TN})/n$\n- **Precision (%)** — $100\\,\\mathrm{TP}/(\\mathrm{TP}+\\mathrm{FP})$\n- **Recall (%)** — $100\\,\\mathrm{TP}/(\\mathrm{TP}+\\mathrm{FN})$\n- **F1 (%)** — $100\\cdot 2\\mathrm{TP}/(2\\mathrm{TP}+\\mathrm{FP}+\\mathrm{FN})$\n\n---\n\n**Example (accuracy)**\n\nTP=10, TN=70, FP=10, FN=10 → accuracy (%)?\n\n$80$. → **Answer 80**\n\n---\n\n**Example (precision)**\n\nTP=10, FP=10 → precision (%)?\n\n$50$. → **Answer 50**\n\n---\n\n**Example (recall)**\n\nTP=10, FN=10 → recall (%)?\n\n$50$. → **Answer 50**\n\n---\n\n**Example (F1)**\n\nTP=10, FP=10, FN=10 → F1 (%)?\n\n$50$. 
→ **Answer 50**"},"visual":"Visualization of confusion matrix and metric calculations."},"mlRegularizationProblemPrompt":"Read the problem and choose the correct option below.","mlRegularizationProblemSolvingLabel":"Explanation for problem solving","mlRegularizationVisualIntro":"We add a penalty for the model becoming too complex, not just for data error, so the model generalizes instead of memorizing.","mlRegularizationVisualVs":"VS","mlRegularizationVisualLabelNoReg":"No regularization","mlRegularizationVisualLabelWithReg":"With regularization","mlRegularizationVisualLabelOverfit":"Overfitting","mlRegularizationVisualLabelGeneral":"Generalization","mlRegularizationVisualStep0":"① No regularization — minimizing only training loss leads to **overfitting**","mlRegularizationVisualStep1":"② Add regularization — Loss = data loss + λ × penalty; **larger λ shrinks weights**","mlRegularizationVisualStep2":"③ L2 — **penalty $\\sum w_j^2$ keeps weights small**","mlRegularizationVisualStep3":"④ L1 — **penalty $\\sum |w_j|$ drives some weights to zero (sparse)**","mlRegularizationVisualStep4":"⑤ Generalization — a suitable λ gives **good performance on both train and validation**","mlRegularizationVisualCaption":"Regularization: loss + λ·penalty to reduce overfitting and improve generalization.","mlRegularizationVisualAriaLabel":"Regularization flow: overfitting → loss+penalty → L1/L2 → generalization","mlRegularization":{"chapter":"Chapter 13","title":"Regularization: Beyond Rote Memorization","description":"It is the key technique that keeps ML models from becoming **'rote memorizers'** that only memorize answers from the workbook. Fitting the training data too tightly means the model flounders when faced with slightly different new problems—this is **overfitting**. **Regularization** reduces the model's data error while imposing a **penalty (cost)** so the model does not become overly complex or forced. 
In this way, the model prunes the twigs and learns only the essential patterns, becoming strong in real-world **generalization**.","sectionTitle":"Regularization: Beyond Rote Memorization","whatIs":{"0":"**What is regularization? A 'penalty' for complexity**\n\nWhen a model tries to fit every small noise or exception in the training data, its formula becomes wiggly and needlessly complex. Regularization computes the model's **total loss** not only by \"how wrong the predictions are\" but also by **\"how complex the model is (size of weights)\"** and adds a penalty. To avoid that penalty, the model naturally stays simpler and cleaner.","1":"**Intuitive analogy: crammer vs principle-seeking student**\n\nA crammer who memorizes the workbook digit by digit gets 100 on practice tests but fails the real exam (new data). A student who understands principles may get a few practice problems wrong but scores steadily on the real exam. Regularization acts like a teacher, forcing the model to **\"prune the twigs (excessive weights) and focus on the main stem (core pattern)\"** so it becomes robust in practice.","2":"**Math: two 'magic' formulas (L1 and L2)**\n\nRegularization is divided into two main types by how it penalizes the model.\n\n- **L2 (Ridge)**: Uses the **square** of weights as the penalty. The objective is $J = \\text{MSE} + \\lambda \\sum_{j} w_j^2$. It smoothly pushes all weights down so they do not grow too large.\n- **L1 (Lasso)**: Uses the **absolute value** of weights as the penalty. The objective is $J = \\text{MSE} + \\lambda \\sum_{j} |w_j|$. It can drive less important weights **exactly to zero**, leaving only the key features (sparsity).","3":"**Real-world examples: spam filtering and medical diagnosis**\n\nIn spam filtering, giving high weight to a common word that happened to appear in training spam (e.g. \"hello\") can wrongly filter normal mail. Regularization prevents the model from obsessing over a single word (exploding weights). 
In medical diagnosis, it helps the AI avoid latching onto meaningless details like \"gown color\" among many patient features.","4":"**Reading the formulas: a beginner's dissection**\n\n- **Total loss (L2 example)**: $J = \\text{MSE} + \\lambda \\sum_{j} w_j^2$\n - **$J$**: The **\"final report card\"** we want to make as small as possible (minimize). The smaller, the better the model.\n - **$\\text{MSE}$**: The **\"error score\"** showing how much predictions differ from the true answers.\n - **$\\lambda$ (lambda)**: The **\"strength of the penalty\"** we set by hand. Larger $\\lambda$ acts like a strict teacher and heavily penalizes complex models; smaller $\\lambda$ barely penalizes.\n - **$\\sum_{j} w_j^2$ (L2 penalty)**: Sum of **squares** of all weights. If any weight grows, this sum grows and $J$ increases, so the model tries to keep weights small.\n\n- **L1 penalty ($\\lambda \\sum_{j} |w_j|$)**\n - Where L2 uses squares, L1 uses **absolute values ($|w_j|$)**. L1 is like a strict tidier: it mercilessly zeros out useless weights."},"whyImportant":{"0":"**Because real-world (generalization) performance is the true goal**\n\nThe real value of ML shows not during practice but when the model meets **unseen (test) data**. With regularization, accuracy on the training set may drop a bit, but accuracy in the wild goes up. This ability to handle unknown data well is called **generalization**.","1":"**The art of balance: bias–variance tradeoff**\n\nIf the model is too simple, **bias (underfitting)** grows and it cannot solve the problem. If it is too complex, **variance (overfitting)** grows and it memorizes noise. The two are like a seesaw: when one goes down, the other goes up. Tuning the regularization strength $\\lambda$ is the process of finding the **level (sweet spot)** of that seesaw.","2":"**The human role: finding $\\lambda$ (the hyperparameter)**\n\n$\\lambda$ is not learned by the model; it is a **dial (hyperparameter)** we must set. 
Turn the dial too hard and the model becomes underpowered; too soft and it becomes a memorizer again. So we must try many $\\lambda$ values and choose the one that gives the best real-world performance."},"howUsed":{"0":"**Adding wings to basic models (Ridge & Lasso)**\n\nWe simply add the L1 or L2 penalty to the usual **linear regression** or **logistic regression** formula.\n\n- Linear regression + L2 = **Ridge regression**\n- Linear regression + L1 = **Lasso regression**\n\nThe computer then minimizes the total loss (including the penalty) via gradient descent and adjusts the weights automatically.","1":"**A 3-step pipeline in practice**\n\nIn practice, regularization is applied as follows.\n\n**1. Split the data**: Divide data into [train / validation / test].\n\n**2. Run a $\\lambda$ audition**: Try $\\lambda$ values such as 0.01, 0.1, 1, 10, and train multiple models on the training set.\n\n**3. Pick the winner and deploy**: Test on the validation set and choose the $\\lambda$ with the best score as the final model. Then evaluate **once** on the test set for the final performance."},"problemSolving":{"0":"**Regularization** — Add **data loss** + **λ×penalty** to shrink weights and reduce **overfitting**. **L2 (Ridge)** uses $\\sum w_j^2$; **L1 (Lasso)** uses $\\sum|w_j|$ for sparsity. λ is a **hyperparameter**.","1":"**Example (formulas)**\n\n- **L2** — $w=(2,3,1)$ → $14$\n- **Total loss** — MSE=20, λ=2, pen=5 → $J=30$\n- **L1** — $w=(2,-3,1)$ → $6$\n\n---\n\n**Example (definition)**\n\nMain goal of regularization? ① reduce overfitting ② speed only\n\n①. → **Answer 1**\n\n---\n\n**Example (T/F)**\n\n\"Regularization only minimizes training error.\" 1 true, 0 false.\n\n0 — penalty term matters. → **Answer 0**\n\n---\n\n**Example (λ)**\n\nIn $J=\\text{MSE}+\\lambda\\cdot(\\text{penalty})$, λ is? ① strength ② learning rate\n\n①. → **Answer 1**\n\n---\n\n**Example (L2)**\n\n$w=(2,3,1)$: $\\sum w_j^2$?\n\n$14$. 
→ **Answer 14**\n\n---\n\n**Example (total loss)**\n\nMSE=20, λ=2, L2 penalty=5 → $J$?\n\n$30$. → **Answer 30**\n\n---\n\n**Example (L1)**\n\n$w=(2,-3,1)$: $\\sum|w_j|$?\n\n$6$. → **Answer 6**\n\n---\n\n**Example (L1 vs L2)**\n\nWhich tends to set some weights exactly to zero? ① L1 ② L2\n\n①. → **Answer 1**"},"visual":"Visualization of reducing overfitting with regularization.","problems":{"definition_0":"The main purpose of regularization is? ① Reduce overfitting ② Speed up training ③ Data augmentation","definition_1":"Adding a penalty on weights to keep the model simple is? ① Regularization ② Normalization ③ Ensemble","definition_2":"Adding λ·(penalty) to the loss to reduce overfitting is? ① Regularization ② Gradient descent ③ K-Fold","definition_3":"In L2 regularization the penalty term is? ① $\\sum w_j$ ② $\\sum w_j^2$ ③ $\\sum |w_j|$","definition_4":"In L1 regularization the penalty term is? ① $\\sum w_j$ ② $\\sum w_j^2$ ③ $\\sum |w_j|$","definition_5":"As λ increases, the model becomes? ① More complex ② Simpler ③ Unchanged","definition_6":"Which regularization makes some weights exactly zero (sparse)? ① L1 ② L2 ③ Both","definition_7":"Which keeps weights small but rarely exactly zero? ① L1 ② L2 ③ Both","definition_8":"Ridge regression uses which regularization? ① L1 ② L2 ③ None","definition_9":"Lasso regression uses which regularization? ① L1 ② L2 ③ None","definition_10":"Elastic Net uses which regularization? ① L1 only ② L2 only ③ L1 and L2","trueFalse_0":"With regularization, training error may increase but generalization can improve. 1 if true, 0 if false.","trueFalse_1":"If λ=0 there is no regularization; large λ increases penalty and shrinks weights. 1 if true, 0 if false.","trueFalse_2":"L2 penalty is the sum of absolute values of weights. 1 if true, 0 if false.","trueFalse_3":"L1 tends to set some weights exactly to zero. 1 if true, 0 if false.","trueFalse_4":"λ is usually chosen by cross-validation. 
1 if true, 0 if false.","trueFalse_5":"When overfitting, increasing λ can help. 1 if true, 0 if false.","trueFalse_6":"Minimizing only training loss always gives good validation performance. 1 if true, 0 if false.","trueFalse_7":"Total loss = data loss + λ×penalty is the basic form of regularization. 1 if true, 0 if false.","trueFalse_8":"With L2, more weights become zero than with L1. 1 if true, 0 if false.","choice_0":"In J = MSE + λ·(penalty), λ is? ① Regularization strength ② Learning rate ③ Batch size","choice_1":"If L2 penalty $\\sum w_j^2$ is large, the model is? ① More complex ② Large weights ③ Penalty is large; weights are shrunk by training","choice_2":"Ridge and Lasso both? ① Use only L1 ② Penalize weights ③ Classification only","choice_3":"Without regularization (λ=0), we often get? ① Underfitting ② Overfitting ③ No learning","choice_4":"To choose λ we compare? ① Training loss only ② Validation (or CV) performance ③ Test repeatedly","choice_5":"When λ=0 in $\\lambda \\sum w_j^2$? ① No regularization ② Maximum regularization ③ Same as L1","l2Penalty_0":"Weights $w_1=1$, $w_2=2$, $w_3=2$. L2 penalty $\\sum_j w_j^2$ (integer)?","l2Penalty_1":"Weights $w_1=0$, $w_2=3$, $w_3=4$. L2 penalty $\\sum_j w_j^2$ (integer)?","l2Penalty_2":"Weights $w_1=2$, $w_2=2$. L2 penalty $w_1^2+w_2^2$ (integer)?","l2Penalty_3":"Weights $w_1=1$, $w_2=1$, $w_3=1$, $w_4=1$. L2 penalty $\\sum_j w_j^2$ (integer)?","l2Penalty_4":"Weights $w_1=3$, $w_2=4$. L2 penalty (integer)?","totalLoss_0":"MSE=10, λ=1, L2 penalty=6. Total loss J=MSE+λ·(penalty) (integer)?","totalLoss_1":"MSE=16, λ=2, L2 penalty=5. J (integer)?","totalLoss_2":"MSE=8, λ=4, penalty=2. J (integer)?","totalLoss_3":"MSE=12, λ=3, penalty=4. J=MSE+λ·penalty (integer)?","totalLoss_4":"MSE=20, λ=2, penalty=10. J (integer)?","l1Penalty_0":"Weights $w_1=2$, $w_2=-3$, $w_3=1$. L1 penalty $\\sum |w_j|$ (integer)?","l1Penalty_1":"Weights $w_1=1$, $w_2=2$, $w_3=3$. L1 penalty (integer)?","l1Penalty_2":"Weights $w_1=-1$, $w_2=2$. 
L1 penalty $|w_1|+|w_2|$ (integer)?","l1Penalty_3":"Weights $w_1=4$, $w_2=0$, $w_3=3$. L1 penalty (integer)?","l1Penalty_4":"Weights $w_1=5$, $w_2=5$. L1 penalty (integer)?","concept_0":"'Generalization' in regularization means? ① Fit training only ② Do well on unseen data ③ More data","concept_1":"In bias–variance tradeoff, stronger regularization? ① Increases variance ② Decreases variance ③ Only increases bias","concept_2":"Adding a penalty to the loss causes weights to? ① Grow without bound ② Be penalized if too large ③ Always be zero","concept_3":"A practical reason for Lasso (L1) is? ① Faster than L2 ② Sparse weights, interpretable ③ Always better than L2","concept_4":"Ridge (L2) and Lasso (L1) together is? ① Elastic Net ② Dropout ③ Batch Norm","concept_5":"When tuning λ we compare? ① Training loss ② Validation (or CV) performance ③ Parameter count","concept_6":"When overfitting badly, try? ① Decrease λ ② Increase λ or more data ③ More complex model","concept_7":"'Rote memorizer' in the analogy is? ① Model overfitting training ② Well-generalizing model ③ Model with large λ","concept_8":"In J = MSE + λ·(L2 penalty), if λ=0? ① Only penalty ② No regularization (same as OLS) ③ Same as L1","concept_9":"If validation error is much larger than training error? ① Underfitting ② Overfitting ③ Good fit"}},"mlRecommendationProblemPrompt":"Read the problem and choose the correct option below.","mlRecommendationProblemSolvingLabel":"Explanation for problem solving","mlRecommendationSubjectivePrompt":"Write a one-line reason (ungraded).","mlRecommendationSubjectivePlaceholder":"E.g., we predict the blank by averaging neighbors' ratings using similarity as weights.","mlRecommendationVisualIntro":"From the user-item rating matrix, find similar users (neighbors) and predict missing entries using their ratings.","mlRecommendationVisualStep0":"① Rating matrix — Rows: users, Columns: items. 
Known ratings and blanks (?)","mlRecommendationVisualStep1":"② Similarity — Compute how similar users (or items) are","mlRecommendationVisualStep2":"③ Neighbor selection — Select the K most similar neighbors","mlRecommendationVisualStep3":"④ Prediction — Weighted average of neighbors' ratings to fill the blank","mlRecommendationVisualStep4":"⑤ Recommendation — Recommend items with high predicted scores","mlRecommendationVisualHowItWorks":"① Find neighbors → ② Use their ratings → ③ Predict empty cell → ④ Recommend","mlRecommendationVisualRowTitle":"Neighbors' ratings for this item → fill in my predicted rating","mlRecommendationVisualCardNeighbor1":"Neighbor 1 (similar user)","mlRecommendationVisualCardNeighbor2":"Neighbor 2 (similar user)","mlRecommendationVisualCardItem":"This item (I haven't seen it yet)","mlRecommendationVisualCardNeighbor1Short":"Neighbor 1","mlRecommendationVisualCardNeighbor2Short":"Neighbor 2","mlRecommendationVisualCardItemShort":"This item","mlRecommendationVisualCalc":"Avg prediction: $\\hat{r}_{u,i}=\\frac{5+4}{2}=4.5\\approx4$ (neighbors rated ★5 and ★4) → predict ★4","mlRecommendationVisualBottomDesc":"Similar users gave this item ★5, ★4 → we recommend ★4!","mlRecommendationVisualCaption":"Collaborative filtering: predict $\\hat{r}_{u,i}$ from similar users.","mlRecommendationVisualAriaLabel":"Recommendation flow: rating matrix → similarity → neighbors → weighted average","mlRecommendation":{"chapter":"Chapter 14","title":"Collaborative Filtering: Recommendation Basics","description":"Have you ever seen 'You might also like' on Netflix? **Collaborative filtering** recommends items that users with similar tastes liked. This chapter covers the rating matrix, similarity, neighbor-based prediction, and how it is used in practice.","sectionTitle":"Recommendation basics: Collaborative filtering","whatIs":{"0":"**What is collaborative filtering?** — It uses **other users' behavior** (ratings, clicks, purchases) to recommend items to you. 
The idea is that people with similar tastes tend to like similar things. It is widely used in streaming, e-commerce, and music apps.","1":"**Intuition: borrowing from neighbors** — For movie recommendations, if someone who liked the same movies A and B as you also liked C, you might like C. Those **similar users** are **neighbors**, and predicting from their ratings is the core of collaborative filtering.","2":"**Math: rating matrix and prediction** — The **rating matrix** has size (users × items); many entries are missing (sparse). **User-based** collaborative filtering finds **neighbors** of user $u$, then fills a missing rating for item $i$ with a **weighted average** of the neighbors' ratings. Similarity is often measured by **cosine similarity** or **Pearson correlation**.","3":"**In practice** — **Cold start** (new users/items have no neighbors) and **sparsity** make pure collaborative filtering hard, so it is often combined with **content-based** methods or **matrix factorization**."},"whyImportant":{"0":"**Recommendations drive business and UX** — Good recommendations increase engagement and revenue. Collaborative filtering personalizes results using behavior data alone, without rich metadata.","1":"**Core ML application** — Recommendation is a different kind of problem: we fill in missing entries of a matrix. Understanding collaborative filtering is a step toward matrix factorization and deep learning-based recommenders."},"howUsed":{"0":"**User-based vs item-based** — **User-based**: find users similar to you and recommend what they liked. **Item-based**: find items similar to the one you are viewing ('Users who bought this also bought'). Both use similarity and neighbors.","1":"**Similarity and prediction** — Similarity $s_{u,v}$ between users is computed, then prediction uses a weighted average of neighbors' ratings. 
Metrics like **MAE** and **RMSE** are used for evaluation.","2":"**Matrix factorization** — Advanced methods approximate the rating matrix by a product of lower-rank matrices. **Hybrid** systems combine collaborative filtering with content or context."},"problemSolving":{"0":"**Collaborative filtering** — Use other users' **behavior** to find **neighbors**, then fill missing $\\hat{r}_{u,i}$ by **simple** or **weighted** average. **Rating matrix**: rows=users, columns=items, often **sparse**. Cold start / sparsity: combine with content, MF, hybrid.","1":"**Example (summary)**\n\n- **Definition** — Based on other users' **behavior**\n- **Matrix** — Rows×cols = cell count\n- **Simple avg** — $\\hat{r}=\\frac{1}{K}\\sum r$\n- **Weighted avg** — $\\hat{r}=\\frac{\\sum s\\,r}{\\sum|s|}$\n\n---\n\n**Example (definition)**\n\nClosest to collaborative filtering? ① other users' behavior ② genre only ③ random\n\n①. → **Answer 1**\n\n---\n\n**Example (simple average)**\n\nRatings 3,4,5 → mean?\n\n$4$. → **Answer 4**\n\n---\n\n**Example (cells)**\n\n3 users, 4 items → cells?\n\n$12$. → **Answer 12**\n\n---\n\n**Example (weighted)**\n\nRatings 4,5,3 weights 2,1,1 → weighted mean?\n\n$4$. → **Answer 4**"},"visual":"Visualization of rating-matrix recommendation flow.","problems":{"definition_0":"Collaborative filtering is? ① Recommendation based on other users' behavior (ratings, clicks) ② Recommendation based on item features (e.g. genre) ③ Random recommendation","definition_1":"A method that recommends what 'similar users' liked is? ① Collaborative filtering ② Supervised learning ③ K-Means","definition_2":"In user-based collaborative filtering, 'neighbors' are? ① Users with similar taste ② Users in the same region ③ Users in the same age group","definition_3":"In the rating matrix, rows and columns are? ① Rows=users, Columns=items ② Rows=items, Columns=users ③ Rows=time, Columns=rating","definition_4":"Cold start is? 
① New users/items have no neighbors, so recommendation is hard ② Server stops ③ Too many ratings","definition_5":"Similarity in collaborative filtering is used to? ① Find similar users (or items) ② Normalize ratings ③ Compress the matrix","definition_6":"Filling blanks using neighbors' ratings is? ① A core step of collaborative filtering ② Preprocessing ③ An evaluation metric","definition_7":"Cosine similarity and Pearson correlation are? ① Similarity measures between users (or items) ② Loss functions ③ Activation functions","definition_8":"Item-based collaborative filtering? ① Finds similar items to recommend ② Uses only similar users ③ Does not use the rating matrix","definition_9":"Sparsity means? ① Most entries of the matrix are missing ② Too many ratings ③ Too many users","definition_10":"MAE and RMSE in recommendation are? ① Evaluation metrics for prediction accuracy ② Similarity metrics ③ Matrix size","definition_11":"Hybrid recommendation? ① Combines collaborative + content-based etc. ② Uses only collaborative ③ No recommendation","trueFalse_0":"Collaborative filtering uses other users' ratings for recommendation. Enter 1 if true, 0 if false.","trueFalse_1":"More neighbors (larger K) always give more accurate predictions. Enter 1 if true, 0 if false.","trueFalse_2":"The rating matrix is usually sparse (most entries are missing). Enter 1 if true, 0 if false.","trueFalse_3":"Cold start refers to difficulty in recommending to new users. Enter 1 if true, 0 if false.","trueFalse_4":"Both user-based and item-based use similarity and neighbors. Enter 1 if true, 0 if false.","trueFalse_5":"Prediction can only be the simple average of neighbor ratings. Enter 1 if true, 0 if false.","trueFalse_6":"Matrix factorization is used to predict missing ratings. Enter 1 if true, 0 if false.","trueFalse_7":"Collaborative filtering alone can fully solve cold start. Enter 1 if true, 0 if false.","trueFalse_8":"Collaborative filtering is widely used in Netflix and e-commerce. 
Enter 1 if true, 0 if false.","choice_0":"The main idea of collaborative filtering is? ① Use similar users' behavior ② Use only item descriptions ③ Random choice","choice_1":"One cell in the rating matrix means? ① One user's rating for one item ② Number of users ③ Number of items","choice_2":"To predict from K neighbors' ratings we use? ① Average (or weighted average) ② Maximum ③ Minimum","choice_3":"Similarity is used to? ① Choose similar neighbors ② Normalize ratings ③ Compress the matrix","choice_4":"A sparse matrix causes? ① Unstable similarity estimates ② Faster computation ③ No users","choice_5":"Recommendation quality is measured by? ① MAE, RMSE ② Similarity ③ Matrix size","choice_6":"Item-based recommendation finds 'similar items' using? ① Item-item similarity ② Number of users ③ Sum of ratings","choice_7":"To ease cold start we use? ① Content-based, hybrid ② Collaborative only ③ No recommendation","scenario_0":"Hard to recommend to a new user because? ① Cold start (no neighbors/ratings) ② Too many ratings ③ Similarity is 1","scenario_1":"'Users who bought this also bought' is close to? ① Item-based collaborative filtering ② User-based only ③ Random","scenario_2":"Hard to recommend a new movie with few ratings? ① Cold start (item side) ② Too many neighbors ③ Similarity is 0","scenario_3":"Combining collaborative filtering with genre/tags is? ① Hybrid ② Collaborative only ③ Content only","scenario_4":"'For you' recommendations like Netflix are based on? ① Personalization (collaborative, content, etc.) ② Same for everyone ③ Ads only","scenario_5":"When the matrix is very sparse, to improve quality we? ① Use matrix factorization, hybrid, etc. ② Just increase K ③ Delete ratings","concept_0":"When choosing K neighbors, K is? ① A hyperparameter set by the user ② Always 1 ③ Always all users","concept_1":"In weighted average prediction, the weights are? ① Similarity ② Ratings only ③ Random","concept_2":"Matrix factorization aims to? 
① Predict missing entries, reduce dimension ② Delete ratings ③ Remove similarity","concept_3":"The size (number of cells) of the rating matrix is? ① (Number of users)×(Number of items) ② Number of users only ③ Number of items only","concept_4":"Neighbor ratings 3, 4, 5. Simple average prediction (integer)? ① 4 ② 5 ③ 3","concept_5":"In user-based filtering, prediction uses? ① Neighbors' ratings for that item ② Only my past ratings ③ Only item descriptions","concept_6":"Lower MAE means? ① Predictions are closer to actual ② Predictions are worse ③ No relation","concept_7":"Content-based recommendation uses? ① Item features (genre, tags) ② Collaborative only ③ Random","concept_8":"To ease cold start we use? ① Content, popular items, hybrid ② Just increase K ③ Stop recommendation","neighborPredict_0":"Three neighbors' ratings are 3, 4, 5. Mean prediction (integer)?","neighborPredict_1":"Three neighbors' ratings are 2, 4, 6. Mean prediction (integer)?","neighborPredict_2":"Three neighbors' ratings are 4, 4, 4. Mean prediction (integer)?","neighborPredict_3":"Three neighbors' ratings are 1, 3, 5. Mean prediction (integer)?","neighborPredict_4":"Four neighbors' ratings are 2, 2, 4, 4. Mean prediction (integer)?","neighborPredict_5":"Three neighbors' ratings are 5, 5, 5. Mean prediction (integer)?","matrixCells_0":"3 users, 4 items. Number of cells in the rating matrix (integer)?","matrixCells_1":"5 users, 6 items. Number of cells (integer)?","matrixCells_2":"2 users, 10 items. Number of cells (integer)?","matrixCells_3":"4 users, 5 items. Number of cells (integer)?","matrixCells_4":"6 users, 5 items. Number of cells (integer)?","weightedPredict_0":"Ratings 4, 5, 3 with weights 2, 1, 1. Weighted average prediction (integer)?","weightedPredict_1":"Ratings 3, 5 with weights 1, 1. Weighted average prediction (integer)?","weightedPredict_2":"Ratings 5, 3, 4 with weights 2, 2, 2. Weighted average prediction (integer)?","weightedPredict_3":"Ratings 2, 4 with weights 1, 1. 
Weighted average prediction (integer)?","weightedPredict_4":"Ratings 5, 5, 1 with weights 1, 1, 2. Weighted average prediction (integer)?"}}},"mlCh01":{"chapter":"Chapter 01","title":"Missing Value Handling: Strategies to Fill Data Gaps","description":"Real-world data often has missing values—empty cells like in a spreadsheet. Ignoring them can halt training or yield biased results. This chapter walks through filling those gaps, screening extreme values (**outliers**), and correcting skewed class ratios (**class imbalance**)—a practical **data quality pipeline** that underpins reliable machine learning.","sectionTitle":"Missing Value Handling: preprocessing that reduces gaps and raises trust","whatIs":{"0":"**What is a missing value?** An empty cell in a data table—like a puzzle with a piece missing. In practice they come from skipped survey answers, sensor failures, data transfer loss, and more.","1":"**Missingness mechanisms (MCAR/MAR/MNAR)** ask *why* the blank appeared. **MCAR** (*Missing Completely at Random*) is like coffee spilled on a form—chance alone. **MAR** (*Missing at Random*) is like male respondents leaving “cosmetics spend” empty—linked to *other* observed variables. **MNAR** (*Missing Not at Random*) is like low-income people leaving “income” blank—the missingness itself carries meaning.","2":"**Handling strategies** fall into three broad types: **listwise deletion**, **single imputation** (fill with one value), and **multiple imputation** (fill several times and pool). Each trades off how much data you keep, speed, and statistical rigor—pick to fit the situation.","3":"**Single vs multiple imputation:** **Single imputation** fills each gap once with e.g. the mean or mode—fast but risky. 
**Multiple imputation** builds several plausible completed datasets (parallel “worlds”) and pools results for a more careful conclusion.","4":"**Two views on outliers:** **Univariate detection (box plot)** flags extreme values in one variable; **multivariate detection (Mahalanobis / Isolation Forest / SVDD)** flags odd *combinations* across variables. They answer different questions—in practice you often check both.","5":"**Class imbalance correction:** When one class dominates, models may behave as if the rare class barely exists. Practitioners combine Tomek Links (boundary cleaning), SMOTE/ADASYN (synthetic minority samples), and SMOTE+Tomek (synthesize then clean).","6":"**Core message:** Missing-value handling is not a standalone trick—it is **one pipeline design problem** tied to outlier checks and imbalance correction."},"whyImportant":{"0":"**Systems hate blanks.** If you leave gaps, the pipeline may error—like an OMR sheet that cannot be scored without marks.","1":"**Bad fills mislead.** Filling everything with 0 or the mean breaks the true distribution; the model may treat imputed values as real and become **overconfident**.","2":"**Preprocessing is a set menu.** Filling missing values is not the end—you should plan outlier screening and imbalance handling in the same breath so the model behaves in production.","3":"**Fairness and safety:** If missingness differs by group (MAR/MNAR), careless imputation can widen performance gaps between groups—check bias signals early.","4":"**It beats model choice to the punch:** With the same algorithm, better preprocessing can change outcomes more than swapping models—often “good data flow” wins over “good model name.”","5":"**Deployment stability:** If you define rules for missingness, outliers, and imbalance up front, new data can be handled consistently—retraining and monitoring get easier."},"howUsed":{"0":"**End-to-end flow:** EDA → hypothesize why values are missing → choose imputation → catch extremes (**outlier 
detection**, e.g. box plot) → adjust class mix (**imbalance correction**, e.g. SMOTE) → then train and evaluate.","1":"**Single-imputation formulas:** Mean fill: $x_{miss} \\leftarrow \\bar{x}$; median fill: $x_{miss} \\leftarrow \\mathrm{median}(x)$.","2":"**Multiple imputation:** Build $m$ completed datasets (“parallel worlds”), then pool estimates $\\theta_k$: $\\bar{\\theta}=\\frac{1}{m}\\sum_{k=1}^{m}\\theta_k$.","3":"**Box plot (IQR) rule:** Fences from $Q_1-1.5\\times IQR$ to $Q_3+1.5\\times IQR$; points outside are outlier *candidates*.","4":"**Covariance:** Measures how two variables move together—e.g. do taller people tend to weigh more? $\\mathrm{cov}(X,Y)=\\mathbb{E}[(X-\\mu_X)(Y-\\mu_Y)]$. Stacking covariances yields $\\Sigma$, which sets the orientation and stretch of the multivariate “cloud” (ellipses).","5":"**Mahalanobis distance:** Not plain Euclidean distance—it uses $\\Sigma^{-1}$ to weight directions by spread: $D_M(\\mathbf{x})=\\sqrt{(\\mathbf{x}-\\boldsymbol\\mu)^\\top\\Sigma^{-1}(\\mathbf{x}-\\boldsymbol\\mu)}$ (covariance is central).","6":"**Isolation Forest:** Outliers are points that become **isolated quickly** under random splits—few splits needed to separate them (short path length), often in high dimensions with weak distributional assumptions.","7":"**SVDD (one-class):** Learn a **boundary** around normal data (minimum-volume sphere or kernel-shaped region) and flag points outside as outliers—common in one-class anomaly detection.","8":"**Class imbalance:** With a very rare positive class, accuracy can look high while the model ignores positives—use Recall, Precision, F1, PR-AUC together and resample when needed.","9":"**Tomek Links:** Pairs of opposite-class mutual nearest neighbors near the boundary—often remove the majority point (or both) to **clean** overlap (undersampling-based cleaning).","10":"**SMOTE:** Interpolate between a minority point $\\mathbf{x}$ and a neighbor $\\mathbf{x}_{nn}$: 
$\\mathbf{x}_{new}=\\mathbf{x}+\\lambda(\\mathbf{x}_{nn}-\\mathbf{x})$, $\\lambda\\sim U(0,1)$—richer than copy-paste but can add bad samples if the boundary is noisy.","11":"**Hybrid resampling (e.g. SMOTE+Tomek):** **Oversample** the minority with SMOTE, then **clean** ambiguous boundary pairs with Tomek—think **oversample → clean**.","12":"**ADASYN:** Like SMOTE but allocates **more** synthetic samples to “hard” minority regions (surrounded by majority)—denser support where the classifier struggles."},"summary":"**One-page cheat sheet**\n- There is no universal “magic” imputation—start by **why** data are missing (**MCAR/MAR/MNAR**).\n- **Single imputation** is fast but ignores uncertainty; **multiple imputation** is stronger statistically but costs more compute.\n- Check outliers in **both** **univariate (box plot)** and **multivariate (Mahalanobis / Isolation Forest / SVDD)** ways to miss fewer cases.\n- For imbalance combine **Tomek (clean)**, **SMOTE/ADASYN (synthesize)**, **SMOTE+Tomek (hybrid)** as needed.\n- Always compare metrics before and after preprocessing (Recall, F1, PR-AUC, etc.) 
to see real gains.","problemSolving":{"0":"$4d","1":"$4e"},"sectionLabels":{"whatIs":"Concept","whyImportant":"Intuition","howUsed":"Math","summary":"Practical use","problemSolving":"Problem solving"},"problemSolvingLabel":"How to approach the problems","imputationTable":{"title":"Common Single-Imputation Values/Methods","caption":"A compact table of common single-imputation methods with definitions and formulas.","headers":{"method":"Value/Method","definition":"Definition (short formula)"},"rows":{"0":{"method":"Mean","definition":"Impute with sample mean: $x_{miss} \\leftarrow \\bar{x}=\\frac{1}{n}\\sum_{i=1}^{n}x_i$"},"1":{"method":"Median","definition":"Impute with median: $x_{miss} \\leftarrow \\mathrm{median}(x)$"},"2":{"method":"Mode","definition":"Impute with most frequent value: $x_{miss} \\leftarrow \\arg\\max_v\\,\\mathrm{count}(x=v)$"},"3":{"method":"Regression · KNN · Hot-deck","definition":"Regression: $\\hat{x}=f(\\mathbf{z})$, KNN: $x_{miss}\\leftarrow\\frac{1}{k}\\sum_{j\\in N_k}x_j$, Hot-deck: $x_{miss}\\leftarrow x_{donor}$"}}},"practiceProblemsTitle":"Practice problems","practiceProblemsIntro":"Ten questions drawn at random from a pool of 60: 4 easy, 3 medium, 3 hard.","practiceProblemsInstruction":"Choose one of ①–④, then press Check answer.","checkAnswer":"Check answer","correctAnswer":"Correct!","wrongAnswer":"Incorrect. 
Try again.","testCodeLabel":"Test code","visualIntro":"Data quality pipeline from missing-value handling to outlier/imbalance correction","visualStep0":"Detect missingness: rate and pattern","visualStep1":"Handle missingness: deletion / single vs multiple imputation","visualStep2":"Outlier detection: Box Plot, Mahalanobis, Isolation Forest, SVDD","visualStep3":"Imbalance correction: Tomek, SMOTE, ADASYN, SMOTE+Tomek","visualStep4":"Train and validate: check generalization","visualAriaLabel":"Diagram of missing-value handling and data quality improvement","problemSolvingFallback":"Identify MCAR/MAR/MNAR → choose single vs multiple imputation → screen outliers with Box Plot / Mahalanobis / Isolation Forest / SVDD → apply Tomek / SMOTE / ADASYN / hybrid resampling.","visualDiagram":{"hintStep0":"Observe: look at the missingness pattern first","hintStep1":"Choose: single vs multiple imputation","hintStep2":"Check: outliers (univariate / multivariate)","hintStep3":"Correct: imbalance (synthesize → clean)","clickMechanismCards":"Click the MCAR · MAR · MNAR cards below to change the pattern.","pipelineNavAria":"Pipeline steps","chipPattern":"Missing pattern","chipImpute":"Imputation","chipOutlier":"Outliers","chipImbalance":"Imbalance","panelDetectTitle":"Missing detection (pattern)","badgeMcar":"MCAR (random)","badgeMar":"MAR (conditional)","badgeMnar":"MNAR (value-dependent)","legendObserved":"Observed","legendMissing":"Missing","gridColorHint":"Cell colors hint at “why is this blank?”","tooltipObserved":"Observed","tooltipMissing":"Missing","mcarLine1":"MCAR","mcarLine2":"Missing Completely At Random","mcarLine3":"Scattered pattern—could be “pure chance”","marLine1":"MAR","marLine2":"Missing At Random","marLine3":"Vertical bands—missingness when certain conditions hold","mnarLine1":"MNAR","mnarLine2":"Missing Not At Random","mnarLine3":"Concentrated in tails—the blank itself carries 
meaning","panelImputeTitle":"Missing handling: single vs multiple imputation","imputePhase0":"Check blanks","imputePhase1":"Single imputation","imputePhase2":"Multiple imputation","imputePhase3":"Pool","singleTitle":"Single imputation (1×)","singleLead":"Each blank gets the same single fill","singleFoot":"Filling once is **fast** but can make data look “less noisy” than it is (underestimated variance).","multiTitle":"Multiple imputation (m×)","multiLead":"Several plausible fills → pool mean and uncertainty at the end","multiFoot":"Impute several times, then **pool (mean / variance)** to reflect uncertainty.","boxTitle":"Univariate outliers: box plot (IQR)","boxPhase0":"Box (Q1–Q3)","boxPhase1":"Fences (1.5×IQR)","boxPhase2":"Past fence = candidates","boxChip1":"Box","boxChip2":"Fences","boxChip3":"Outside points","boxPlotStagesAria":"Box plot steps","fenceLower":"Lower fence","fenceUpper":"Upper fence","boxSummary":"In short: **Q1·Q3 → IQR → 1.5×IQR fences**—points outside are outlier candidates.","mvTitle":"Multivariate outliers: odd “combinations”","mvPhase0":"Distance (covariance)","mvPhase1":"Isolation (short path)","mvPhase2":"Boundary (normal region)","mahalPara1":"When axes move together (covariance), points form an **elliptical cloud**. Inside is common; **far outside the ellipse** is suspicious.","mahalPara2":"Distance that respects correlation","mahalBadge":"Far from ellipse → candidate","ifPara1":"Under random cuts, a point that is **isolated in few splits** is easy to separate—remember that intuition.","ifPara2":"Isolated quickly under random splits","ifBadge":"Short path → candidate","svddPara1":"Wrap normals in a **balloon-like boundary**. 
Inside=familiar; **outside=unfamiliar**.","svddPara2":"Learn a boundary around normal data","svddBadge":"Outside boundary → candidate","imbTitle":"Class imbalance: SMOTE/ADASYN + Tomek Links","imbIntro":"**Tomek Links** finds pairs of mutual nearest neighbors across classes near the boundary and often removes the **majority** point to clear overlap.","imbSmoteAdasynIntro":"**SMOTE** creates synthetic minority points by interpolating between a minority sample and its neighbors; **ADASYN** allocates *more* synthetic samples to ‘hard’ minority regions surrounded by the majority, densifying the boundary neighborhood.","imbPhase0":"Minority squeezed at the boundary","imbPhase1":"Synthesize to fill gaps","imbPhase2":"Clean boundary with Tomek","imbWhyTitle":"Why it matters","imbWhyBody":"With strong imbalance, the model can look good while mostly predicting the majority—check recall/F1 and fix the data too.","imbMajor":"Majority (85%)","imbMinor":"Minority (15%)","imbHowTitle":"How to fix it? 
(visual)","imbHowLead":"Picture a **curved** decision boundary—SMOTE and Tomek fit boundary clutter naturally.","imbChip0":"Noisy boundary","imbChip1":"Synthesize to fill","imbChip2":"Clean with Tomek","imbChip2Title":"Pairs of mutual nearest neighbors across classes: often remove the majority side to tidy the boundary.","imbTomekCallout":"Yellow ring: **majority (gray)** points intruding on the boundary are Tomek candidates; after cleaning they fade and the boundary is cleaner.","chartDenseTop":"Top: dense majority","chartSparseBottom":"Bottom: minority (+ synthetic)","imbBoundaryMsg":"Near the boundary, misclassification noise grows easily","imbFlow1":"Flow: SMOTE/ADASYN densifies the minority neighborhood → **Tomek Links** removes points in **mutual nearest-neighbor pairs** across classes at the boundary (typically the majority side) to tidy it","imbFlow2":"Intuition: after synthesis, pair points that are nearest neighbors yet from different classes—then drop the **majority** points that clutter the boundary.","legMinor":"Minority","legMajor":"Majority","legSyn":"Synthetic (SMOTE/ADASYN)","legCurve":"Curved boundary","pointTitleMajor":"Majority","pointTitleMajorTomek":"Majority intruding (clean-up candidate)","pointTitleSyn":"Synthetic (SMOTE/ADASYN)"}},"mlCh07":{"chapter":"Chapter 07","title":"XGBoost, LightGBM, CatBoost: Tabular ML Powerhouses","description":"When you work with spreadsheet-like **tabular data**, a family of models often beats even heavy deep learning: **gradient boosting**. 
Boosting lines up many \"average students\" (weak learners) in order; each one studies the mistakes the previous models still make, until the team acts like a single **strong predictor**.\n\nThis chapter dissects **XGBoost, LightGBM, and CatBoost**—the trio behind countless production systems and Kaggle solutions—and gives you **clear rules for which tool fits your dataset**.","sectionTitle":"CH07 — The boosting trio: mastering residuals one tree at a time","whatIs":{"0":"**1. Core idea: sequential error notebooks**\n\n**Concept:** Boosting chains decision trees **in sequence**. Each new tree focuses on the **residuals** (errors) left by the ensemble so far.\n\n**Intuition:** Picture a study group before an exam. Student 1 takes a practice test and writes an **error notebook** of mistakes. Student 2 drills only those questions. Student 3 fixes what student 2 still misses. Repeat many rounds and the group’s combined score skyrockets.\n\n**Key update:** $F_t(x)=F_{t-1}(x)+\\eta h_t(x)$\n\n- $F_t(x)$: prediction after stage $t$\n- $F_{t-1}(x)$: prediction before adding the latest tree\n- $h_t(x)$: new tree trained to reduce the **remaining error**\n- $\\eta$: **learning rate**—how aggressively you trust the new tree (smaller $\\eta$ often means you need more trees but can be more stable)\n\n**Practice:** Loan default, churn, CTR, and many other **row-and-column** tasks still treat boosting as a top-tier baseline.","1":"**2. XGBoost: stable, regularized workhorse**\n\n**Concept:** The library that popularized modern gradient boosting. It optimizes loss while penalizing overly complex trees through built-in **regularization** terms, which tends to make training predictable and robust.\n\n**Intuition:** A strict teacher who cares about progress **and** about stopping you from \"memorizing\" the textbook—penalties kick in when the model gets too wiggly (overfitting).","2":"**3. 
LightGBM: speed for huge datasets**\n\n**Concept:** Built for scale when millions of rows made classic boosting slow. It uses **histogram-based** binning to cut computation and usually grows trees **leaf-wise**—splitting the leaf that most reduces loss—instead of expanding an entire level at a time (level-wise).\n\n**Intuition:** Like skipping chapters you already know and camping on the one chapter most likely to be on the exam: **maximum efficiency**, but you can **over-drill** one corner of the space.\n\n**Caveat:** Leaf-wise trees overfit more easily on **small** data. Tune **`max_depth`**, **`min_data_in_leaf`**, and related knobs.","3":"**4. CatBoost: categoricals without the headache**\n\n**Concept:** From Yandex—the name merges **Cat**egory and **Boost**. It is strong on **high-cardinality categorical** features (city, job title, product ID) with less manual encoding drama.\n\n**Intuition:** Think of taking an exam: you should solve each question **without peeking at later answers**. In tabular ML, mixing future information into training causes **target leakage** and inflated scores. CatBoost’s design (including **ordered / permutation-based** ideas) is built to reduce that \"peeking ahead\" risk. That is why it often works well even with strong **default settings**.","4":"$4f","5":"$50"},"whyImportant":{"0":"**The go-to baseline for tabular work**\n\nFor many **database / CSV** problems, gradient boosting is **fast, accurate, and simpler to iterate** than a full deep-learning stack. Teams routinely reach for it **before** designing exotic neural nets.","1":"**Pick the weapon to match the data**\n\n- Need **stability** and mature tooling on medium-sized data? **XGBoost**\n- Need **training speed** and memory efficiency at **very large** scale? **LightGBM**\n- Drowning in **categorical columns** and want sane defaults? 
**CatBoost**","2":"**Hyperparameters are the steering wheel**\n\n`learning_rate`, tree depth / leaves, `n_estimators`, early stopping—these jointly control the **bias–variance trade-off** and compute cost. Understanding how they interact lets you **tune without guessing**."},"howUsed":{"0":"**① Pipeline pattern**\n\nClean missing values and categories $\\rightarrow$ split **train / validation** $\\rightarrow$ fit a booster $\\rightarrow$ explain with **SHAP** or feature importance for stakeholders $\\rightarrow$ ship and monitor.","1":"**② Early stopping**\n\nMore trees are not always better; eventually you memorize the training set. When **validation loss** plateaus or worsens, **stop** and keep the best iteration. In production this is standard practice.","2":"**③ Align metrics with the business**\n\n- **Classification (churn, fraud):** look beyond accuracy—**AUC**, **F1**, precision/recall at a chosen threshold.\n- **Regression (demand, price):** track **RMSE** / **MAE** in units stakeholders understand."},"summary":"**Cheat-sheet**\n\n| Model | Keywords | Strengths | Watch-outs |\n| :--- | :--- | :--- | :--- |\n| **XGBoost** | Regularization, stability | Reliable default; strong on many tabular workloads | Can be slower on very large data |\n| **LightGBM** | Speed, leaf-wise | Fast training; memory-friendly bins | Overfits more easily on small data |\n| **CatBoost** | Categories, defaults | Less hand-built encoding; strong out-of-box | Heavier; model size can grow |\n\nAll three share the boosting idea: **reduce residuals stage by stage and blend many trees**.","problemSolving":{"0":"**Practice problem playbook**\n\n- Practice items are **four-choice multiple choice**; pick the option that matches the prompt (numeric items show the value as one of the choices).\n- **LightGBM / leaf-wise** items often pair with **`max_depth`**, **`min_data_in_leaf`**, or **`num_leaves`** constraints that fight overfitting.\n- **Model choice** questions: match **data volume**, 
**categorical load**, and **latency** to the table above.\n- Theory items: locate definitions of $F_t=F_{t-1}+\\eta h_t$, the **regularizer** $\\Omega(f)$, **histogram binning**, and **ordered statistics** inside the question text first."},"sectionLabels":{"whatIs":"What It Is","whyImportant":"Why It Matters","howUsed":"How It Is Used","summary":"Summary","problemSolving":"Problem solving & tips"},"problemSolvingLabel":"Problem Solving Notes","practiceProblemsTitle":"Practice Problems","practiceProblemsIntro":"10 random questions are sampled from 60 in total (easy→medium→hard, 4-3-3).","practiceProblemsInstruction":"Read the prompt, select one of the four choices (①–④), then press Check.","boostingTestCodeLabel":"Test code","boostingVisualIntro":"Each new tree reduces the remaining error from previous trees.","boostingVisualIntroPanels":"These libraries grow trees differently: level-wise, leaf-wise, and symmetric (oblivious). Watch each panel animate in turn.","boostingVisualAriaLabel":"Comparison of XGBoost level-wise, LightGBM leaf-wise, and CatBoost symmetric tree growth","boostingVisualTitleXgb":"XGBoost","boostingVisualTitleLgb":"LightGBM","boostingVisualTitleCat":"CatBoost","boostingVisualCaptionXgb":"Level-wise\nfills each depth before going deeper","boostingVisualCaptionLgb":"Leaf-wise\nsplits the leaf with the largest loss drop","boostingVisualCaptionCat":"Oblivious\nsame split rule at each depth (symmetric)","boostingVisualPhaseCaption0":"① XGBoost — level-wise: completes each depth before going deeper.","boostingVisualPhaseCaption1":"② LightGBM — leaf-wise: splits the leaf with the largest loss reduction first.","boostingVisualPhaseCaption2":"③ CatBoost — oblivious trees: identical splits at each depth (symmetric).","boostingVisualPhaseCaption3":"Side by side, the three growth rules are easy to contrast.","boostingVisualStep0":"① Initial model leaves large error","boostingVisualStep1":"② Trees sequentially fit residuals","boostingVisualStep2":"③ Later 
tree fixes harder patterns","boostingVisualStep3":"④ Final ensemble improves prediction","checkAnswer":"Check Answer","correctAnswer":"Correct!","wrongAnswer":"Try again."},"mathChapters":{"mathCumulativeVisualTitle":"Basic math concept flow","mathCumulativeVisualLabel":"Basic math chapter concept visual","sectionLabels":{"whatIs":"What the concept is","whyImportant":"Why it matters","howUsed":"How it is used","problemSolving":"Explanation for solving the problems"},"mathIntro":{"chapter":"Chapter 00","title":"Basic Math and AI: Learning the Language of AI","description":"Why math is needed to understand deep learning and machine learning, what math tools are used—we draw that map together.","sectionTitle":"Why do we need math to understand deep learning and machine learning?","visualIntro":"Visualize how AI inputs pass through math to become predictions.","visualInputLabel":"Input","visualInputTypes":"Image, text, sound","visualMathLabel":"Basic math","visualMathTopics":"Functions · vectors · matrices","whatIs":{"0":"**Understanding AI requires math as a lens** — Deep learning and machine learning turn the images, text, and sound we give them into **numbers**. Those numbers pass through **functions** and repeated **multiplication and addition** to find the answer. Because this whole process is written in math, knowing math lets you read the **inner workings** of AI clearly.","1":"**What math tools will we use?** — We will learn **functions** (rules that map input to output), **vectors and matrices** (bundling lots of data for batch computation), **differentiation** (so the model can learn and move toward the right answer), and **probability and distributions** (to measure how likely an outcome is). These tools together build the intelligence of AI.","2":"**In short** — AI runs on a solid foundation of numbers and functions. 
To interpret why AI produced a given result and to build better models, you need basic strength in **functions**, **limits**, **differentiation**, and **probability**. This course is the journey of building that foundation step by step."},"whyImportant":{"0":"**To understand why AI decides as it does** — Every decision AI makes is ultimately the result of **numbers and functions**. We learn functions and differentiation so we can follow the computation and logically understand **why that answer was produced**.","1":"**Where math works in the AI model** — Each **layer** of the model is a set of **functions** that multiply by weights and add. The process of the model learning and reducing error uses the concept of **gradient** (differentiation). Probability becomes the measure of how confident the AI is in its prediction.","2":"**The roadmap we will follow (Ch01–Ch12)** — This course proceeds in order: **Functions (Ch01–03)** (flow of data), **Limits and continuity (Ch04–05)** (foundations of change), **Differentiation (Ch06–08)** (heart of learning), **Integral (Ch09)** (accumulation and basis of probability), and **Probability and distributions (Ch10–12)** (uncertainty)."},"howUsed":{"0":"**The link between reality and math** — An AI model has the structure **input → turn into numbers → repeat functions → output**. **Functions** are the building blocks, **differentiation** is the chisel that shapes them to get smarter, and **probability** is the tool that checks the stability of the finished building. 
Once you master this basic math, the complex formulas of deep learning start to read like meaningful sentences."},"problemSolving":{"0":"| Category | Role in AI | Key math concepts |\n| --- | --- | --- |\n| **Input & output** | Basic framework for feeding data and getting answers | Functions, exponents, logarithms |\n| **Learning (Training)** | Process of reducing error to approach the correct answer | Limits, derivatives, chain rule |\n| **Prediction & decision** | Choosing the best among uncertain outcomes | Probability, statistics, normal distribution |"}},"mathFunctions":{"chapter":"Chapter 01","title":"Functions: The Basic Unit of AI That Connects Input and Output","description":"A function is a rule that assigns one output to each input. The way AI turns input into output is directly connected to this function concept.","sectionTitle":"What is a function?","visualIntro":"One input x gives exactly one output y. The diagram below shows the flow x → f → y.","visualCaption":"Example: x = 3 gives 7 for f(x) = 2x + 1","whatIs":{"0":"A **function** is a strict **mapping** between two sets. Every element of the **domain** (the set of inputs) must be connected to **exactly one** element of the **codomain** (the set of outputs). Just as a vending machine is broken if pressing a button gives no drink or two drinks at once, a function must have exactly one output for each input.","1":"We write **y = f(x)**. Here **x** is the **independent variable (cause)** and **y** is the **dependent variable (result)**. From an AI perspective, **x** is the **data** we provide (pixels, text, sensor values), and **y** is the **prediction** the AI computes. The function **f** acts as a **transformer** that turns this data into answers.","2":"An **AI model** itself is a huge **composite function**. Input data is transformed by the first function (layer), and that result is fed into the next function (layer); this repeats dozens of times. 
Just as we write $y = f(g(h(x)))$ in math, deep learning stacks many functions in layers to read complex patterns from data."},"whyImportant":{"0":"Because we can **model the real world**. A vague relation like \"more study leads to better grades\" can be expressed as a **linear function** $y = ax + b$, so we can compute expected grades ($y$) from study time ($x$). AI approximates far more complex nonlinear relations (e.g., images to object names) as functions to solve problems.","1":"Functions are the **object of optimization**. The goal of AI training is to minimize the error between the correct answer and the prediction. That error is computed by a **loss function**, and we use differentiation to find its minimum. Without functions, there would be no mathematical basis for training AI.","2":"They are the language of **change**. We need to know how much the output changes when the input changes a little (the slope) so that AI can move step by step toward the correct answer. Functions make the **cause–effect** relationship between input and output explicit in math, so we can analyze **why** the AI made a given decision."},"howUsed":{"0":"Every **neuron** in AI is a small **function**. It takes input signals ($x$), multiplies them by weights ($w$) and adds ($wx+b$), then passes the result through an **activation function** to the next neuron. Functions like **ReLU** and **Sigmoid** decide whether to pass the signal on; many such small functions together make complex decisions like the human brain.","1":"**Data transformation**: A photo is just a pile of numbers ($x$) to the computer. 
AI passes them through functions to shrink or expand dimensions and keep only key features ($y$) like \"ear shape\" or \"eye shape.\" That's mapping high-dimensional vectors to a lower-dimensional space.","2":"**Probability**: The **softmax** function at the last step of classification turns raw scores into \"probabilities that sum to 1.\" So the AI can say \"this image is 90% a dog.\" Functions turn raw data into information we can interpret."},"problemSolving":{"0":"| Function | Example (input → output) |\n| --- | --- |\n| $f(x)=x+1$ | 3 → 4, 10 → 11 |\n| $g(x)=2x$ | 3 → 6, 10 → 20 |\n| $h(x)=x^2$ | 3 → 9, $-2$ → 4 |","1":"In the visual below, **f(x) = 2x + 1** gives 7 for x = 3 and 21 for x = 10. Fill in the blank in the problem."}},"mathVideoExponential":{"chapter":"Chapter 02","title":"Exponents and Exponential Functions: The Math of Growth and Activation","description":"Exponentiation is repeated multiplication of the same base; an exponential function fixes the base and uses the exponent as the variable. Used in activation and loss design in deep learning.","sectionTitle":"What are exponent and exponential function?","visualIntro":"Fix a base $a$; for each exponent $x$ the value $a^x$ is determined. Below are examples for $2^x$.","visualCaption":"Example: $2^0=1$, $2^1=2$, $2^2=4$, $2^3=8$","whatIs":{"0":"An **exponent** is how many times a number (the **base**) is multiplied by itself. Like the fact that folding a piece of paper 42 times would reach the moon, repeated **multiplication** (not addition) makes values grow **explosively (exponential growth)**.","1":"An **exponential function** puts that repeated power in a variable: $y = a^x$. In polynomials the variable is in the base ($x^2$); in exponentials the variable is in the **exponent**. That means growth proportional to current size. If $a>1$, the value shoots up as $x$ increases (**exponential growth**); if $00$. 
AI cannot say \"the probability is -50%,\" so exponentials are essential when we need outputs to be **positive** (e.g. probabilities or positive scores).","1":"They **amplify small differences**. Inputs 1 and 2 differ by 1, but $10^1=10$ and $10^2=100$ differ by 90. AI uses this to **sharply separate** similar data and **classify** with confidence.","2":"**Efficient differentiation**: Backprop is a long chain of derivatives. The exponential $e^x$ keeps the same shape when differentiated (or stays in a simple form), which is crucial for fast, stable training."},"howUsed":{"0":"Used in the **softmax** function. When AI chooses one out of 1000 images, it applies $e^x$ to each score. Slightly higher scores get much larger values and lower ones shrink toward 0, so the model can say \"this is the answer with 99% confidence.\"","1":"The **sigmoid** function $y = \\frac{1}{1+e^{-x}}$ squeezes the input into (0, 1). The output never exceeds 1 or goes below 0, so the neuron acts like an on/off switch."},"problemSolving":{"0":"| Expression | Value |\n| --- | --- |\n| $2^0$ | 1 |\n| $2^1$ | 2 |\n| $2^2$ | 4 |\n| $2^3$ | 8 |\n| $2^4$ | 16 |\n| $3^2$ | 9 |\n| $3^3$ | 27 |","1":"In the visual below, $y = 2^x$ gives $1$ for $x=0$, $2$ for $x=1$, $4$ for $x=2$, $8$ for $x=3$. Use it to see how base and exponent relate.","2":"**Problem types and how to solve them**\n\n| Type | Description | How to get the answer |\n| --- | --- | --- |\n| **Find value** | $a^x = ?$ | Multiply base $a$ by itself $x$ times. E.g. $2^3 = 8$. |\n| **Find exponent** | $a^? = \\text{value}$ | \"How many times do we multiply $a$ to get this value?\" That count is the answer. E.g. $2^? = 8 \\Rightarrow 3$. |\n| **Compare** | Which is larger: 1) $a^{m}$, 2) $b^{n}$? | Compute each, then compare. If (1) is larger enter **1**, if (2) enter **2**. |\n| **Product, same base** | $a^p \\times a^q = a^?$ | **Add** exponents: $? = p + q$. 
(Rule: $a^p \\cdot a^q = a^{p+q}$) |\n| **Quotient, same base** | $a^p \\div a^q = a^?$ ($p \\ge q$) | **Subtract** exponents: $? = p - q$. (Rule: $a^p / a^q = a^{p-q}$) |\n| **Power of power** | $(a^p)^q = ?$ | **Multiply** exponents: $? = a^{p \\times q}$. (Rule: $(a^p)^q = a^{pq}$) |"}},"mathVideoLog":{"chapter":"Chapter 03","title":"Logarithm: From Multiplication to Addition, the Language of Loss Design","description":"A logarithm answers 'how many times we multiply the base to get this number?' It is the inverse of exponentiation and is used with exponentials in loss and probability in deep learning.","sectionTitle":"What is the logarithm?","visualIntro":"Logarithm is the inverse of exponent. $y = \\log_2 x$ means $2^y = x$. Below are the graphs of $y = \\log_2 x$ and its inverse $y = 2^x$.","visualCaption":"Example: $\\log_2 1 = 0$, $\\log_2 2 = 1$, $\\log_2 4 = 2$, $\\log_2 8 = 3$ (when $2^y = x$, $y$ is $\\log_2 x$)","visualLegend":"Purple: $y=\\log_2 x$, Teal: $y=2^x$","whatIs":{"definition":"The **logarithm** is like \"running exponentiation backward.\" In $2^3 = 8$, when you see the result 8 and ask \"**how many times** did we multiply 2 to get 8?\", that count (3) is the logarithm: $\\log_2 8 = 3$. Here 2 is the **base** and 8 is the **argument**.","example":"Think of it as **counting digits**. $100 = 10^2$ so $\\log_{10} 100 = 2$; $1000 = 10^3$ so $\\log_{10} 1000 = 3$. When the number grows 10×, the log value only goes up by 1. So log acts as a **filter** that turns explosively large numbers into much gentler ones. **Basic properties**: $\\log_a 1 = 0$ (base to the 0th power is 1), $\\log_a a = 1$ (base to the 1st power is itself).","logSumProduct":"**The magic of log** is that it turns multiplication into addition: $\\log_a(b \\times c) = \\log_a b + \\log_a c$. 
For computers, multiplication is costlier than addition and can overflow or underflow; taking the log turns that multiplication into a safer, simpler addition.","whyInAI":"The **argument condition ($x>0$)** matters: log of 0 or a negative number is undefined. So in AI code we often add a tiny constant ($\\epsilon$, epsilon) to avoid $\\log(0)$ errors. The **natural log** ($\\ln$, base $e$) keeps differentiation tidy and is the standard in deep learning."},"whyImportant":{"0":"**Avoiding underflow** is essential. If AI multiplies probability $0.1$ a hundred times ($0.1^{100}$), the computer may treat it as zero. Taking the log gives $\\log(0.1^{100}) = 100 \\times \\log(0.1) = -100$—a **meaningful number** the computer can still handle.","1":"It is the **ruler for information (entropy)**. The rarer an event, the larger (in absolute value) its log. A rare event (e.g. \"sun rises in the west\") carries high information; an obvious one (\"morning comes\") carries almost none. AI uses this log-based measure to see **how much surprising information** was learned.","2":"**It penalizes mistakes harshly**. For $y=\\ln x$ with $00$, $\\cos\\theta<0$\n2) So $\\tan\\theta=\\frac{\\sin\\theta}{\\cos\\theta}<0$\n\nSo the **answer is negative**.\n\n---\n\n**Example (Period calculation)**\n\nFind the period (degrees) of $y=\\sin(8x)$.\n\n**Solution**\n\n1) Use period formula $\\frac{360}{k}$\n2) With $k=8$, $\\frac{360}{8}=45$\n\nSo the **answer is 45**.\n\n---\n\n**Example (ML application, without direct $\\pi$ arithmetic)**\n\nFor $hour=6$, when a day (24h) is mapped to 360°, what is the angle and $\\sin\\theta$?\n\n**Solution**\n\n1) 24h = 360°, so 1h = 15°\n2) 6h = $6\\times15=90^\\circ$\n3) $\\sin90^\\circ=1$\n\nSo the **answer is 1**.\n\n(Equivalent formula form: $\\theta=2\\pi\\cdot\\frac{6}{24}=\\frac{\\pi}{2}$.)"},"summary":"**One-line summary:** Trigonometric functions are not just calculators from angles to ratios. 
They are a unified language for circular motion and waves, linking geometric intuition to practical AI applications like cyclic feature encoding and positional encoding.","problemSolvingLabel":"Explanation for solving the problems","practiceProblemsTitle":"Practice problems","practiceProblemsIntro":"From a bank of 60 problems, 10 are selected per session. Selection prefers non-overlapping problem types, and difficulty is ordered as easy → medium → hard.","problemPromptQuadrantSign":"Find the sign of {func} in Quadrant {quadrant}. (positive=1, negative=-1)","problemPromptPeriodDeg":"period (in degrees)?","problemPromptIntSum":"Integer-sum problem: {a} + {b} = ?","problemPromptUnitCircleCoord":"On the unit circle, at θ={deg}°, find the value of {axis}.","problemPromptCoterminalAngle":"Choose the coterminal angle in 0°~360° for {deg}°.","problemPromptQuadrantFromAngle":"Which quadrant contains θ={deg}°? (1~4)","cosineVisualTitle":"Cosine Similarity Vector Visual","cosineVisualHint":"The closer the vector directions, the closer cosine is to 1.","cosineVisualNow":"Current cosine similarity:","cosineVisualHigh":"High similarity","cosineVisualMedium":"Medium similarity","cosineVisualLow":"Low similarity"}}},"now":"$undefined","timeZone":"UTC","children":["$L51","$L52","$L53"]}]

Scope

Keywords

Detailed paper reviews in this category
