Ch.11

Cross Validation: Practice Tests and the Real Exam

Cross validation is essential so that models do not become "frogs in a well"—only good at the exercises they memorized. Just as students use practice tests to check their real level and the final exam to confirm it, we do not score machine learning models only on training data; we evaluate them on validation and test data they have not seen. This chapter covers cross validation (Hold-out, K-Fold, etc.) and how to make performance estimates reliable.


Split data into train/validation/test; in K-Fold, take turns validating and estimate performance by the mean score.

[Diagram] 5-Fold cross validation: five folds take turns as the validation set while the other four are used for training; the five validation scores S1–S5 are averaged into the mean μ.

Cross validation: practice tests (validation) to estimate skill, final exam (test) to confirm.


What is cross validation? "Don’t score with the same problems they practiced" — If a math exam contained only problems from the workbook, we could not tell whether students understood the ideas or had overfit by memorizing answers. The same holds for ML: testing on training data always looks good. So we split data into train, validation, and test, and evaluate the model strictly and fairly on data it has never seen. That process is cross validation.
Three roles when splitting data — The ideal split and role of each part are as follows.
- Training (Train) — Metaphor: textbook / practice set. Main data used to learn patterns and update weights. Typical ratio: ~70–80%.
- Validation — Metaphor: practice exam. Used mid-learning to check performance and tune hyperparameters. Typical ratio: ~10–15%.
- Test — Metaphor: final exam. Used only once after all learning to report final performance. Typical ratio: ~10–15%.
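The three-way split above can be sketched in a few lines of plain Python. The function name `split_data` and the 70/15/15 ratios are illustrative, not from any particular library:

```python
import random

def split_data(data, train=0.7, val=0.15, seed=42):
    """Shuffle, then cut into train / validation / test by the given ratios."""
    items = list(data)
    random.Random(seed).shuffle(items)  # fixed seed for a reproducible split
    n = len(items)
    n_train = int(n * train)
    n_val = int(n * val)
    return (items[:n_train],                  # training set (~70%)
            items[n_train:n_train + n_val],   # validation set (~15%)
            items[n_train + n_val:])          # test set (~15%)

train_set, val_set, test_set = split_data(range(100))
print(len(train_set), len(val_set), len(test_set))  # 70 15 15
```

Shuffling before splitting matters: if the data is sorted (say, by date or class), an unshuffled cut can give the three sets very different distributions.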
How to split? Hold-out and K-Fold — There are two main approaches. Hold-out is like cutting a pizza once: you split the data once into train and test. It is simple and fast, but if by chance the "easy" part ends up in the test set, the estimate can be overly optimistic. K-Fold cross validation divides data into K segments and uses each in turn as the "practice exam" (validation) and the rest for training, so every sample is validated once and the estimate is more stable and objective.
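The "every sample is validated once" property of K-Fold can be shown with a small index generator. This is a minimal sketch (the helper name `k_fold_indices` is made up for illustration), not a library API:

```python
def k_fold_indices(n, k):
    """Yield (train_idx, val_idx) pairs; each sample lands in val exactly once."""
    # Distribute n samples over k folds as evenly as possible.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))          # this fold validates
        train_idx = [i for i in range(n)                    # all other folds train
                     if i < start or i >= start + size]
        yield train_idx, val_idx
        start += size

for train_idx, val_idx in k_fold_indices(10, 5):
    print(val_idx)  # [0, 1], then [2, 3], ... each index validated once
```

Running all K rounds and averaging the scores gives the stable estimate described above; the trade-off is K training runs instead of one.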
K-Fold final score in a formula — After K-Fold you have K "exam" scores. The model’s final performance is the average of these K scores.
* Mean score formula: $\bar{S} = \frac{1}{K}\sum_{k=1}^{K} S_k$
* Symbols: $K$ = number of folds (number of validation runs); $S_k$ = score when the $k$-th fold was used for validation (e.g. accuracy or MSE). $\sum_{k=1}^{K} S_k$ means $S_1 + S_2 + \cdots + S_K$, so $\bar{S}$ is the mean of the $K$ validation scores and is used as the final performance estimate.
* Numeric example: with 5-Fold, if the five scores are 80, 85, 90, 80, 85, then $\bar{S} = (80+85+90+80+85)/5 = 84$.
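The numeric example is a one-liner in code, using the same five fold scores:

```python
scores = [80, 85, 90, 80, 85]           # fold scores S_1 ... S_5
mean_score = sum(scores) / len(scores)  # S-bar = (1/K) * sum of S_k
print(mean_score)  # 84.0
```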

Why it matters

Escaping the "frog in a well" (detecting overfitting) — If the model scores 99 on training data but 50 on unseen validation data, it is almost certainly overfitting (memorizing rather than understanding). Cross validation acts as a filter to catch such models before they fail in production.
Proving real-world performance (generalization) — Companies adopt AI to predict the future, not to replay the past. Models validated with K-Fold and a held-out test set are more likely to perform well on truly new data.
Finding the best setup (hyperparameters and model choice) — When choosing tree depth, K in K-NN, learning rate, etc., we run multiple settings on the validation set and pick the best. Because the test set is kept separate, we can compare models fairly.
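The search loop above amounts to "score every candidate on the validation set, keep the best." Here is a minimal sketch; `validation_score` is a hypothetical stand-in (with made-up numbers) for "train with this setting, then score on the validation set":

```python
def validation_score(max_depth):
    # Hypothetical results: validation accuracy per candidate tree depth.
    scores = {2: 0.81, 4: 0.88, 8: 0.86, 16: 0.79}  # illustrative numbers only
    return scores[max_depth]

candidates = [2, 4, 8, 16]
best_depth = max(candidates, key=validation_score)
print(best_depth)  # 4 -- note the test set is never touched during this search
```

The key discipline is that the test set plays no role here; it is only opened once, after the winner is fixed.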

How it is used

Data scientist routine (production pipeline) — In practice, the first step is to set aside about 10% of the data as the test set and lock it away. The rest is used for training and K-Fold validation until the best model is ready; then the test set is used once to report: "Our model’s final accuracy is 92%."
Fair algorithm comparison — When asking "Is logistic regression or random forest better for our churn prediction?", the same K-Fold setup is applied to both; the algorithm with the higher mean validation score ($\bar{S}$) is chosen for deployment.
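That comparison reduces to computing $\bar{S}$ for each model on the same folds and picking the larger. The 5-fold scores below are made-up numbers purely for illustration:

```python
def mean_kfold_score(scores):
    """Mean of the K fold scores -- the S-bar from the formula above."""
    return sum(scores) / len(scores)

# Illustrative 5-fold validation scores for two models on the SAME folds.
logreg_scores = [0.82, 0.85, 0.80, 0.84, 0.83]  # hypothetical
forest_scores = [0.88, 0.86, 0.90, 0.87, 0.89]  # hypothetical

winner = ("random forest"
          if mean_kfold_score(forest_scores) > mean_kfold_score(logreg_scores)
          else "logistic regression")
print(winner)  # random forest
```

Using the same folds for both models removes split luck from the comparison: any score difference comes from the algorithms, not from one model drawing an easier validation set.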