Ch.09

Cross Validation: Practice Tests and the Real Exam

Cross validation is essential so that models do not become "frogs in a well"—only good at the exercises they memorized. Just as students use practice tests to check their real level and the final exam to confirm it, we do not score machine learning models only on training data; we evaluate them on validation and test data they have not seen. This chapter covers cross validation (Hold-out, K-Fold, etc.) and how to make performance estimates reliable.


Split data into train/validation/test; in K-Fold, take turns validating and estimate performance by the mean score.

Figure: 5-Fold cross validation — Folds 1–5 each take one turn as the validation fold while the rest train; the validation scores S1–S5 are averaged into the mean μ.

Cross validation: practice tests (validation) to estimate skill, final exam (test) to confirm.


What is cross validation? "Don’t score with the same problems they practiced" — If a math exam contained only problems from the workbook, we could not tell whether students understood the ideas or had overfit by memorizing answers. The same holds for ML: testing on training data always looks good. So we split data into train, validation, and test, and evaluate the model strictly and fairly on data it has never seen. That process is cross validation.
Three roles when splitting data — The ideal split and role of each part are as follows.
| Data type | Metaphor | Role and use | Typical ratio |
| --- | --- | --- | --- |
| Training (Train) | Textbook / practice set | Main data used to learn patterns and update weights. | ~70–80% |
| Validation | Practice exam | Used during training to check performance and tune hyperparameters. | ~10–15% |
| Test | Final exam | Used only once, after all training, to report final performance. | ~10–15% |
How to split? Hold-out and K-Fold — There are two main approaches. Hold-out is like cutting a pizza once: you split the data once into train and test. It is simple and fast, but if by chance the "easy" part ends up in the test set, the estimate can be overly optimistic. K-Fold cross validation divides data into K segments and uses each in turn as the "practice exam" (validation) and the rest for training, so every sample is validated once and the estimate is more stable and objective.
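The contrast between the two strategies can be seen directly in code. Below is a minimal sketch using scikit-learn (an assumed library choice, not prescribed by the text) on a toy dataset of 10 samples: Hold-out makes one cut, while K-Fold rotates the validation role through every sample.

```python
# Sketch (assumed setup): one Hold-out split vs. 5-Fold splits.
from sklearn.model_selection import train_test_split, KFold
import numpy as np

X = np.arange(10).reshape(-1, 1)  # 10 samples, one feature each

# Hold-out: a single cut, 80% train / 20% test
X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)
print("Hold-out test samples:", X_test.ravel())

# K-Fold: every sample serves as validation exactly once
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for i, (train_idx, val_idx) in enumerate(kf.split(X), start=1):
    print(f"Fold {i}: validate on {val_idx}, train on {len(train_idx)} samples")
```

Note that across the five folds the validation indices cover all 10 samples with no repeats, which is exactly why the K-Fold estimate is more stable than a single lucky (or unlucky) cut.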
K-Fold final score in a formula — After K-Fold you have K "exam" scores. The model’s final performance is the average of these K scores.
* Mean score formula: $\bar{S} = \frac{1}{K}\sum_{k=1}^{K} S_k$
* Symbols: $K$ = number of folds (number of validation runs); $S_k$ = score when the $k$-th fold is used for validation (e.g. accuracy or MSE). $\sum_{k=1}^{K} S_k$ means $S_1 + S_2 + \cdots + S_K$, so $\bar{S}$ is the mean of the $K$ validation scores and serves as the final performance estimate.
* Numeric example: with 5-Fold, if the five scores are 80, 85, 90, 80, 85, then $\bar{S} = (80 + 85 + 90 + 80 + 85)/5 = 84$.
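The numeric example above is a one-line computation, shown here just to make the formula concrete:

```python
# Mean of the K validation scores, matching the 5-Fold example in the text.
scores = [80, 85, 90, 80, 85]           # S_1 ... S_5
mean_score = sum(scores) / len(scores)  # S-bar = (1/K) * sum of S_k
print(mean_score)  # -> 84.0
```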
Escaping the "frog in a well" (detecting overfitting) — If the model scores 99 on training data but 50 on unseen validation data, it is almost certainly overfitting (memorizing rather than understanding). Cross validation acts as a filter to catch such models before they fail in production.
Proving real-world performance (generalization) — Companies adopt AI to predict the future, not to replay the past. Models validated with K-Fold and a held-out test set are more likely to perform well on truly new data.
Finding the best setup (hyperparameters and model choice) — When choosing tree depth, K in K-NN, learning rate, etc., we run multiple settings on the validation set and pick the best. Because the test set is kept separate, we can compare models fairly.
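As a sketch of that tuning loop, the snippet below (assuming scikit-learn and a synthetic dataset; the candidate depths are illustrative) tries several tree depths on a held-out validation split and keeps the best:

```python
# Sketch, assuming scikit-learn: pick a hyperparameter by validation score.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0)

best_depth, best_score = None, -1.0
for depth in [2, 4, 8, None]:            # candidate hyperparameter values
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)          # learn on the training split only
    score = model.score(X_val, y_val)    # accuracy on the validation split
    if score > best_score:
        best_depth, best_score = depth, score
print("best max_depth:", best_depth, "validation accuracy:", best_score)
```

The test set plays no role in this loop; it stays locked away so the final comparison remains fair.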
Data scientist routine (production pipeline) — In practice, the first step is to set aside about 10% of the data as the test set and lock it away. The rest is used for training and K-Fold validation until the best model is ready; then the test set is used once to report: "Our model’s final accuracy is 92%."
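That routine can be sketched end to end. Assuming scikit-learn and a synthetic dataset (the model choice here is illustrative), the key discipline is that the test set is split off first and scored exactly once at the end:

```python
# Sketch of the production routine: lock away ~10% as the test set,
# run K-Fold on the rest, then report the test score once.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

X, y = make_classification(n_samples=500, random_state=0)
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0)     # test set set aside first

model = LogisticRegression(max_iter=1000)
cv_scores = cross_val_score(model, X_dev, y_dev, cv=5)  # 5-Fold on the rest
print("mean CV accuracy:", cv_scores.mean())

model.fit(X_dev, y_dev)                       # final training on all dev data
print("final test accuracy:", model.score(X_test, y_test))  # used once
```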
Fair algorithm comparison — When asking "Is logistic regression or random forest better for our churn prediction?", the same K-Fold setup is applied to both; the algorithm with the higher mean validation score ($\bar{S}$) is chosen for deployment.
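A minimal sketch of that comparison, assuming scikit-learn and a synthetic stand-in for churn data: passing the same `KFold` object to both algorithms guarantees they are scored on identical splits.

```python
# Sketch, assuming scikit-learn: compare two algorithms on shared folds.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=500, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=0)  # shared fold setup

for name, model in [
    ("logistic regression", LogisticRegression(max_iter=1000)),
    ("random forest", RandomForestClassifier(random_state=0)),
]:
    scores = cross_val_score(model, X, y, cv=cv)  # same splits for both
    print(f"{name}: mean validation score = {scores.mean():.3f}")
```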
Summary — Cross validation starts from the premise that we must not measure performance only on the data used for training. Just as students take practice tests before the real exam, in machine learning we cannot tell if the model has "memorized the exercises" if we score only on training data. So we split data into train, validation, and test. The training set is used for the model to learn patterns; the validation set is used to check performance during learning or to choose hyperparameters; the test set is used only once after all learning to report final performance before deployment. The main split strategies are Hold-out and K-Fold. Hold-out splits the data once into train and test (or validation). K-Fold divides data into K segments, uses one segment at a time for validation and the rest for training. With K-Fold every sample is used for validation once, so the performance estimate is more stable than with a single split.