Ch.11

Regularization: Beyond Rote Memorization

Regularization is the key technique that keeps ML models from becoming 'rote memorizers' that only recall answers from the workbook. A model that fits the training data too tightly flounders when faced with slightly different new problems; this is overfitting. Regularization reduces the model's data error while imposing a penalty (cost) so the model does not become overly complex or contorted. In this way, the model prunes the twigs and learns only the essential patterns, becoming strong in real-world generalization.


We add a penalty for the model becoming too complex, not just for data error, so the model generalizes instead of memorizing.

[Diagram: no regularization → overfitting vs. with regularization → generalization]
① No regularization — minimizing only training loss leads to overfitting
② Add regularization — Loss = data loss + λ × penalty; larger λ shrinks weights
③ L2 — penalty $\sum_j w_j^2$ keeps weights small
④ L1 — penalty $\sum_j |w_j|$ drives some weights to zero (sparse)
⑤ Generalization — a suitable λ gives good performance on both train and validation

Regularization: loss + λ·penalty to reduce overfitting and improve generalization.


What is regularization? A 'penalty' for complexity
When a model tries to fit every bit of noise and every exception in the training data, its formula becomes wiggly and needlessly complex. Regularization computes the model's total loss not only from "how wrong the predictions are" but also from "how complex the model is (the size of its weights)", added as a penalty. To avoid that penalty, the model naturally stays simpler and cleaner.
Intuitive analogy: crammer vs principle-seeking student
A crammer who memorizes the workbook digit by digit gets 100 on practice tests but fails the real exam (new data). A student who understands principles may get a few practice problems wrong but scores steadily on the real exam. Regularization acts like a teacher, forcing the model to "prune the twigs (excessive weights) and focus on the main stem (core pattern)" so it becomes robust in practice.
Math: two 'magic' formulas (L1 and L2)
Regularization is divided into two main types by how it penalizes the model.
- L2 (Ridge): Uses the square of the weights as the penalty. The objective is $J = \text{MSE} + \lambda \sum_{j} w_j^2$. It smoothly pushes all weights down so they do not grow too large.
- L1 (Lasso): Uses the absolute value of the weights as the penalty. The objective is $J = \text{MSE} + \lambda \sum_{j} |w_j|$. It can drive less important weights exactly to zero, leaving only the key features (sparsity).
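The two objectives can be written out directly in code. This is a minimal numpy sketch (the function name `total_loss` and the toy data are illustrative); on a toy dataset that the weights fit perfectly, the data loss vanishes and only the penalty term remains, which makes the effect of λ easy to see:

```python
import numpy as np

def total_loss(w, X, y, lam, penalty="l2"):
    """Total loss J = MSE + lambda * penalty (minimal sketch)."""
    residual = X @ w - y
    mse = np.mean(residual ** 2)          # data loss
    if penalty == "l2":
        reg = np.sum(w ** 2)              # Ridge: sum of squared weights
    else:
        reg = np.sum(np.abs(w))           # Lasso: sum of absolute weights
    return mse + lam * reg

# Toy data: y = 2*x exactly, and a weight vector that fits it perfectly.
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])
w = np.array([2.0])

# MSE is 0, so only the penalty remains:
print(total_loss(w, X, y, lam=0.1, penalty="l2"))  # 0.1 * 2^2 = 0.4
print(total_loss(w, X, y, lam=0.1, penalty="l1"))  # 0.1 * |2| = 0.2
```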
Real-world examples: spam filtering and medical diagnosis
In spam filtering, giving high weight to a common word that happened to appear in training spam (e.g. "hello") can wrongly filter normal mail. Regularization prevents the model from obsessing over a single word (exploding weights). In medical diagnosis, it helps the AI avoid latching onto meaningless details like "gown color" among many patient features.
Reading the formulas: a beginner's dissection
- Total loss (L2 example): $J = \text{MSE} + \lambda \sum_{j} w_j^2$
- $J$: The "final report card" we want to make as small as possible (minimize). The smaller, the better the model.
- $\text{MSE}$: The "error score" showing how much predictions differ from the true answers.
- $\lambda$ (lambda): The "strength of the penalty" we set by hand. A larger $\lambda$ acts like a strict teacher and heavily penalizes complex models; a smaller $\lambda$ barely penalizes them.
- $\sum_{j} w_j^2$ (L2 penalty): Sum of squares of all weights. If any weight grows, this sum grows and $J$ increases, so the model tries to keep its weights small.
- L1 penalty ($\lambda \sum_{j} |w_j|$)
- Where L2 uses squares, L1 uses absolute values ($|w_j|$). L1 is like a strict tidier: it mercilessly zeros out useless weights.
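One way to see the "strict tidier" behavior numerically: the standard L1 update step (soft-thresholding) subtracts a fixed amount from every weight's magnitude and snaps anything that would cross zero to exactly zero, while an L2-style step only scales weights down and never reaches zero. A small sketch (the threshold value here is arbitrary, for illustration):

```python
import numpy as np

def soft_threshold(w, t):
    """L1 shrinkage: reduce |w| by t, snapping small weights to exactly 0."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

w = np.array([0.05, -0.3, 2.0])

# L1-style step: the tiny 0.05 weight becomes exactly 0.0.
print(soft_threshold(w, 0.1))

# L2-style shrinkage only scales weights; nothing becomes exactly zero.
print(w * (1 - 0.1))
```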
Why it matters: real-world (generalization) performance is the true goal
The real value of ML shows not during practice but when the model meets unseen (test) data. With regularization, accuracy on the training set may drop a bit, but accuracy in the wild goes up. This ability to handle unknown data well is called generalization.
The art of balance: bias–variance tradeoff
If the model is too simple, bias (underfitting) grows and it cannot solve the problem. If it is too complex, variance (overfitting) grows and it memorizes noise. The two are like a seesaw: when one goes down, the other goes up. Tuning the regularization strength $\lambda$ is the process of finding the balance point (sweet spot) of that seesaw.
The human role: finding $\lambda$ (the hyperparameter)
$\lambda$ is not learned by the model; it is a dial (hyperparameter) we must set. Turn the dial too hard and the model becomes underpowered; too soft and it becomes a memorizer again. So we must try many $\lambda$ values and choose the one that gives the best real-world performance.
Adding wings to basic models (Ridge & Lasso)
We simply add the L1 or L2 penalty to the usual linear regression or logistic regression formula.
- Linear regression + L2 = Ridge regression
- Linear regression + L1 = Lasso regression
The computer then minimizes the total loss (including the penalty) via gradient descent and adjusts the weights automatically.
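Gradient descent works here, but ridge ("linear regression + L2") even has a closed-form solution: the penalty simply adds $\lambda$ to the diagonal of the normal equations, $w = (X^\top X + \lambda I)^{-1} X^\top y$. A minimal numpy sketch under that formula (real projects would typically reach for a library implementation such as scikit-learn's Ridge/Lasso; the bias term is usually left unpenalized, omitted here for brevity):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge regression: w = (X^T X + lam*I)^-1 X^T y."""
    n_features = X.shape[1]
    A = X.T @ X + lam * np.eye(n_features)  # L2 penalty adds lam to the diagonal
    return np.linalg.solve(A, X.T @ y)

# Synthetic data with known true weights.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
true_w = np.array([3.0, 0.0, -2.0])
y = X @ true_w + 0.1 * rng.normal(size=50)

w_small = ridge_fit(X, y, lam=0.1)    # mild penalty: close to the true weights
w_big = ridge_fit(X, y, lam=1000.0)   # heavy penalty: weights shrunk toward 0
print(w_small)
print(w_big)
```

Raising λ from 0.1 to 1000 visibly shrinks every coefficient, which is exactly the "strict teacher" effect described above.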
A 3-step pipeline in practice
In practice, regularization is applied as follows.
1. Split the data: Divide data into [train / validation / test].
2. Run a $\lambda$ audition: Try $\lambda$ values such as 0.01, 0.1, 1, and 10, and train one model per value on the training set.
3. Pick the winner and deploy: Evaluate each model on the validation set and choose the $\lambda$ with the best score as the final model. Then evaluate once on the test set for the final performance.
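The three steps above can be sketched end to end in pure numpy (the closed-form ridge fit, the λ grid, and the split sizes are all illustrative choices, not a prescribed recipe):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge: w = (X^T X + lam*I)^-1 X^T y (illustrative)."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def mse(w, X, y):
    return np.mean((X @ w - y) ** 2)

# 1. Split the data into train / validation / test.
rng = np.random.default_rng(42)
X = rng.normal(size=(120, 5))
true_w = rng.normal(size=5)
y = X @ true_w + 0.5 * rng.normal(size=120)
X_tr, y_tr = X[:80], y[:80]
X_val, y_val = X[80:100], y[80:100]
X_te, y_te = X[100:], y[100:]

# 2. Lambda audition: train one model per candidate value.
candidates = [0.01, 0.1, 1.0, 10.0]
models = {lam: ridge_fit(X_tr, y_tr, lam) for lam in candidates}

# 3. Pick the winner on the validation set, then report the test score once.
best_lam = min(candidates, key=lambda lam: mse(models[lam], X_val, y_val))
print("best lambda:", best_lam)
print("test MSE:", mse(models[best_lam], X_te, y_te))
```

Note that the test set is touched exactly once, at the very end; choosing λ on the test set would turn it into a second training signal.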
Regularization reduces overfitting by adding a penalty to the loss.
Total loss = data loss + λ × penalty. Larger λ shrinks weights (simpler model). L2 uses the sum of squared weights; L1 uses the sum of absolute values and can yield sparse weights. In practice, Ridge (L2) and Lasso (L1) are applied to linear and logistic regression, and λ is chosen by cross-validation.