Ch.11

Regularization: Beyond Rote Memorization

Regularization is the key technique that keeps ML models from becoming 'rote memorizers' that only recall answers from the workbook. A model that fits the training data too tightly flounders when faced with slightly different new problems; this is overfitting. Regularization reduces the model's data error while imposing a penalty (cost) so the model does not become overly complex or contorted. In this way, the model prunes the twigs and learns only the essential patterns, becoming strong in real-world generalization.


We add a penalty for the model becoming too complex, not just for data error, so the model generalizes instead of memorizing.

[Diagram: no regularization → overfitting vs. with regularization → generalization]
① No regularization — minimizing only training loss leads to overfitting
② Add regularization — Loss = data loss + λ × penalty; larger λ shrinks weights
③ L2 — penalty $\sum_j w_j^2$ keeps weights small
④ L1 — penalty $\sum_j |w_j|$ drives some weights to zero (sparse)
⑤ Generalization — a suitable λ gives good performance on both train and validation

Regularization: loss + λ·penalty to reduce overfitting and improve generalization.


What is regularization? A 'penalty' for complexity
When a model tries to fit every bit of noise and every exception in the training data, its formula becomes wiggly and needlessly complex. Regularization computes the model's total loss not only from "how wrong the predictions are" but also from "how complex the model is (the size of its weights)", added as a penalty. To avoid that penalty, the model naturally stays simpler and cleaner.
Intuitive analogy: crammer vs principle-seeking student
A crammer who memorizes the workbook digit by digit gets 100 on practice tests but fails the real exam (new data). A student who understands principles may get a few practice problems wrong but scores steadily on the real exam. Regularization acts like a teacher, forcing the model to "prune the twigs (excessive weights) and focus on the main stem (core pattern)" so it becomes robust in practice.
Math: two 'magic' formulas (L1 and L2)
Regularization is divided into two main types by how it penalizes the model.
- L2 (Ridge): Uses the square of the weights as the penalty. The objective is $J = \text{MSE} + \lambda \sum_{j} w_j^2$. It smoothly pushes all weights down so they do not grow too large.
- L1 (Lasso): Uses the absolute value of the weights as the penalty. The objective is $J = \text{MSE} + \lambda \sum_{j} |w_j|$. It can drive less important weights exactly to zero, leaving only the key features (sparsity).
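The two objectives can be written out directly in code. This is a minimal numpy sketch (the function name `total_loss` and the toy data are illustrative); on a toy dataset that the weights fit perfectly, the data loss vanishes and only the penalty term remains, which makes the effect of λ easy to see:

```python
import numpy as np

def total_loss(w, X, y, lam, penalty="l2"):
    """Total loss J = MSE + lambda * penalty (minimal sketch)."""
    residual = X @ w - y
    mse = np.mean(residual ** 2)          # data loss
    if penalty == "l2":
        reg = np.sum(w ** 2)              # Ridge: sum of squared weights
    else:
        reg = np.sum(np.abs(w))           # Lasso: sum of absolute weights
    return mse + lam * reg

# Toy data: y = 2*x exactly, and a weight vector that fits it perfectly.
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])
w = np.array([2.0])

# MSE is 0, so only the penalty remains:
print(total_loss(w, X, y, lam=0.1, penalty="l2"))  # 0.1 * 2^2 = 0.4
print(total_loss(w, X, y, lam=0.1, penalty="l1"))  # 0.1 * |2| = 0.2
```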
Real-world examples: spam filtering and medical diagnosis
In spam filtering, giving high weight to a common word that happened to appear in training spam (e.g. "hello") can wrongly filter normal mail. Regularization prevents the model from obsessing over a single word (exploding weights). In medical diagnosis, it helps the AI avoid latching onto meaningless details like "gown color" among many patient features.
Reading the formulas: a beginner's dissection
- Total loss (L2 example): $J = \text{MSE} + \lambda \sum_{j} w_j^2$
- $J$: The "final report card" we want to make as small as possible (minimize). The smaller, the better the model.
- $\text{MSE}$: The "error score" showing how much predictions differ from the true answers.
- $\lambda$ (lambda): The "strength of the penalty" we set by hand. A larger $\lambda$ acts like a strict teacher and heavily penalizes complex models; a smaller $\lambda$ barely penalizes them.
- $\sum_{j} w_j^2$ (L2 penalty): Sum of squares of all weights. If any weight grows, this sum grows and $J$ increases, so the model tries to keep its weights small.
- L1 penalty ($\lambda \sum_{j} |w_j|$)
- Where L2 uses squares, L1 uses absolute values ($|w_j|$). L1 is like a strict tidier: it mercilessly zeros out useless weights.
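One way to see the "strict tidier" behavior numerically: the standard L1 update step (soft-thresholding) subtracts a fixed amount from every weight's magnitude and snaps anything that would cross zero to exactly zero, while an L2-style step only scales weights down and never reaches zero. A small sketch (the threshold value here is arbitrary, for illustration):

```python
import numpy as np

def soft_threshold(w, t):
    """L1 shrinkage: reduce |w| by t, snapping small weights to exactly 0."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

w = np.array([0.05, -0.3, 2.0])

# L1-style step: the tiny 0.05 weight becomes exactly 0.0.
print(soft_threshold(w, 0.1))

# L2-style shrinkage only scales weights; nothing becomes exactly zero.
print(w * (1 - 0.1))
```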
Why it matters: real-world (generalization) performance is the true goal
The real value of ML shows not during practice but when the model meets unseen (test) data. With regularization, accuracy on the training set may drop a bit, but accuracy in the wild goes up. This ability to handle unknown data well is called generalization.
The art of balance: bias–variance tradeoff
If the model is too simple, bias (underfitting) grows and it cannot solve the problem. If it is too complex, variance (overfitting) grows and it memorizes noise. The two are like a seesaw: when one goes down, the other goes up. Tuning the regularization strength $\lambda$ is the process of finding the balance point (sweet spot) of that seesaw.
The human role: finding $\lambda$ (the hyperparameter)
$\lambda$ is not learned by the model; it is a dial (hyperparameter) we must set. Turn the dial too hard and the model becomes underpowered; too soft and it becomes a memorizer again. So we must try many $\lambda$ values and choose the one that gives the best real-world performance.
Adding wings to basic models (Ridge & Lasso)
We simply add the L1 or L2 penalty to the usual linear regression or logistic regression formula.
- Linear regression + L2 = Ridge regression
- Linear regression + L1 = Lasso regression
The computer then minimizes the total loss (including the penalty) via gradient descent and adjusts the weights automatically.
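Gradient descent works here, but ridge ("linear regression + L2") even has a closed-form solution: the penalty simply adds $\lambda$ to the diagonal of the normal equations, $w = (X^\top X + \lambda I)^{-1} X^\top y$. A minimal numpy sketch under that formula (real projects would typically reach for a library implementation such as scikit-learn's Ridge/Lasso; the bias term is usually left unpenalized, omitted here for brevity):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge regression: w = (X^T X + lam*I)^-1 X^T y."""
    n_features = X.shape[1]
    A = X.T @ X + lam * np.eye(n_features)  # L2 penalty adds lam to the diagonal
    return np.linalg.solve(A, X.T @ y)

# Synthetic data with known true weights.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
true_w = np.array([3.0, 0.0, -2.0])
y = X @ true_w + 0.1 * rng.normal(size=50)

w_small = ridge_fit(X, y, lam=0.1)    # mild penalty: close to the true weights
w_big = ridge_fit(X, y, lam=1000.0)   # heavy penalty: weights shrunk toward 0
print(w_small)
print(w_big)
```

Raising λ from 0.1 to 1000 visibly shrinks every coefficient, which is exactly the "strict teacher" effect described above.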
A 3-step pipeline in practice
In practice, regularization is applied as follows.
1. Split the data: Divide data into [train / validation / test].
2. Run a $\lambda$ audition: Try $\lambda$ values such as 0.01, 0.1, 1, and 10, and train one model per value on the training set.
3. Pick the winner and deploy: Evaluate each model on the validation set and choose the $\lambda$ with the best score as the final model. Then evaluate once on the test set for the final performance.
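The three steps above can be sketched end to end in pure numpy (the closed-form ridge fit, the λ grid, and the split sizes are all illustrative choices, not a prescribed recipe):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge: w = (X^T X + lam*I)^-1 X^T y (illustrative)."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def mse(w, X, y):
    return np.mean((X @ w - y) ** 2)

# 1. Split the data into train / validation / test.
rng = np.random.default_rng(42)
X = rng.normal(size=(120, 5))
true_w = rng.normal(size=5)
y = X @ true_w + 0.5 * rng.normal(size=120)
X_tr, y_tr = X[:80], y[:80]
X_val, y_val = X[80:100], y[80:100]
X_te, y_te = X[100:], y[100:]

# 2. Lambda audition: train one model per candidate value.
candidates = [0.01, 0.1, 1.0, 10.0]
models = {lam: ridge_fit(X_tr, y_tr, lam) for lam in candidates}

# 3. Pick the winner on the validation set, then report the test score once.
best_lam = min(candidates, key=lambda lam: mse(models[lam], X_val, y_val))
print("best lambda:", best_lam)
print("test MSE:", mse(models[best_lam], X_te, y_te))
```

Note that the test set is touched exactly once, at the very end; choosing λ on the test set would turn it into a second training signal.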
Regularization reduces overfitting by adding a penalty to the loss.
Total loss = data loss + λ × penalty. Larger λ shrinks weights (simpler model). L2 uses the sum of squared weights; L1 uses the sum of absolute values and can yield sparse weights. In practice, Ridge (L2) and Lasso (L1) are applied to linear and logistic regression, and λ is chosen by cross-validation.