Ch.05
Loss Function (MSE · Cross-Entropy · R²): Measuring Prediction Error
A loss function turns how wrong the model is into one number. For regression, we often use mean squared error (MSE) from the gap between the actual $y$ and the prediction $\hat{y}$, and we also look at $R^2$ (coefficient of determination) to understand how much variation the model explains. For classification, we measure how far predicted class probabilities are from the truth with cross-entropy. The diagram below shows MSE as a regression example of how loss decreases.
Figure — regression loss example: MSE is the average of squared errors between prediction $\hat{y}$ and actual $y$; the smaller the loss, the better the line fits the data. (For classification we use cross-entropy.)
Regression: MSE
We need a loss that summarizes error in one number.
- Residual $e_i = y_i - \hat{y}_i$ — difference between actual $y_i$ and prediction $\hat{y}_i$.
- SSE — sum of $e_i^2$ over all points (sum of squared errors): $\mathrm{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$.
- MSE — SSE divided by the number of points (mean squared error).
$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$. Smaller MSE means a better fit.
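Here is a minimal NumPy sketch of these three quantities; the data points are made up for illustration:

```python
import numpy as np

# Made-up example data: actual values y_i and predictions y-hat_i.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 6.0, 9.5])

residuals = y_true - y_pred        # e_i = y_i - y-hat_i
sse = np.sum(residuals ** 2)       # SSE: sum of squared errors
mse = sse / len(y_true)            # MSE: SSE / n

print(residuals.sum())             # raw residuals can cancel; squares cannot
print(sse, mse)                    # 1.75, 0.4375
```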
Why square?
- Residuals $+2$ and $-2$ both mean "off by 2"; raw sums can cancel.
- Squaring keeps values positive and compares magnitude only.
- Large errors get a bigger penalty, so the model avoids large mistakes.
Linear regression
The line $\hat{y} = wx + b$ from Ch.03 is "best" when MSE (or SSE) is minimized—we choose the slope $w$ and intercept $b$ that minimize average squared error.
Gradient descent updates $w$ and $b$ step by step in the direction that lowers MSE.
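A minimal sketch of that update loop on made-up data (the variable names and learning rate are illustrative, not fixed by the chapter):

```python
import numpy as np

# Toy data that roughly follows y = 2x + 1 (made up for this sketch).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

w, b = 0.0, 0.0   # start from an arbitrary line
lr = 0.01         # learning rate (step size)

for _ in range(5000):
    y_hat = w * x + b                       # current predictions
    grad_w = -2 * np.mean(x * (y - y_hat))  # d(MSE)/dw
    grad_b = -2 * np.mean(y - y_hat)        # d(MSE)/db
    w -= lr * grad_w                        # step downhill in MSE
    b -= lr * grad_b

print(w, b)  # approaches roughly 2 and 1
```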
Regression: MSE is the average of squared residuals
MSE is an error score made by squaring residuals and taking the average. As predictions get closer to the true values, residuals shrink and MSE becomes smaller.
Unpacking MSE
- $i$ — sample index.
- $y_i$ — actual value at that point.
- $\hat{y}_i$ — predicted value.
- $e_i = y_i - \hat{y}_i$ — residual.
- $e_i^2 = (y_i - \hat{y}_i)^2$ — squared error at that point.
- $\sum_{i=1}^{n} e_i^2$ — sum over $n$ points = SSE.
- $\frac{1}{n}\sum_{i=1}^{n} e_i^2$ — average = MSE.
Closer predictions → smaller residuals and smaller MSE.
Classification: cross-entropy
Cross-entropy measures how wrong the predicted probability of the true class is.
For binary classification, the per-sample loss is $L = -\bigl[\,y \log p + (1 - y)\log(1 - p)\,\bigr]$, which is unpacked in detail below.
Unpacking binary cross-entropy
- $y \in \{0, 1\}$ — label.
- $p$ — predicted probability of class 1 (between 0 and 1).
- $\log$ — usually the natural log.
When $y = 1$ — the second term vanishes, so $L = -\log p$. Higher $p$ means lower loss.
When $y = 0$ — the first term vanishes, so $L = -\log(1 - p)$. $1 - p$ is the probability of class 0.
The two branches $y \log p$ and $(1 - y)\log(1 - p)$ mean that only one branch is active per sample, pushing probability toward the true class.
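A minimal sketch of the binary case in NumPy (the clipping constant is an assumption to avoid $\log 0$):

```python
import numpy as np

def binary_cross_entropy(y, p, eps=1e-12):
    """Per-sample loss L = -[y*log(p) + (1-y)*log(1-p)]."""
    p = np.clip(p, eps, 1 - eps)  # guard against log(0)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

print(binary_cross_entropy(1, 0.9))  # ~0.105: confident and correct -> small loss
print(binary_cross_entropy(1, 0.1))  # ~2.303: confident and wrong -> large loss
print(binary_cross_entropy(0, 0.1))  # ~0.105: 1 - p = 0.9 is the prob of class 0
```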
Multi-class — for true class $c$, the per-sample loss is usually $L = -\log p_c$ (typically paired with softmax probabilities). When the predicted probability $p_c$ of the true class is low, the loss becomes large, and training pushes that probability upward.
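A short sketch of softmax plus this loss, on made-up logits for a 3-class problem:

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([1.0, 0.5, 2.0])  # made-up scores; suppose the true class is index 2
p = softmax(logits)
loss = -np.log(p[2])                # per-sample loss: -log p_c
print(p, loss)                      # p[2] ~ 0.63, loss ~ 0.46
```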
$R^2$ (Coefficient of Determination): improvement over "predicting the mean"
In regression, MSE/RMSE measure error magnitude, but if you want to go one step further and ask how much better the model explains the variation compared to the baseline (mean prediction), look at $R^2$ as well.
First, the symbols
- $y_i$: the true value of the $i$-th data point.
- $\hat{y}_i$: the predicted value of the $i$-th data point.
- $\bar{y}$: the mean of all $y_i$.
- SSE = $\sum_i (y_i - \hat{y}_i)^2$: model's squared error sum (smaller is better)
- SST = $\sum_i (y_i - \bar{y})^2$: baseline squared error sum (using only the mean)
Quick calculation steps
1. Compute the mean $\bar{y}$.
2. Compute the baseline error $\mathrm{SST} = \sum_i (y_i - \bar{y})^2$.
3. Compute the model error $\mathrm{SSE} = \sum_i (y_i - \hat{y}_i)^2$.
4. $R^2 = 1 - \dfrac{\mathrm{SSE}}{\mathrm{SST}}$.
Interpretation guide
- $R^2 \approx 1$: SSE $\approx$ 0 → almost perfect predictions
- $R^2 = 0$: SSE = SST → about the same as predicting the mean
- $R^2 < 0$: SSE > SST → worse than the mean baseline
So $R^2$ tells you the fraction of reduction in squared error compared to the baseline.
A short numeric example
Let the true values be $y = (0, 2, 4)$, so $\bar{y} = 2$.
- Baseline (mean only): $\mathrm{SST} = (0-2)^2 + (2-2)^2 + (4-2)^2 = 8$
- Model predictions $\hat{y} = (1, 2, 3)$: $\mathrm{SSE} = 1^2 + 0^2 + 1^2 = 2$
Therefore $R^2 = 1 - \frac{2}{8} = 0.75$
→ The model reduced squared error by about 75% compared to predicting the mean.
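The same steps in NumPy, checking the numbers above:

```python
import numpy as np

y_true = np.array([0.0, 2.0, 4.0])  # true values from the example
y_pred = np.array([1.0, 2.0, 3.0])  # model predictions from the example

sst = np.sum((y_true - y_true.mean()) ** 2)  # baseline error (mean only)
sse = np.sum((y_true - y_pred) ** 2)         # model error
r2 = 1 - sse / sst
print(sst, sse, r2)  # 8.0, 2.0, 0.75
```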
Important: don't rely on $R^2$ alone
- Since $R^2$ is a ratio, values may not be directly comparable across different datasets.
- In practice, report RMSE + $R^2$ together (error size + explanatory power).
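A small sketch of that reporting habit, on made-up numbers:

```python
import numpy as np

y_true = np.array([10.0, 20.0, 30.0, 40.0])  # made-up values
y_pred = np.array([12.0, 18.0, 33.0, 39.0])

rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))  # error size, in the units of y
r2 = 1 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)
print(f"RMSE = {rmse:.2f}, R^2 = {r2:.3f}")      # RMSE = 2.12, R^2 = 0.964
```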
Why it matters
Learning direction — In regression with MSE loss, the model updates in directions that reduce MSE—a clear objective.
MSE: smooth and easy to optimize — Squared error is smooth and easy to differentiate, so gradient descent works well.
RMSE — MSE is in squared units; $\sqrt{\mathrm{MSE}}$ (RMSE) restores the same units as $y$ for interpretation.
Match loss to task — Continuous targets fit MSE; class probabilities fit cross-entropy, which aligns with maximum likelihood. Ch.05 logistic regression connects sigmoid outputs to this loss.
How it is used
Regression training — Train with MSE for prices, temperatures, etc.
Model comparison (regression) — Smaller MSE means better fit.
Deep learning regression — Neural nets predicting numbers often use MSE at the output.
Classification — Logistic regression, softmax classifiers, and neural classifiers typically minimize cross-entropy.