Ch.05

Loss Function (MSE · Cross-Entropy · R²): Measuring Prediction Error

A loss function turns how wrong the model is into one number. For regression, we often use mean squared error (MSE), computed from the gap between the prediction $\hat y$ and the actual value $y$, and we also look at $R^2$ (the coefficient of determination) to see how much of the variation the model explains. For classification, we measure how far predicted class probabilities are from the truth with cross-entropy. The diagram below uses MSE to show how a regression loss decreases.


Regression loss example: MSE is the average of squared errors between the prediction $\hat y$ and the actual $y$. (For classification we use cross-entropy.)

Figure: each squared error drawn as a square whose side length is |residual|; with 5 points, SSE = 4.25, so MSE = SSE ÷ 5 = 0.85.

$\text{MSE} = \frac{1}{n}\sum_i (y_i - \hat y_i)^2$ — the smaller the loss, the better the line fits the data.


Regression: MSE
We need a loss that summarizes error in one number.
- Residual — the difference between the actual $y$ and the prediction $\hat y$.
- SSE — the sum of $(y_i - \hat y_i)^2$ over all points (sum of squared errors).
- MSE — SSE divided by the number of points $n$ (mean squared error).
$\text{MSE} = \frac{1}{n}\sum_i (y_i - \hat y_i)^2 = \text{SSE}/n$. Smaller MSE means a better fit.
Why square?
- Residuals $+2$ and $-2$ both mean "off by 2"; raw sums can cancel.
- Squaring keeps values positive and compares magnitude only.
- Large errors get a bigger penalty, so the model avoids large mistakes.
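As a quick numeric check of the cancellation point above, here is a minimal sketch (the residual values are made up for illustration):

```python
import numpy as np

# Two hypothetical residuals: off by 2 in opposite directions.
residuals = np.array([+2.0, -2.0])

print(residuals.sum())          # 0.0 -- raw residuals cancel, hiding the error
print((residuals ** 2).mean())  # 4.0 -- squaring keeps every error positive
                                #        and penalizes large errors more
```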
Linear regression
The line $\hat y = wx + b$ from Ch.03 is "best" when MSE (or SSE) is minimized: we choose $w$ and $b$ that minimize the average squared error.
Gradient descent updates $w$ and $b$ step by step in the direction that lowers MSE.
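A minimal sketch of that update loop, assuming a made-up toy dataset, learning rate, and iteration count (none of these come from the chapter):

```python
import numpy as np

# Toy data roughly following y = 2x + 1 (assumed for illustration).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

w, b = 0.0, 0.0   # initial guesses
lr = 0.02         # learning rate (assumed)

for _ in range(2000):
    y_hat = w * x + b
    # Gradients of MSE = mean((y - y_hat)^2) with respect to w and b.
    grad_w = (-2 * (y - y_hat) * x).mean()
    grad_b = (-2 * (y - y_hat)).mean()
    w -= lr * grad_w   # step in the direction that lowers MSE
    b -= lr * grad_b

print(w, b)  # close to 2 and 1
print(((y - (w * x + b)) ** 2).mean())  # final MSE, now small
```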
Regression: MSE is the average of squared residuals
MSE is an error score made by squaring the residuals $y_i - \hat y_i$ and taking their average. As predictions get closer to the true values, the residuals shrink and MSE becomes smaller.
Unpacking MSE
$\text{MSE} = \frac{1}{n}\sum_i (y_i - \hat y_i)^2$
- $i$ — sample index.
- $y_i$ — actual value at that point.
- $\hat y_i$ — predicted value.
- $y_i - \hat y_i$ — residual.
- $(y_i - \hat y_i)^2$ — squared error at that point.
- $\sum_i$ — sum over points = SSE.
- $\frac{1}{n}$ — average = MSE.
Closer predictions → smaller residuals and smaller MSE.
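The same unpacking in code, as a minimal sketch with made-up numbers:

```python
import numpy as np

y     = np.array([3.0, 5.0, 7.0])   # actual values y_i (made up)
y_hat = np.array([2.5, 5.0, 8.0])   # predictions (made up)

residuals = y - y_hat               # y_i - y_hat_i
squared   = residuals ** 2          # squared error at each point
sse       = squared.sum()           # sum over points = SSE
mse       = sse / len(y)            # average = MSE

print(residuals)   # [ 0.5  0.  -1. ]
print(sse, mse)    # 1.25  0.4166...
```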

Classification: cross-entropy
Cross-entropy measures how wrong the predicted probability of the true class is.
The binary case is unpacked in detail below under "Unpacking binary cross-entropy".
Unpacking binary cross-entropy
$\ell = -\big(y\log\hat p + (1-y)\log(1-\hat p)\big)$
- $y \in \{0,1\}$ — label.
- $\hat p$ — predicted probability of class 1 (between 0 and 1).
- $\log$ — usually the natural log.
When $y=1$ — $(1-y)\log(1-\hat p)=0$, so $\ell = -\log\hat p$. Higher $\hat p$ means lower loss.
When $y=0$ — $y\log\hat p=0$, so $\ell = -\log(1-\hat p)$. Here $1-\hat p$ is the probability of class 0.
The two terms $y\log\hat p$ and $(1-y)\log(1-\hat p)$ mean that only one branch is active per sample, pushing probability toward the true class.
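A minimal sketch of this per-sample loss; the clipping epsilon is an added safeguard against $\log 0$, not part of the formula:

```python
import numpy as np

def binary_cross_entropy(y, p_hat, eps=1e-12):
    """Per-sample loss -(y*log(p_hat) + (1-y)*log(1-p_hat))."""
    p_hat = np.clip(p_hat, eps, 1 - eps)  # avoid log(0)
    return -(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

# Only one branch is active per sample:
print(binary_cross_entropy(1, 0.9))  # ~0.105 -- confident and correct
print(binary_cross_entropy(1, 0.1))  # ~2.303 -- confident and wrong
print(binary_cross_entropy(0, 0.1))  # ~0.105 -- same loss, mirrored branch
```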
Multi-class — for true class $k$, the per-sample loss is usually
$\ell = -\log \hat p_k$
(typically paired with softmax probabilities). When the predicted probability of the true class is low, loss becomes large, and training pushes that probability upward.
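A minimal sketch pairing softmax with the $-\log \hat p_k$ loss; the logits here are made-up numbers:

```python
import numpy as np

def softmax(z):
    z = z - z.max()   # subtract the max to stabilize the exponentials
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 0.5, -1.0])  # raw scores (assumed)
p_hat = softmax(logits)              # class probabilities, sum to 1

k = 0                                # index of the true class
loss = -np.log(p_hat[k])             # per-sample cross-entropy
print(p_hat, loss)                   # high p_hat[k] -> low loss
```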

$R^2$ (Coefficient of Determination): improvement over "predicting the mean"
In regression, MSE/RMSE measure error magnitude, but if you want to go one step further and ask how much better the model explains the variation than the baseline (predicting the mean), look at $R^2$ as well.
$R^2 = 1 - \frac{\sum_i (y_i-\hat y_i)^2}{\sum_i (y_i-\bar y)^2} = 1 - \frac{\text{SSE}}{\text{SST}}$
First, the symbols
- $y_i$: the true value of the $i$-th data point.
- $\hat y_i$: the predicted value of the $i$-th data point.
- $\bar y$: the mean of all $y_i$.
- SSE = $\sum_i (y_i-\hat y_i)^2$: the model's sum of squared errors (smaller is better).
- SST = $\sum_i (y_i-\bar y)^2$: the baseline sum of squared errors (using only the mean).
Quick calculation steps
1. Compute $\bar y$.
2. Compute the baseline error $\text{SST} = \sum_i (y_i-\bar y)^2$.
3. Compute the model error $\text{SSE} = \sum_i (y_i-\hat y_i)^2$.
4. $R^2 = 1 - \text{SSE}/\text{SST}$.
Interpretation guide
- $R^2 = 1$: SSE = 0 → perfect predictions
- $R^2 = 0$: SSE = SST → about the same as predicting the mean
- $R^2 < 0$: SSE > SST → worse than the mean baseline
So $R^2$ tells you the fraction by which the model reduces squared error relative to the baseline.
A short numeric example
Let the true values be $y=[3,5,7]$, so $\bar y=5$.
- Baseline (mean only): $\text{SST}=(3-5)^2+(5-5)^2+(7-5)^2=4+0+4=8$
- Model predictions $\hat y=[4,5,6]$:
$\text{SSE}=(3-4)^2+(5-5)^2+(7-6)^2=1+0+1=2$
Therefore
$R^2 = 1 - 2/8 = 0.75$
→ The model reduced squared error by 75% compared to predicting the mean.
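Running the four steps above in code reproduces this example (a minimal sketch):

```python
import numpy as np

y     = np.array([3.0, 5.0, 7.0])   # true values from the example
y_hat = np.array([4.0, 5.0, 6.0])   # model predictions from the example

y_bar = y.mean()                    # step 1: mean of y = 5
sst = ((y - y_bar) ** 2).sum()      # step 2: baseline error = 8
sse = ((y - y_hat) ** 2).sum()      # step 3: model error = 2
r2  = 1 - sse / sst                 # step 4: 1 - 2/8 = 0.75
print(sst, sse, r2)
```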
Important: don't rely on $R^2$ alone
- Since $R^2$ is a ratio, values may not be directly comparable across different datasets.
- In practice, report RMSE and $R^2$ together (error size + explanatory power).

Why it matters

Learning direction — In regression with MSE loss, the model updates its parameters in the direction that reduces MSE, a clear objective.
MSE: smooth and easy to optimize — Squared error is smooth and easy to differentiate, so gradient descent works well.
RMSE — MSE uses squared units; $\sqrt{\text{MSE}}$ (RMSE) restores the same units as $y$ for interpretation.
Match loss to task — Continuous targets fit MSE; class probabilities fit cross-entropy, which aligns with maximum likelihood. Ch.05 logistic regression connects sigmoid outputs $\hat p$ to this loss.
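For instance, converting the diagram's MSE of 0.85 back to the units of $y$ is a one-liner (a minimal sketch):

```python
import numpy as np

mse = 0.85            # the MSE from the diagram above
rmse = np.sqrt(mse)   # ~0.92, in the same units as y
print(rmse)
```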

How it is used

Regression training — Train with MSE for prices, temperatures, etc.
Model comparison (regression) — Smaller MSE means better fit.
Deep learning regression — Neural nets predicting numbers often use MSE at the output.
Classification — Logistic regression, softmax classifiers, and neural classifiers typically minimize cross-entropy.