Ch.05

Loss Function (MSE · Cross-Entropy · R²): Measuring Prediction Error

A loss function turns how wrong the model is into one number. For regression, we often use mean squared error (MSE), computed from the gap between the prediction $\hat y$ and the actual value $y$, and we also look at $R^2$ (the coefficient of determination) to see how much of the variation the model explains. For classification, we measure how far predicted class probabilities are from the truth with cross-entropy. The diagram below uses MSE to show how a regression loss decreases.


Regression loss example: MSE is the average of squared errors between the prediction $\hat y$ and the actual $y$. (For classification we use cross-entropy.)

Figure: each squared error drawn as a square whose side length is |residual|; with 5 points, SSE = 4.25, so MSE = SSE ÷ 5 = 0.85.

$\text{MSE} = \frac{1}{n}\sum_i (y_i - \hat y_i)^2$ — the smaller the loss, the better the line fits the data.


Regression: MSE
We need a loss that summarizes error in one number.
- Residual — the difference between the actual $y$ and the prediction $\hat y$.
- SSE — the sum of $(y_i - \hat y_i)^2$ over all points (sum of squared errors).
- MSE — SSE divided by the number of points $n$ (mean squared error).
$\text{MSE} = \frac{1}{n}\sum_i (y_i - \hat y_i)^2 = \text{SSE}/n$. Smaller MSE means a better fit.
Why square?
- Residuals $+2$ and $-2$ both mean "off by 2"; raw sums can cancel.
- Squaring keeps values positive and compares magnitude only.
- Large errors get a bigger penalty, so the model avoids large mistakes.
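As a quick numeric check of the cancellation point above, here is a minimal sketch (the residual values are made up for illustration):

```python
import numpy as np

# Two hypothetical residuals: off by 2 in opposite directions.
residuals = np.array([+2.0, -2.0])

print(residuals.sum())          # 0.0 -- raw residuals cancel, hiding the error
print((residuals ** 2).mean())  # 4.0 -- squaring keeps every error positive
                                #        and penalizes large errors more
```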
Linear regression
The line $\hat y = wx + b$ from Ch.03 is "best" when MSE (or SSE) is minimized: we choose $w$ and $b$ that minimize the average squared error.
Gradient descent updates $w$ and $b$ step by step in the direction that lowers MSE.
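A minimal sketch of that update loop, assuming a made-up toy dataset, learning rate, and iteration count (none of these come from the chapter):

```python
import numpy as np

# Toy data roughly following y = 2x + 1 (assumed for illustration).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

w, b = 0.0, 0.0   # initial guesses
lr = 0.02         # learning rate (assumed)

for _ in range(2000):
    y_hat = w * x + b
    # Gradients of MSE = mean((y - y_hat)^2) with respect to w and b.
    grad_w = (-2 * (y - y_hat) * x).mean()
    grad_b = (-2 * (y - y_hat)).mean()
    w -= lr * grad_w   # step in the direction that lowers MSE
    b -= lr * grad_b

print(w, b)  # close to 2 and 1
print(((y - (w * x + b)) ** 2).mean())  # final MSE, now small
```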
Regression: MSE is the average of squared residuals
MSE is an error score made by squaring the residuals $y_i - \hat y_i$ and taking their average. As predictions get closer to the true values, the residuals shrink and MSE becomes smaller.
Unpacking MSE
$\text{MSE} = \frac{1}{n}\sum_i (y_i - \hat y_i)^2$
- $i$ — sample index.
- $y_i$ — actual value at that point.
- $\hat y_i$ — predicted value.
- $y_i - \hat y_i$ — residual.
- $(y_i - \hat y_i)^2$ — squared error at that point.
- $\sum_i$ — sum over points = SSE.
- $\frac{1}{n}$ — average = MSE.
Closer predictions → smaller residuals and smaller MSE.
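The same unpacking in code, as a minimal sketch with made-up numbers:

```python
import numpy as np

y     = np.array([3.0, 5.0, 7.0])   # actual values y_i (made up)
y_hat = np.array([2.5, 5.0, 8.0])   # predictions (made up)

residuals = y - y_hat               # y_i - y_hat_i
squared   = residuals ** 2          # squared error at each point
sse       = squared.sum()           # sum over points = SSE
mse       = sse / len(y)            # average = MSE

print(residuals)   # [ 0.5  0.  -1. ]
print(sse, mse)    # 1.25  0.4166...
```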

Classification: cross-entropy
Cross-entropy measures how wrong the predicted probability of the true class is.
The binary case is unpacked in detail below under "Unpacking binary cross-entropy".
Unpacking binary cross-entropy
$\ell = -\big(y\log\hat p + (1-y)\log(1-\hat p)\big)$
- $y \in \{0,1\}$ — label.
- $\hat p$ — predicted probability of class 1 (between 0 and 1).
- $\log$ — usually the natural log.
When $y=1$ — $(1-y)\log(1-\hat p)=0$, so $\ell = -\log\hat p$. Higher $\hat p$ means lower loss.
When $y=0$ — $y\log\hat p=0$, so $\ell = -\log(1-\hat p)$. Here $1-\hat p$ is the probability of class 0.
The two terms $y\log\hat p$ and $(1-y)\log(1-\hat p)$ mean that only one branch is active per sample, pushing probability toward the true class.
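A minimal sketch of this per-sample loss; the clipping epsilon is an added safeguard against $\log 0$, not part of the formula:

```python
import numpy as np

def binary_cross_entropy(y, p_hat, eps=1e-12):
    """Per-sample loss -(y*log(p_hat) + (1-y)*log(1-p_hat))."""
    p_hat = np.clip(p_hat, eps, 1 - eps)  # avoid log(0)
    return -(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

# Only one branch is active per sample:
print(binary_cross_entropy(1, 0.9))  # ~0.105 -- confident and correct
print(binary_cross_entropy(1, 0.1))  # ~2.303 -- confident and wrong
print(binary_cross_entropy(0, 0.1))  # ~0.105 -- same loss, mirrored branch
```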
Multi-class — for true class $k$, the per-sample loss is usually
$\ell = -\log \hat p_k$
(typically paired with softmax probabilities). When the predicted probability of the true class is low, loss becomes large, and training pushes that probability upward.
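A minimal sketch pairing softmax with the $-\log \hat p_k$ loss; the logits here are made-up numbers:

```python
import numpy as np

def softmax(z):
    z = z - z.max()   # subtract the max to stabilize the exponentials
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 0.5, -1.0])  # raw scores (assumed)
p_hat = softmax(logits)              # class probabilities, sum to 1

k = 0                                # index of the true class
loss = -np.log(p_hat[k])             # per-sample cross-entropy
print(p_hat, loss)                   # high p_hat[k] -> low loss
```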

$R^2$ (Coefficient of Determination): improvement over "predicting the mean"
In regression, MSE/RMSE measure error magnitude, but if you want to go one step further and ask how much better the model explains the variation than the baseline (predicting the mean), look at $R^2$ as well.
$R^2 = 1 - \frac{\sum_i (y_i-\hat y_i)^2}{\sum_i (y_i-\bar y)^2} = 1 - \frac{\text{SSE}}{\text{SST}}$
First, the symbols
- $y_i$: the true value of the $i$-th data point.
- $\hat y_i$: the predicted value of the $i$-th data point.
- $\bar y$: the mean of all $y_i$.
- SSE = $\sum_i (y_i-\hat y_i)^2$: the model's sum of squared errors (smaller is better).
- SST = $\sum_i (y_i-\bar y)^2$: the baseline sum of squared errors (using only the mean).
Quick calculation steps
1. Compute $\bar y$.
2. Compute the baseline error $\text{SST} = \sum_i (y_i-\bar y)^2$.
3. Compute the model error $\text{SSE} = \sum_i (y_i-\hat y_i)^2$.
4. $R^2 = 1 - \text{SSE}/\text{SST}$.
Interpretation guide
- $R^2 = 1$: SSE = 0 → perfect predictions
- $R^2 = 0$: SSE = SST → about the same as predicting the mean
- $R^2 < 0$: SSE > SST → worse than the mean baseline
So $R^2$ tells you the fraction by which the model reduces squared error relative to the baseline.
A short numeric example
Let the true values be $y=[3,5,7]$, so $\bar y=5$.
- Baseline (mean only): $\text{SST}=(3-5)^2+(5-5)^2+(7-5)^2=4+0+4=8$
- Model predictions $\hat y=[4,5,6]$:
$\text{SSE}=(3-4)^2+(5-5)^2+(7-6)^2=1+0+1=2$
Therefore
$R^2 = 1 - 2/8 = 0.75$
→ The model reduced squared error by 75% compared to predicting the mean.
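Running the four steps above in code reproduces this example (a minimal sketch):

```python
import numpy as np

y     = np.array([3.0, 5.0, 7.0])   # true values from the example
y_hat = np.array([4.0, 5.0, 6.0])   # model predictions from the example

y_bar = y.mean()                    # step 1: mean of y = 5
sst = ((y - y_bar) ** 2).sum()      # step 2: baseline error = 8
sse = ((y - y_hat) ** 2).sum()      # step 3: model error = 2
r2  = 1 - sse / sst                 # step 4: 1 - 2/8 = 0.75
print(sst, sse, r2)
```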
Important: don't rely on $R^2$ alone
- Since $R^2$ is a ratio, values may not be directly comparable across different datasets.
- In practice, report RMSE and $R^2$ together (error size + explanatory power).

Why it matters

Learning direction — In regression with MSE loss, the model updates its parameters in the direction that reduces MSE, a clear objective.
MSE: smooth and easy to optimize — Squared error is smooth and easy to differentiate, so gradient descent works well.
RMSE — MSE uses squared units; $\sqrt{\text{MSE}}$ (RMSE) restores the same units as $y$ for interpretation.
Match loss to task — Continuous targets fit MSE; class probabilities fit cross-entropy, which aligns with maximum likelihood. Ch.05 logistic regression connects sigmoid outputs $\hat p$ to this loss.
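For instance, converting the diagram's MSE of 0.85 back to the units of $y$ is a one-liner (a minimal sketch):

```python
import numpy as np

mse = 0.85            # the MSE from the diagram above
rmse = np.sqrt(mse)   # ~0.92, in the same units as y
print(rmse)
```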

How it is used

Regression training — Train with MSE for prices, temperatures, etc.
Model comparison (regression) — Smaller MSE means better fit.
Deep learning regression — Neural nets predicting numbers often use MSE at the output.
Classification — Logistic regression, softmax classifiers, and neural classifiers typically minimize cross-entropy.