Ch.03
Loss Function (MSE): Measuring Prediction Error
When finding the 'best-fitting line' in linear regression, we need a single number that says how far predictions are from the truth. The Sum of Squared Errors (SSE) is the sum of the squared residuals $(y_i - \hat{y}_i)^2$ over all points. Dividing SSE by the number of data points $n$ gives the Mean Squared Error (MSE). The closer MSE is to zero, the better the model fits the data, and gradient descent is the procedure that minimizes this MSE.
MSE is the average of squared errors between the predictions $\hat{y}_i$ and the actual values $y_i$.
MSE — the smaller the loss, the better the line fits the data.
The ruler for error — We need a loss function that summarizes how wrong the model is. At each point, the difference between actual and prediction is the residual (or error): $e_i = y_i - \hat{y}_i$. Squaring each residual and adding them up gives the Sum of Squared Errors: $\text{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$. Dividing SSE by the number of points $n$ gives the Mean Squared Error: $\text{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$. The smaller this value, the better the model fits.
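As a sketch, here is that chain (residuals → SSE → MSE) in plain Python; the data values are made up for illustration:

```python
# Residuals, SSE, and MSE for a toy dataset (values are hypothetical).
y_true = [3.0, 5.0, 7.0, 9.0]   # actual values y_i
y_pred = [2.5, 5.5, 6.0, 9.5]   # model predictions y_hat_i

residuals = [y - yh for y, yh in zip(y_true, y_pred)]  # e_i = y_i - y_hat_i
sse = sum(e ** 2 for e in residuals)                   # Sum of Squared Errors
mse = sse / len(y_true)                                # Mean Squared Error

# residuals == [0.5, -0.5, 1.0, -0.5], sse == 1.75, mse == 0.4375
```

A smaller `mse` means the predictions sit closer to the actual values on average.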
Why square? — A residual of $+2$ or $-2$ both mean 'off by 2'. If we summed raw residuals, $+2$ and $-2$ would cancel. Squaring keeps everything positive and penalizes large errors more.
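The cancellation problem is easy to demonstrate with two points that are each off by 2 in opposite directions:

```python
# Raw residuals can cancel; squared residuals cannot.
residuals = [2.0, -2.0]                       # two points, each "off by 2"
raw_sum = sum(residuals)                      # 0.0 -- looks like a perfect fit, but isn't
squared_sum = sum(e ** 2 for e in residuals)  # 8.0 -- reflects the real error
```

The raw sum reports zero error for a model that is wrong at both points, while the squared sum does not.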
Link to linear regression — The line from Ch03 'fits the data best' when MSE (or equivalently SSE) is minimized. Gradient descent updates the slope and intercept in the direction that reduces MSE.
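A minimal gradient-descent loop for a line $y = wx + b$ under MSE might look like the following sketch; the data, learning rate, and iteration count are assumptions chosen so the toy example converges:

```python
# Minimal gradient descent for y = w*x + b under MSE.
# The data follows y = 2x + 1 exactly, so the loop should approach w=2, b=1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

w, b, lr = 0.0, 0.0, 0.05
n = len(xs)
for _ in range(2000):
    # Gradients of MSE = (1/n) * sum((y - (w*x + b))**2) with respect to w and b
    dw = (-2.0 / n) * sum((y - (w * x + b)) * x for x, y in zip(xs, ys))
    db = (-2.0 / n) * sum((y - (w * x + b)) for x, y in zip(xs, ys))
    w -= lr * dw   # step in the direction that reduces MSE
    b -= lr * db
```

Each step moves `w` and `b` opposite to the gradient, which by construction is the direction that lowers MSE.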
It defines the learning goal — Machine learning is often summarized as 'minimize the loss'. For regression that loss is typically MSE: the model updates only in directions that lower it, so the objective is unambiguous.
Differentiation is easy — The square function has a simple derivative, so gradient descent with MSE is tractable. Deep learning also uses squared-error-style losses widely.
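To make "easy to differentiate" concrete, the derivative of a single squared error with respect to the prediction is just $-2(y - \hat{y})$. A quick finite-difference check (function names here are my own) confirms the analytic formula:

```python
# The squared-error derivative is simple: d/dyhat (y - yhat)**2 = -2*(y - yhat).
def sq_loss(y, yhat):
    return (y - yhat) ** 2

def sq_loss_grad(y, yhat):
    return -2.0 * (y - yhat)

# Check the analytic gradient against a central finite-difference estimate.
y, yhat, h = 5.0, 3.0, 1e-6
numeric = (sq_loss(y, yhat + h) - sq_loss(y, yhat - h)) / (2 * h)
analytic = sq_loss_grad(y, yhat)   # -2 * (5 - 3) = -4.0
```

This simplicity is why gradient descent with MSE needs only a couple of lines per update.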
RMSE: back to the original units — Because MSE averages squared errors, its unit is the unit of $y$ squared (e.g. dollars² for price prediction). In practice we often want to say “on average we’re off by so many dollars or degrees.” Taking the square root of MSE gives RMSE (Root Mean Squared Error): $\text{RMSE} = \sqrt{\text{MSE}}$, which has the same units as $y$. Once you understand MSE, RMSE follows naturally.
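A short sketch of the units point, with made-up dollar values where every prediction is off by exactly 10:

```python
import math

# MSE is in squared units; taking the square root returns to the original units.
y_true = [100.0, 200.0, 300.0]   # e.g. prices in dollars (hypothetical values)
y_pred = [110.0, 190.0, 310.0]   # each prediction is off by 10 dollars

mse = sum((y - yh) ** 2 for y, yh in zip(y_true, y_pred)) / len(y_true)  # dollars^2
rmse = math.sqrt(mse)                                                    # dollars
# mse == 100.0 (dollars squared), rmse == 10.0 (dollars)
```

Here RMSE recovers the interpretable statement "off by about 10 dollars on average."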
Training regression models — Linear regression, neural network regression, etc. compute MSE on the training data and update parameters to reduce it.
Comparing models — To compare which line (or model) fits the data better, compute MSE for each; the smaller value wins.
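Model comparison then reduces to one function call per candidate; the two candidate lines and the data below are illustrative assumptions:

```python
# Compare two candidate lines on the same data: the lower MSE wins.
def mse(y_true, y_pred):
    return sum((y - yh) ** 2 for y, yh in zip(y_true, y_pred)) / len(y_true)

xs = [1.0, 2.0, 3.0]
ys = [2.1, 3.9, 6.2]           # noisy data roughly following y = 2x

line_a = [2 * x for x in xs]    # candidate A: y = 2x
line_b = [x + 1 for x in xs]    # candidate B: y = x + 1

better = "y = 2x" if mse(ys, line_a) < mse(ys, line_b) else "y = x + 1"
```

Since the data was generated near $y = 2x$, candidate A ends up with the smaller MSE.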
Validation and test — After training, computing MSE on unseen data (validation/test set) gives an objective measure of generalization.
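As a sketch of that workflow (the fitted parameters and the train/validation split are hypothetical), the same `mse` helper is simply applied to data the model never saw:

```python
# Evaluate generalization: parameters were fit on training data (elsewhere);
# MSE on a held-out validation set measures how well the model generalizes.
def mse(y_true, y_pred):
    return sum((y - yh) ** 2 for y, yh in zip(y_true, y_pred)) / len(y_true)

def predict(xs, w, b):
    return [w * x + b for x in xs]

w, b = 2.0, 1.0                                    # parameters from training
train_x, train_y = [0.0, 1.0, 2.0], [1.0, 3.1, 4.9]
val_x, val_y = [3.0, 4.0], [7.2, 8.8]              # unseen data

train_mse = mse(train_y, predict(train_x, w, b))
val_mse = mse(val_y, predict(val_x, w, b))         # the number to report
```

A validation MSE much larger than the training MSE is the usual warning sign of overfitting.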
Summary: Loss function (MSE)
① Flow of concepts — The difference between actual and predicted is the residual (error) $e_i = y_i - \hat{y}_i$. Squaring each residual and summing over all points gives the Sum of Squared Errors $\text{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$; dividing SSE by the number of points $n$ gives the Mean Squared Error $\text{MSE} = \text{SSE}/n$. To express error in the same units as $y$, we use $\text{RMSE} = \sqrt{\text{MSE}}$.
② Why square? — A residual of $+3$ or $-3$ both mean "off by 3." Summing raw residuals can cancel out; squaring keeps everything positive and penalizes larger errors more, so the model is encouraged to avoid big mistakes.
③ Role in learning — MSE is the compass: "move in the direction that reduces this value." Gradient descent updates the slope $w$ and intercept $b$ to minimize MSE. The square function is smooth and easy to differentiate, which makes finding the minimum tractable.
④ Where it’s used — Regression (price, temperature, stock prediction, etc.), model comparison (smaller MSE is better), and as the output-layer loss in deep learning. For solution steps and worked examples, see the Explanation for problem solving block below.