Ch.04

Linear Regression: A Line Through the Data

When data points are scattered, linear regression finds the line that best fits their trend and predicts values for new inputs. It is the first regression model where you can see how functions, derivatives, and partial derivatives from Basic Math lead directly to machine learning 'training'.

[Diagram: ① Training data — an (x, y) scatter plot with the fitted line $y \approx 0.7x + 1.1$, where $w$ and $b$ are learned by gradient descent.]

What is linear regression? — We assume a linear relationship $y = wx + b$ (or $y = \mathbf{w}^\top \mathbf{x} + b$ for multiple variables) between input $x$ and output $y$, and find the weight $w$ and intercept $b$ that best fit the data. The function $y = f(x)$ from Basic Math Ch01 is here a concrete linear function.
What does 'best fit' mean? — We minimize the error between predictions $\hat y_i = w x_i + b$ and actual values $y_i$. The function that measures this error is the loss function; MSE (Mean Squared Error), covered in Ch04, is the most common.
Difference from KNN — KNN predicted by the 'average of neighbors'; linear regression learns and stores one formula (a line). At prediction time, we only compute $\hat y = wx + b$ without searching for neighbors; a short NumPy sketch after this list shows both the prediction and the MSE loss.
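
To make the two definitions above concrete, here is a minimal NumPy sketch of the prediction $\hat y = wx + b$ and the MSE loss. The data values are hypothetical, chosen to roughly follow the diagram's line $y \approx 0.7x + 1.1$.

```python
import numpy as np

# Hypothetical toy data, roughly following the diagram's line y ≈ 0.7x + 1.1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 1.9, 2.4, 3.3, 3.9])

def predict(x, w, b):
    """Linear model: y_hat = w * x + b."""
    return w * x + b

def mse(y_true, y_pred):
    """Mean Squared Error: the average squared residual."""
    return np.mean((y_true - y_pred) ** 2)

y_hat = predict(x, w=0.7, b=1.1)  # the line from the diagram
print(mse(y, y_hat))              # 0.008, small, so the line fits well
```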

Why it matters

First application of differentiation and optimization — To minimize error, we use differentiation (Basic Math Ch06). Following the gradient of the loss with respect to $w$ and $b$ leads to the minimum. This is gradient descent, the same principle behind deep learning training; a sketch of the update loop follows this list.
Interpretability — The learned $w$ tells us 'how much $y$ changes when $x$ increases by 1'. For example, with house area ($x$) and price ($y$), $w > 0$ means 'larger area, higher price', matching intuition. This interpretability matters when trusting and improving models in practice.
Foundation for other models — Logistic regression (Ch05) and a single neuron in a neural network both use 'linear transformation + nonlinear function'. Understanding linear regression clarifies how their linear part works.
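
As a minimal sketch of the gradient descent just described: starting from arbitrary $w$ and $b$, we repeatedly step against the partial derivatives of the MSE. The learning rate and step count here are hypothetical choices, not values prescribed by the chapter.

```python
import numpy as np

# Same hypothetical toy data as in the sketch above
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 1.9, 2.4, 3.3, 3.9])

w, b = 0.0, 0.0   # arbitrary starting point
lr = 0.05         # learning rate (hypothetical choice)

for step in range(2000):
    error = (w * x + b) - y          # prediction minus target
    grad_w = 2 * np.mean(error * x)  # ∂MSE/∂w
    grad_b = 2 * np.mean(error)      # ∂MSE/∂b
    w -= lr * grad_w                 # step downhill along the gradient
    b -= lr * grad_b

print(w, b)  # converges to w ≈ 0.72, b ≈ 1.06, close to the diagram's line
```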

How it is used

Regression — Used to predict continuous numbers: house prices, sales, temperature, scores. With multiple features, $y = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n + b$ becomes multiple linear regression.
Feature importance — Features with larger $|w_i|$ have more influence on predictions (provided the features are on comparable scales). When doing feature engineering (Ch01), we use these values to decide which features to keep or drop.
Normal equation vs gradient descent — With few features, the normal equation gives the optimal solution in one step. With many features or large data, gradient descent updates $w$ iteratively. Partial derivatives and gradients from Basic Math Ch08 are the key tools here; a normal-equation sketch follows this list.
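
For completeness, here is a minimal sketch of the one-step normal-equation solution $\boldsymbol{\theta} = (X^\top X)^{-1} X^\top \mathbf{y}$ on hypothetical multi-feature data (the iterative gradient descent counterpart is sketched above). The true weights and noise level are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 2 features with true weights [0.7, -0.3] and intercept 1.1
X = rng.normal(size=(100, 2))
y = X @ np.array([0.7, -0.3]) + 1.1 + rng.normal(scale=0.05, size=100)

# Append a column of ones so the intercept b is learned as one more weight,
# then solve the normal equation (X^T X) theta = X^T y in a single step.
Xb = np.hstack([X, np.ones((100, 1))])
theta = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)

print(theta)              # ≈ [0.7, -0.3, 1.1]
print(np.abs(theta[:2]))  # |w_i|: feature influence (features share the same scale here)
```

In practice, `np.linalg.lstsq(Xb, y, rcond=None)` is preferred over forming $X^\top X$ explicitly, since it is more numerically stable.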