Ch.04
Linear Regression: A Line Through the Data
When data points are scattered, linear regression finds the line that best fits their trend and predicts values for new inputs. It is the first regression model where you can see how functions, derivatives, and partial derivatives from Basic Math lead directly to machine learning 'training'.
[Diagram: ① training data as an (x, y) scatter plot; ② the fitted line ŷ = wx + b, learned by gradient descent]
What is linear regression? — We assume a linear relationship y = wx + b (or y = w₁x₁ + … + wₙxₙ + b for multiple variables) between input x and output y, and find the weight w and intercept b that best fit the data. The function f(x) from Basic Math Ch01 is here a concrete linear function, f(x) = wx + b.
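As a sketch of the paragraph above, the model really is just a one-line function (names here are illustrative, not from the chapter):

```python
# The linear model as a concrete function f(x) = w*x + b.
def f(x, w, b):
    return w * x + b

# With w = 2 and b = 1, an input of 3 maps to 2*3 + 1 = 7.
```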
What does 'best fit' mean? — We minimize the error between predictions ŷ and actual values y. The function that measures this error is the loss function; MSE (Mean Squared Error), MSE = (1/n) Σᵢ (ŷᵢ − yᵢ)², covered in Ch04, is the most common.
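A minimal sketch of MSE, assuming equal-length lists of predictions and targets (function name is illustrative):

```python
# Mean Squared Error: average of squared differences between
# predictions y_hat and actual values y.
def mse(y_hat, y):
    n = len(y)
    return sum((p - t) ** 2 for p, t in zip(y_hat, y)) / n

# Errors of 0 and 2 give (0 + 4) / 2 = 2.0.
```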
Difference from KNN — KNN predicted by the 'average of neighbors'; linear regression learns and stores one formula (a line). At prediction time, we only compute ŷ = wx + b, with no neighbor search.
Why it matters
First application of differentiation and optimization — To minimize error, we use differentiation (Basic Math Ch06). Following the gradient of the loss with respect to w and b downhill leads to the minimum. This is gradient descent, the same principle behind deep learning training.
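The training step described above can be sketched as a small loop; this is an assumed minimal implementation, not the chapter's code (learning rate and step count are arbitrary):

```python
# Gradient descent on w and b for y ≈ w*x + b, using the MSE gradients
#   dL/dw = (2/n) Σ (ŷᵢ − yᵢ) xᵢ    and    dL/db = (2/n) Σ (ŷᵢ − yᵢ).
def train(xs, ys, lr=0.05, steps=1000):
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errs = [(w * x + b) - y for x, y in zip(xs, ys)]  # ŷ − y
        grad_w = 2 / n * sum(e * x for e, x in zip(errs, xs))
        grad_b = 2 / n * sum(errs)
        w -= lr * grad_w   # step downhill along the gradient
        b -= lr * grad_b
    return w, b
```

On data generated by y = 2x + 1, the loop converges to w ≈ 2 and b ≈ 1.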
Interpretability — The learned w tells us 'how much y changes when x increases by 1'. For example, with house area (x) and price (y), w > 0 means 'larger area, higher price', matching intuition. This interpretability matters when trusting and improving models in practice.
Foundation for other models — Logistic regression (Ch05) and a single neuron in a neural network both compute 'linear transformation + nonlinear function'. Understanding linear regression clarifies how their linear part works.
How it is used
Regression — Used to predict continuous numbers: house prices, sales, temperature, scores. With multiple features, the model y = w₁x₁ + … + wₙxₙ + b is called multiple linear regression.
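With multiple features, prediction is one matrix-vector product. A sketch with two made-up features and assumed weights:

```python
import numpy as np

# Multiple linear regression prediction: y = X·w + b.
w = np.array([2.0, -1.0])   # one weight per feature (assumed values)
b = 0.5                     # intercept
X = np.array([[1.0, 2.0],
              [3.0, 0.0]])  # two samples, two features each

y_hat = X @ w + b
# Row 1: 2*1 - 1*2 + 0.5 = 0.5;  row 2: 2*3 - 1*0 + 0.5 = 6.5
```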
Feature importance — Features with larger |wᵢ| have more influence on predictions (compare weights only after scaling features to a common range). When doing feature engineering (Ch01), we use these values to decide which features to keep or drop.
Normal equation vs gradient descent — With few features, the normal equation w = (XᵀX)⁻¹Xᵀy gives the optimal solution in one step. With many features or large data, gradient descent updates iteratively. Partial derivatives and gradients from Basic Math Ch08 are the key tools here.
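The one-step solution can be sketched as follows; the data is made up (exactly y = 2x + 1), and solving the linear system (XᵀX)θ = Xᵀy is preferred over forming the inverse explicitly:

```python
import numpy as np

# Normal equation: solve (XᵀX)θ = Xᵀy, with a column of ones
# appended to X so the intercept b is learned as an extra weight.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])          # exactly y = 2x + 1

Xb = np.hstack([X, np.ones((len(X), 1))])   # add intercept column
w, b = np.linalg.solve(Xb.T @ Xb, Xb.T @ y) # one step, no iteration
```

On this data the exact solution w = 2, b = 1 comes out in a single solve, no learning rate or iteration count needed.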