Ch.03

Linear Regression: A Line That Cuts Through the Data's Trend

When data points are scattered, linear regression finds the line that best fits their trend and predicts values for new inputs. It is the first regression model where you can see how functions, derivatives, and partial derivatives from Basic Math lead directly to machine learning 'training'.

[Interactive diagram: per-chapter ML flow. ① Training data shown as an (x, y) scatter plot with the fitted line y ≈ 0.7x + 1.1; w and b are learned by gradient descent.]

Linear Regression: A Line Through the Data

What is linear regression? — We assume a linear relationship y = w_1 x + w_0 (or y = \mathbf{w}^\top \mathbf{x} + b for multiple variables) between input x and output y, and find the weight w and intercept b that best fit the data. The function y = f(x) from Basic Math Ch01 appears here as a concrete linear function.
What does 'best fit' mean? — We minimize the error between predictions \hat{y}_i = w x_i + b and actual values y_i. The function that measures this error is the loss function; MSE (Mean Squared Error), covered in Ch04, is the most common choice.
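As a concrete illustration, the MSE loss above can be evaluated for any candidate line on a toy dataset. The data points below are hypothetical, chosen only so they roughly follow a line:

```python
import numpy as np

# Toy data that roughly follows a line (values are illustrative)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.9, 2.4, 3.3, 3.9, 4.5])

def mse(w, b, x, y):
    """Mean squared error of the candidate line y_hat = w*x + b."""
    y_hat = w * x + b
    return np.mean((y_hat - y) ** 2)

print(mse(0.7, 1.1, x, y))  # a good line: small error (0.008 for this data)
print(mse(0.0, 0.0, x, y))  # a poor line: much larger error
```

A smaller MSE means the line passes closer to the points; 'training' is just the search for the (w, b) pair that makes this number as small as possible.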
Difference from KNN — KNN predicted by averaging the neighbors; linear regression learns and stores one formula (a line). At prediction time we only compute \hat{y} = w x + b, with no neighbor search.
First application of differentiation and optimization — To minimize the error, we use differentiation (Basic Math Ch06). Following the gradient of the loss with respect to w and b leads to the minimum. This is gradient descent, the same principle behind deep learning training.
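The gradient-descent loop described above can be sketched in a few lines. The data, learning rate, and iteration count below are illustrative assumptions, not values from the text; the gradients are the partial derivatives of the MSE loss with respect to w and b:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.9, 2.4, 3.3, 3.9, 4.5])

w, b = 0.0, 0.0   # start from an arbitrary (bad) line
lr = 0.02         # learning rate (hypothetical choice)

for _ in range(5000):
    y_hat = w * x + b
    # Partial derivatives of MSE = mean((y_hat - y)^2):
    grad_w = 2 * np.mean((y_hat - y) * x)
    grad_b = 2 * np.mean(y_hat - y)
    w -= lr * grad_w  # step downhill in w
    b -= lr * grad_b  # step downhill in b

# Converges toward w ≈ 0.67, b ≈ 1.19, the closed-form
# least-squares values for this toy data.
print(w, b)
```

Each iteration nudges the line slightly toward the direction that reduces the error; with a small enough learning rate the loop settles at the minimum of the loss.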
Interpretability — The learned w tells us how much y changes when x increases by 1. For example, with house area (x) and price (y), w > 0 means 'larger area, higher price', matching intuition. This interpretability matters when trusting and improving models in practice.
Foundation for other models — Logistic regression (Ch05) and even a single neuron in a neural network are built from 'linear transformation + nonlinear function'. Understanding linear regression clarifies how their linear part works.
Regression — Used to predict continuous numbers: house prices, sales, temperature, scores. With multiple features, y = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n + b becomes multiple linear regression.
Feature importance — Features with larger |w_i| have more influence on predictions (assuming the features are on comparable scales). When doing feature engineering (Ch01), we use these values to decide which features to keep or drop.
Normal equation vs gradient descent — With few features, the normal equation gives the optimal solution in one step. With many features or large data, gradient descent updates w iteratively. Partial derivatives and gradients from Basic Math Ch08 are the key tools here.
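A minimal sketch of the normal equation on synthetic multi-feature data (the "true" weights below are made up for the demonstration). Because the synthetic data is noiseless, the solve recovers them exactly, and comparing the |w_i| of the recovered coefficients gives the feature-importance reading mentioned above:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 2))                  # two features
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + 1.0     # hypothetical true w = (3.0, 0.5), b = 1.0

# Append a ones column so the intercept b is learned as one more weight.
Xb = np.column_stack([X, np.ones(n)])

# Normal equation: theta = (X^T X)^{-1} X^T y, computed with a linear
# solve rather than an explicit matrix inverse.
theta = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)
print(theta)  # ≈ [3.0, 0.5, 1.0]; |w_0| > |w_1| marks feature 0 as more influential
```

This one-step solve is exact but costs a solve over an n_features × n_features system, which is why gradient descent takes over when the feature count or dataset size grows large.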
Summary: A process of trial and error that reduces error — Linear regression is like a detective finding the one line (y = wx + b) that best passes through scattered data points. Model (assumption): We start by drawing a random line. Of course it doesn't fit the data well, so the error is large. Learning: We use gradient descent to reduce this error, like walking down a mountain with eyes closed, step by step, toward the lowest valley (the point of minimum error). Prediction: Once we reach the valley floor, we have found the optimal slope (w) and position (b). Now when a new question (x) arrives, we simply plug it into this finished formula to predict the answer (\hat{y}) instantly.
Three steps: extracting a rule from data — Linear regression finds a simple rule (y = wx + b) within complex data.
① Model — We assume that input (x) and target (y) have a linear relationship and set up the model.
② Optimization (training) — We compute the loss (the difference between prediction \hat{y} and actual y), then use gradient descent to update w (slope) and b (intercept) little by little to minimize it. This is exactly the same principle as deep learning.
③ Inference (prediction) — The learned line compresses the data's pattern. When new data arrives, we substitute it into the line formula and predict the result instantly.
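The three steps above can be gathered into one small class. This is a minimal sketch, not a library API; the class name, data, learning rate, and iteration count are all illustrative assumptions:

```python
import numpy as np

class SimpleLinearRegression:
    """① Model: assume a linear relationship y_hat = w*x + b."""

    def __init__(self, lr=0.02, n_iters=5000):
        self.lr, self.n_iters = lr, n_iters
        self.w, self.b = 0.0, 0.0  # start from an arbitrary line

    def fit(self, x, y):
        """② Optimization: gradient descent on the MSE loss."""
        for _ in range(self.n_iters):
            y_hat = self.w * x + self.b
            self.w -= self.lr * 2 * np.mean((y_hat - y) * x)
            self.b -= self.lr * 2 * np.mean(y_hat - y)
        return self

    def predict(self, x):
        """③ Inference: plug new x into the learned line."""
        return self.w * x + self.b

# Toy data (illustrative values roughly following a line)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.9, 2.4, 3.3, 3.9, 4.5])

model = SimpleLinearRegression().fit(x, y)
print(model.predict(np.array([6.0])))  # close to 0.67*6 + 1.19 ≈ 5.21 for this data
```

Note that all the training work happens once, inside `fit`; `predict` is just a multiply and an add, which is exactly the "instant" inference described in step ③.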