Ch.03
Linear Regression: A Line That Cuts Through the Flow of Data
When data points are scattered, linear regression finds the line that best fits their trend and predicts values for new inputs. It is the first regression model where you can see how functions, derivatives, and partial derivatives from Basic Math lead directly to machine learning 'training'.
[Diagram: ① training data as an (x, y) scatter plot; ② the line ŷ = wx + b, learned by gradient descent]
Linear Regression: A Line Through the Data
What is linear regression? — We assume a linear relationship y = wx + b (or y = w₁x₁ + ⋯ + wₙxₙ + b for multiple variables) between input x and output y, and find the weight w and intercept b that best fit the data. The function y = f(x) from Basic Math Ch01 is here a concrete linear function.
What does 'best fit' mean? — We minimize the error between predictions ŷ and actual values y. The function that measures this error is the loss function; MSE (Mean Squared Error), covered in Ch04, is the most common.
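The model and its MSE loss can be sketched in a few lines of plain Python (function names here are illustrative, not from any particular library):

```python
def predict(x, w, b):
    # Linear model: y_hat = w * x + b
    return w * x + b

def mse(xs, ys, w, b):
    # Mean Squared Error: average of squared (prediction - actual) differences
    n = len(xs)
    return sum((predict(x, w, b) - y) ** 2 for x, y in zip(xs, ys)) / n

xs, ys = [1, 2, 3], [2, 4, 6]        # data lying exactly on y = 2x
print(mse(xs, ys, 2.0, 0.0))          # → 0.0 (perfect fit)
print(mse(xs, ys, 1.0, 0.0))          # larger, since this line misses the points
```

The loss is zero only when the line passes through every point; any miss contributes a squared penalty.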
Difference from KNN — KNN predicted by the 'average of neighbors'; linear regression learns and stores one formula, the line ŷ = wx + b. At prediction time, we only compute wx + b without searching for neighbors.
First application of differentiation and optimization — To minimize error, we use differentiation (Basic Math Ch06). Repeatedly stepping against the gradient of the loss with respect to w and b leads toward the minimum. This is gradient descent, the same principle behind deep learning training.
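A minimal sketch of one gradient-descent step, using the partial derivatives of MSE (∂L/∂w = (2/n)Σ(wx + b − y)·x and ∂L/∂b = (2/n)Σ(wx + b − y)); the learning rate 0.05 and step count are arbitrary choices for this toy data:

```python
def gradient_step(xs, ys, w, b, lr=0.01):
    # Partial derivatives of the MSE loss with respect to w and b
    n = len(xs)
    dw = (2 / n) * sum((w * x + b - y) * x for x, y in zip(xs, ys))
    db = (2 / n) * sum((w * x + b - y) for x, y in zip(xs, ys))
    # Move against the gradient: one small step downhill on the loss surface
    return w - lr * dw, b - lr * db

w, b = 0.0, 0.0                       # start from an arbitrary line
for _ in range(2000):
    w, b = gradient_step([1, 2, 3], [2, 4, 6], w, b, lr=0.05)
print(round(w, 3), round(b, 3))       # approaches w = 2, b = 0
```

Each step shrinks the loss a little; after enough steps the line settles onto the data's true trend.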
Interpretability — The learned w tells us 'how much y changes when x increases by 1'. For example, with house area (x) and price (y), w > 0 means 'larger area, higher price', matching intuition. This interpretability matters when trusting and improving models in practice.
Foundation for other models — Logistic regression (Ch05) and even a single neuron in a neural network both use 'linear transformation + nonlinear function'. Understanding linear regression clarifies how their linear part works.
Regression — Used to predict continuous numbers: house prices, sales, temperature, scores. With multiple features, the model extends to y = w₁x₁ + ⋯ + wₙxₙ + b, called multiple linear regression.
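The multi-feature case is the same formula with a sum over features. A tiny sketch (the feature names and weight values below are made up for illustration):

```python
def predict_multi(x, w, b):
    # Multiple linear regression: y_hat = w1*x1 + ... + wn*xn + b
    return sum(wi * xi for wi, xi in zip(w, x)) + b

# Hypothetical house: features = [area, rooms], weights assumed already learned
print(predict_multi([84.0, 3.0], [0.5, 10.0], 5.0))  # → 84*0.5 + 3*10 + 5 = 77.0
```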
Feature importance — Features with larger |wᵢ| have more influence on predictions (assuming the features are on comparable scales). When doing feature engineering (Ch01), we use these values to decide which features to keep or drop.
Normal equation vs gradient descent — With few features, the normal equation gives the optimal solution in one step. With many features or large data, gradient descent updates iteratively. Partial derivatives and gradients from Basic Math Ch08 are the key tools here.
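For one feature, the normal equation reduces to a closed form: w = cov(x, y) / var(x) and b = ȳ − w·x̄. A self-contained sketch (this simplification is standard, though the function name is ours):

```python
def fit_closed_form(xs, ys):
    # One-feature normal equation: w = cov(x, y) / var(x), b = mean(y) - w * mean(x)
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    w = cov / var
    b = my - w * mx
    return w, b

w, b = fit_closed_form([1, 2, 3, 4], [3, 5, 7, 9])  # data on y = 2x + 1
print(w, b)  # → 2.0 1.0
```

No iteration needed: the optimum comes out in one computation, which is exactly why this route wins when the feature count is small.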
Summary: A process of trial and error that reduces error — Linear regression is like a detective finding the single line (y = wx + b) that best passes through scattered data points. Model (assumption): We start by drawing an arbitrary line; of course it doesn't fit the data well, so the error is large. Learning: We use gradient descent to reduce this error, like walking down a mountain with eyes closed, step by step, toward the lowest valley (the point of minimum error). Prediction: Once we reach the valley floor, we've found the optimal slope (w) and position (b). Now when a new question (x) arrives, we simply plug it into this finished formula to predict the answer (ŷ) instantly.
Three steps: extracting a rule from data — Linear regression finds a simple rule (y = wx + b) within complex data.
① Model — We assume "input (x) and target (y) have a linear relationship" and set up the model ŷ = wx + b.
② Optimization (training) — We compute the loss (the difference between prediction ŷ and actual y), then use gradient descent to update w (slope) and b (intercept) little by little to minimize it. This is exactly the same principle as deep learning.
③ Inference (prediction) — The learned line compresses the data's pattern. When new data arrives, we substitute it into the formula ŷ = wx + b and predict the result instantly.
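The three steps above (model, optimization, inference) can be tied together in one small class; this is an illustrative sketch in plain Python, not a production implementation, and the hyperparameters (lr=0.05, 2000 epochs) are assumptions for this toy data:

```python
class LinearRegressionGD:
    """Minimal linear regression trained by gradient descent (illustrative sketch)."""

    def __init__(self, lr=0.05, epochs=2000):
        self.lr, self.epochs = lr, epochs
        self.w, self.b = 0.0, 0.0          # ① Model: start with an arbitrary line

    def fit(self, xs, ys):
        # ② Optimization: repeated gradient-descent updates on MSE
        n = len(xs)
        for _ in range(self.epochs):
            dw = (2 / n) * sum((self.w * x + self.b - y) * x for x, y in zip(xs, ys))
            db = (2 / n) * sum((self.w * x + self.b - y) for x, y in zip(xs, ys))
            self.w -= self.lr * dw
            self.b -= self.lr * db
        return self

    def predict(self, x):
        # ③ Inference: plug the new input into the learned formula
        return self.w * x + self.b

model = LinearRegressionGD().fit([1, 2, 3], [3, 5, 7])  # data on y = 2x + 1
print(model.predict(4))  # close to 9
```

Note that fit() is where all the work happens; predict() is a single multiply-and-add, which is the "stored formula" advantage over KNN mentioned earlier.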