Ch.01
Supervised, Unsupervised, and Self-Supervised Learning
Machine learning is often divided into supervised, unsupervised, and self-supervised learning depending on how data is used. Supervised learning is like studying with an answer key; unsupervised learning is like finding patterns and grouping similar items without labels; self-supervised learning is like masking part of the data and learning by predicting the missing part. This chapter summarizes the core ideas, math, and real-world use of these three paradigms so you can build a solid base for the algorithms covered later.
[Figure: three learning paradigms. Supervised: input–label pairs (x₁,y₁), (x₂,y₂), (x₃,y₃) are given in order and the model learns the rule. Unsupervised: only inputs x₁…x₆, no label y; the model still finds structure and clusters. Self-supervised: ① mask part of the data, ② predict the gap, ③ fill it in (fill-in-the-blank representation learning, as in BERT).]
Three Ways of Learning: Supervised, Unsupervised, Self-Supervised
Supervised Learning: Learning from input–label pairs
The model is given input x and the corresponding label (target) y as pairs. The goal is to approximate a function f with y ≈ f(x). Formally, we have a training set {(x₁,y₁), …, (xₙ,yₙ)} and find f by minimizing a loss L(f(x), y) (e.g. MSE, cross-entropy). Ch02 KNN, Ch03 Linear Regression, and Ch04 Logistic Regression are all supervised.
* Example 1 (classification): Spam filter—email content (x) → spam or not (y).
* Example 2 (regression): House price—area, location (x) → price (y).
* Example 3 (medical): Patient test values (x) and diagnosis (y) for decision support.
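As a concrete sketch of the loss-minimization step, the least-squares fit below finds the line that minimizes MSE on a tiny house-price dataset (all numbers are invented for illustration):

```python
import numpy as np

# Toy supervised dataset: x = house area (m^2), y = price.
# Here y happens to equal 3 * area, so the fit is exact.
X = np.array([[30.0], [45.0], [60.0], [80.0], [100.0]])
y = np.array([90.0, 135.0, 180.0, 240.0, 300.0])

# Add a bias column and solve the least-squares problem:
# w = argmin_w ||Xb @ w - y||^2, i.e. the MSE-minimizing linear fit.
Xb = np.hstack([X, np.ones((len(X), 1))])
w, *_ = np.linalg.lstsq(Xb, y, rcond=None)

slope, intercept = w
pred = slope * 70.0 + intercept  # predict the price of a 70 m^2 house
```

With exact linear data the solver recovers slope 3 and intercept 0, so the 70 m² prediction is 210; on real data the same call returns the best fit in the MSE sense.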
Unsupervised Learning: Discovering hidden structure
Only the input x is given; there is no label y. Think of it as "only questions, no answer key." The goal is to find structure, patterns, or clusters using distance and similarity between the xᵢ: group similar points (clustering), compress to fewer dimensions (dimensionality reduction), or flag anomalies that fall outside the normal pattern.
* Example 1 (clustering): Customer age and purchase history (x) → segment similar customers.
* Example 2 (anomaly detection): Learn normal payment patterns (x), then flag unusual transactions.
* Example 3 (dimension reduction): Reduce many features to 2–3 numbers for visualization or denoising. (You’ll learn concrete methods later.)
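The clustering idea can be sketched with a minimal K-Means loop in plain NumPy (a toy stand-in for the Ch08 algorithm; the customer numbers below are invented):

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Minimal K-Means: alternate nearest-centroid assignment and centroid update."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]  # random initial centroids
    for _ in range(iters):
        # Assign each point to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned points
        # (keep the old centroid if a cluster ends up empty).
        centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return labels, centers

# Hypothetical customer data: (age, purchase count), with two obvious groups.
X = np.array([[25, 5], [27, 6], [24, 4],
              [61, 41], [59, 39], [62, 40]], dtype=float)
labels, centers = kmeans(X, k=2)
```

No label y appears anywhere: the two segments emerge purely from distances between the inputs.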
Self-Supervised Learning: Creating targets from data
Instead of human labels, the model creates pseudo-labels from the data. Typical flow:
(1) Mask part of the input (e.g. a word, an image patch).
(2) Predict the masked part from the rest.
(3) Use the learned representation for downstream tasks with a small amount of supervised data. This is how BERT, GPT, and many vision models are pre-trained on large unlabeled corpora.
* Example 1 (language): "I ate [MASK]" → predict the masked word from context (LLMs).
* Example 2 (vision): Mask a region of an image and reconstruct it from the rest.
* Example 3 (contrastive): Treat two augmented views of the same image as "same" and different images as "different" to learn representations.
Data nature and cost — Building labels for all data is expensive. When labels are sufficient, supervised is effective; when they are scarce, unsupervised or self-supervised use unlabeled data, then a small supervised fine-tuning step. Interpretability also differs: supervised allows some explanation via loss and decision path; unsupervised/self-supervised require separate interpretation (e.g. cluster names, visualization).
Pre-training and fine-tuning — Modern pipelines often use self-supervised pre-training on large unlabeled data, then supervised fine-tuning on a small labeled set. Unsupervised is common in preprocessing and exploration—e.g. cluster customers with K-Means, assign human meanings to clusters (e.g. "loyal", "churn risk"), then build a supervised churn model. Choosing the right paradigm makes the pipeline clear and realistic given data size and label cost.
Supervised — Ch02 KNN, Ch03 Linear Regression, Ch04 Logistic Regression learn from (input, label) pairs. Classification: spam filter, disease prediction, image classification. Regression: house price, sales, temperature—Ch03/Ch04 cover the math and optimization.
Unsupervised — Ch08 K-Means clusters data without labels; dimension reduction (reducing many features to 2–3 numbers) is another key tool. Clustering: customer segmentation, topic grouping. Anomaly detection: learn a "normal" region, flag points outside it.
Self-supervised — BERT (masked word prediction), GPT (next-token prediction), and contrastive learning in vision are widely used. After pre-training, a small amount of labeled data is used for QA, summarization, or classification.
Summary —
(1) Supervised: learn from (x, y) pairs.
(2) Unsupervised: find structure/clusters from x only.
(3) Self-supervised: learn from pseudo-labels (e.g. masked tokens), then use small supervised data for downstream tasks.
| | Supervised | Unsupervised | Self-Supervised |
|---|---|---|---|
| Label | Yes (y) | No | Self-created target |
| Goal | Predict y (classification/regression) | Structure, clusters, dimensionality reduction | Representation learning |
| Examples | KNN, linear/logistic regression | K-Means, dimension reduction | BERT, contrastive learning |
By problem type — Definition: supervised = (x,y) pairs; unsupervised = no label; self-supervised = self-created target. Task: Human-provided labels? → Supervised. No labels, only grouping/reduction? → Unsupervised. Labels derived from data (e.g. masked word)? → Self-supervised. Scenarios: spam classification (supervised), customer clustering (unsupervised), predict masked word (self-supervised).
One-line comparison — Supervised: "Learn from (question, answer) pairs." Unsupervised: "No answers—only group or reduce the data." Self-supervised: "Mask part of the data and predict the gap to learn representations." In problems, check whether labels exist and whether they are human-provided or data-derived to choose the type.
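The label-checking rule above can be restated as a small helper function (a hypothetical utility, written only to make the decision checklist explicit):

```python
def choose_paradigm(has_labels: bool, labels_from_data: bool = False) -> str:
    """Pick a learning paradigm from two questions:
    do labels exist, and are they derived from the data itself?"""
    if has_labels and labels_from_data:
        return "self-supervised"   # e.g. masked-word prediction
    if has_labels:
        return "supervised"        # e.g. spam classification
    return "unsupervised"          # e.g. customer clustering
```

Applying it to the three scenarios in the text: human-labeled spam data is supervised, unlabeled customer data is unsupervised, and masked-word targets derived from raw text are self-supervised.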