Ch.01

Supervised, Unsupervised, and Self-Supervised Learning

Machine learning is often divided into supervised, unsupervised, and self-supervised learning depending on how data is used. Supervised learning is like studying with an answer key; unsupervised learning is like finding patterns and grouping similar items without labels; self-supervised learning is like masking part of the data and learning by predicting the missing part. This chapter summarizes the core ideas, math, and real-world use of these three paradigms so you can build a solid base for the algorithms covered later.


Three learning paradigms: supervised (input–label pairs), unsupervised (no label), self-supervised (self-created target).

Supervised: input x and label y come in pairs, e.g. (x₁, y₁), (x₂, y₂), (x₃, y₃). When such pairs are given, the model learns the rule that maps x to y.

Unsupervised: only input x (no label y), e.g. x₁, x₂, …, x₆. There is no y at all, yet the model still finds structure and clusters.

Self-supervised: mask part of the data and predict the gap, e.g. the sequence 1, 2, [mask], 4. The model masks, predicts, and fills the missing value; this fill-in-the-blank objective yields representation learning (BERT, etc.).

Three Ways of Learning: Supervised, Unsupervised, Self-Supervised

Supervised Learning: Learning from input–label pairs
The model is given input x and the corresponding label (target) y as pairs. The goal is to approximate a function y = f(x). Formally, we have a training set D = {(x₁, y₁), (x₂, y₂), …} and find f by minimizing a loss (e.g. MSE, cross-entropy). Ch02 KNN, Ch03 Linear Regression, and Ch04 Logistic Regression are all supervised.
* Example 1 (classification): Spam filter: email content (x) → spam or not (y).
* Example 2 (regression): House price: area, location (x) → price (y).
* Example 3 (medical): Patient test values (x) and diagnosis (y) for decision support.
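The supervised setup can be sketched in a few lines: given (x, y) pairs, fit f by minimizing MSE. The data-generating rule (y = 2x + 1) and the closed-form least-squares fit below are illustrative assumptions, not the chapter's method:

```python
import numpy as np

# Supervised sketch: (x, y) pairs drawn from an assumed rule y = 2x + 1
# plus a little noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.1, size=50)

# Fit y ≈ w*x + b by minimizing mean squared error (MSE).
X = np.column_stack([x, np.ones_like(x)])    # add a bias column
w, b = np.linalg.lstsq(X, y, rcond=None)[0]  # closed-form MSE minimizer

# w and b should land near the true rule's 2 and 1.
```

Ch03 derives this least-squares fit properly; the point here is only that the answer key (y) drives the learning.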
Unsupervised Learning: Discovering hidden structure
Only input x is given; there is no label y. Think of it as "only questions, no answer key." The goal is to find structure, patterns, or clusters using distance and similarity between the inputs x: group similar points (clustering), compress to fewer dimensions (dimensionality reduction), or flag anomalies that fall outside the normal pattern.
* Example 1 (clustering): Customer age and purchase history (x) → segment similar customers.
* Example 2 (anomaly detection): Learn normal payment patterns (x), then flag unusual transactions.
* Example 3 (dimension reduction): Reduce many features to 2–3 numbers for visualization or denoising. (You’ll learn concrete methods later.)
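The clustering idea above can be sketched with a minimal K-Means loop, assuming two well-separated 2-D blobs and k = 2; the data and the deterministic initialization are illustrative, and Ch08 develops K-Means properly:

```python
import numpy as np

# Unsupervised sketch: synthetic 2-D data with no labels, only x.
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=0.0, scale=0.5, size=(30, 2))  # points near (0, 0)
blob_b = rng.normal(loc=5.0, scale=0.5, size=(30, 2))  # points near (5, 5)
X = np.vstack([blob_a, blob_b])

centers = np.vstack([X[0], X[-1]])  # init: one point from each end of X
for _ in range(10):                 # alternate assign / update steps
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = d.argmin(axis=1)       # assign each point to its nearest center
    centers = np.array([X[labels == k].mean(axis=0) for k in range(2)])

# The recovered centers should sit near (0, 0) and (5, 5),
# even though no label ever said which point belongs where.
```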
Self-Supervised Learning: Creating targets from data
Instead of human labels, the model creates pseudo-labels from the data. Typical flow:
(1) Mask part of the input (e.g. a word, an image patch).
(2) Predict the masked part from the rest.
(3) Use the learned representation for downstream tasks with a small amount of supervised data. This is how BERT, GPT, and many vision models are pre-trained on large unlabeled corpora.
* Example 1 (language): "I ate [ MASK ]" → predict the masked word from context (LLMs).
* Example 2 (vision): Mask a region of an image and reconstruct it from the rest.
* Example 3 (contrastive): Treat two augmented views of the same image as "same" and different images as "different" to learn representations.
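The mask-and-predict flow can be sketched on a toy 1-D signal: the targets are carved out of the data itself, so no human labels are needed. The random-walk signal and the linear predictor are illustrative assumptions, vastly simpler than BERT-style masking:

```python
import numpy as np

# Self-supervised sketch: mask the middle value of each 3-wide window
# and predict it from its two neighbors.
rng = np.random.default_rng(0)
series = np.cumsum(rng.normal(size=500))  # unlabeled 1-D signal

context = np.column_stack([series[:-2], series[2:]])  # neighbors of step t
target = series[1:-1]                                 # the "masked" values

# Fit a linear fill-in-the-blank predictor on the pseudo-labels.
A = np.column_stack([context, np.ones(len(target))])
w, *_ = np.linalg.lstsq(A, target, rcond=None)

pred = A @ w
mse = np.mean((pred - target) ** 2)
# mse should be far below the raw variance of the signal: the model
# has learned structure without any human-provided label.
```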
Data nature and cost — Building labels for all data is expensive. When labels are sufficient, supervised is effective; when they are scarce, unsupervised or self-supervised use unlabeled data, then a small supervised fine-tuning step. Interpretability also differs: supervised allows some explanation via loss and decision path; unsupervised/self-supervised require separate interpretation (e.g. cluster names, visualization).
Pre-training and fine-tuning — Modern pipelines often use self-supervised pre-training on large unlabeled data, then supervised fine-tuning on a small labeled set. Unsupervised is common in preprocessing and exploration—e.g. cluster customers with K-Means, assign human meanings to clusters (e.g. "loyal", "churn risk"), then build a supervised churn model. Choosing the right paradigm makes the pipeline clear and realistic given data size and label cost.
Supervised — Ch02 KNN, Ch03 Linear Regression, Ch04 Logistic Regression learn from (input, label) pairs. Classification: spam filter, disease prediction, image classification. Regression: house price, sales, temperature—Ch03/Ch04 cover the math and optimization.
Unsupervised — Ch08 K-Means clusters data without labels; dimension reduction (reducing many features to 2–3 numbers) is another key tool. Clustering: customer segmentation, topic grouping. Anomaly detection: learn a "normal" region, flag points outside it.
Self-supervised — BERT (masked word prediction), GPT (next-token prediction), and contrastive learning in vision are widely used. After pre-training, a small amount of labeled data is used for QA, summarization, or classification.
Summary
(1) Supervised: learn y = f(x) from (x, y) pairs.
(2) Unsupervised: find structure/clusters from x only.
(3) Self-supervised: learn from pseudo-labels (e.g. masked tokens), then use small supervised data for downstream tasks.
|                 | Label               | Goal                                          | Examples                        |
|-----------------|---------------------|-----------------------------------------------|---------------------------------|
| Supervised      | Yes (y)             | Predict y (classification/regression)         | KNN, linear/logistic regression |
| Unsupervised    | No                  | Structure, clusters, dimensionality reduction | K-Means, dimension reduction    |
| Self-supervised | Self-created target | Representation learning                       | BERT, contrastive learning      |
By problem type. Definition: supervised = (x, y) pairs; unsupervised = no label; self-supervised = self-created target. Task: human-provided labels? → supervised. No labels, only grouping/reduction? → unsupervised. Labels derived from the data (e.g. a masked word)? → self-supervised. Scenarios: spam classification (supervised), customer clustering (unsupervised), predicting a masked word (self-supervised).
One-line comparison — Supervised: "Learn from (question, answer) pairs." Unsupervised: "No answers; only group or reduce the data." Self-supervised: "Mask part of the data and predict the gap to learn representations." To classify a problem, check whether labels exist and, if so, whether they are human-provided or derived from the data.
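The checklist above can be written as a tiny helper; the two flags and the function name are purely illustrative, not part of the chapter:

```python
def paradigm(has_labels: bool, labels_from_data: bool = False) -> str:
    """Map the label-availability checklist to a paradigm name."""
    if has_labels and labels_from_data:
        return "self-supervised"  # targets derived from the data itself
    if has_labels:
        return "supervised"       # human-provided labels
    return "unsupervised"         # no labels: grouping / reduction
```

For the three scenarios: spam classification is `paradigm(True)`, customer clustering is `paradigm(False)`, and masked-word prediction is `paradigm(True, labels_from_data=True)`.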