Ch.08
K-Means Clustering: Grouping Without Labels
K-Means is a classic unsupervised learning algorithm that groups data into K clusters using distance—no labels. You will see how the 'unsupervised' idea from Ch01 works in practice: concept → intuition → math → application. It reuses the distance formula from Ch02 (KNN) and shows how repeating 'assign to nearest center' and 'update centers' yields clear clusters.
Assign each point to the nearest center, then move centers to the mean of assigned points; repeat.
(Diagram: ① Data — unlabeled points in feature space)
What is K-Means? — With no labels, only data, K-Means partitions points into $K$ groups by nearest centroid. Distance is Euclidean (as in Ch02). Each group has one centroid. The algorithm alternates: assign each point to the nearest center → set each center to the mean of its assigned points, until convergence.
K is the number of clusters — The user chooses K (e.g. K=2 → two groups). There are no 'correct' labels, only a partition. In practice, K is chosen by domain knowledge, the elbow method, or silhouette scores.
Objective: minimize SSE (distortion) — K-Means minimizes $J = \sum_{k=1}^{K} \sum_{x \in C_k} \lVert x - \mu_k \rVert^2$. The update (mean of assigned points) reduces each cluster's SSE.
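The objective can be computed directly. A minimal sketch in plain Python; the points, labels, and centers below are made-up illustration data:

```python
# SSE (distortion): sum of squared distances from each point to its cluster's center.
def sse(points, labels, centers):
    total = 0.0
    for (x, y), k in zip(points, labels):
        cx, cy = centers[k]
        total += (x - cx) ** 2 + (y - cy) ** 2
    return total

points = [(1, 1), (2, 2), (8, 8)]
labels = [0, 0, 1]               # point -> cluster index
centers = [(1.5, 1.5), (8, 8)]
print(sse(points, labels, centers))  # 1.0
```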
If the formulas feel heavy — The distance formula is just 'length between a point and a center.' SSE is a single number for 'how tightly points sit around their center'; the algorithm moves centers to make $J$ smaller. The centroid update is literally 'average of the coordinates of points in that cluster.' The Formula guide below spells out each symbol step by step.
Ch01 unsupervised learning in action — K-Means is the go-to when you have no labels and want structure (e.g. customer segmentation, clustering documents or images, preprocessing for anomaly detection).
Customer segmentation — With only purchase history and no segment labels, K-Means groups similar customers; people then attach meaning (e.g. VIP, churn risk) to each cluster and use it for downstream tasks (Ch09, Ch12).
Simple and interpretable — Assign (nearest center) and update (mean) are easy to implement and visualize in 2D.
Clustering — Customer segmentation, topic/document grouping, image color compression, gene expression groups.
Preprocessing — Use cluster index as a new feature for supervised models, or keep only centroids to reduce data size.
Choosing K — The user sets K; compare SSE or silhouette across K to pick a value (e.g. elbow).
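One way to see the elbow in practice is to run the clustering for several values of K and compare the final SSE. A sketch with a tiny hand-rolled K-Means on made-up 2-D data (deterministic initialization at the first K points, chosen only so the example is reproducible):

```python
def kmeans_sse(points, k, iters=50):
    """Run a tiny 2-D K-Means (init: first k points) and return the final SSE."""
    centers = list(points[:k])
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:  # assign each point to its nearest center
            j = min(range(k), key=lambda j: (p[0] - centers[j][0]) ** 2 + (p[1] - centers[j][1]) ** 2)
            clusters[j].append(p)
        centers = [  # move each center to the mean of its points (keep it if empty)
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c)) if c else centers[j]
            for j, c in enumerate(clusters)
        ]
    return sum(min((p[0] - cx) ** 2 + (p[1] - cy) ** 2 for cx, cy in centers) for p in points)

# Two well-separated blobs: SSE drops sharply from K=1 to K=2, then flattens -> elbow at K=2.
points = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
for k in (1, 2, 3):
    print(k, round(kmeans_sse(points, k), 2))
```

Plotting SSE against K, the point where the curve bends (here K=2) is the elbow.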
Summary
(1) Input: Unlabeled points and cluster count $K$.
(2) Initialize: Place centroids at random or by heuristic.
(3) Assign: Each point goes to the nearest center's cluster.
(4) Update: Set each center to the mean of its assigned points.
(5) Repeat: Steps 3–4 until assignments and centers no longer change.
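The steps above can be sketched in plain Python for 2-D points. The initialization at the first K points and the sample data are choices made here just for determinism, not part of the algorithm:

```python
def kmeans(points, k):
    centers = list(points[:k])        # (2) initialize (here: first k points)
    labels = None
    while True:
        # (3) assign: each point goes to the nearest center (squared distance suffices)
        new_labels = [
            min(range(k), key=lambda j: (x - centers[j][0]) ** 2 + (y - centers[j][1]) ** 2)
            for x, y in points
        ]
        if new_labels == labels:      # (5) stop when assignments no longer change
            return centers, labels
        labels = new_labels
        # (4) update: each center moves to the mean of its assigned points
        for j in range(k):
            member = [p for p, l in zip(points, labels) if l == j]
            if member:
                centers[j] = (sum(x for x, _ in member) / len(member),
                              sum(y for _, y in member) / len(member))

points = [(1, 1), (2, 1), (8, 8), (9, 8)]
centers, labels = kmeans(points, 2)
print(centers)  # [(1.5, 1.0), (8.5, 8.0)]
print(labels)   # [0, 0, 1, 1]
```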
Objective: Minimize SSE (distortion) $J = \sum_{k=1}^{K} \sum_{x \in C_k} \lVert x - \mu_k \rVert^2$.
Centroid update: $\mu_k = \frac{1}{|C_k|} \sum_{x \in C_k} x$.
See the table and examples below.
Terminology
| Item | Description |
|---|---|
| Distance squared | For two points $(x_1, y_1)$, $(x_2, y_2)$: $d^2 = (x_1 - x_2)^2 + (y_1 - y_2)^2$. No need for the square root when comparing. |
| Assign | For a point and centers, compute distance (or distance²) to each; the center with the smallest value (1-based index) is that point's cluster. |
| Center update | New center = (mean of $x$, mean of $y$) of points in that cluster; round if needed. |
| SSE | In one cluster: $\sum_{x \in C_k} \lVert x - \mu_k \rVert^2$ (sum of squared distances to center). |
Example (assign)
Centers $c_1 = (1, 1)$, $c_2 = (3, 3)$; point $(2, 2)$. Distance²: $(2-1)^2 + (2-1)^2 = 2$, $(2-3)^2 + (2-3)^2 = 2$; tie → cluster 1. Answer: 1
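The assignment rule as code (0-based indices here; the example values are chosen for illustration):

```python
# Assign: index of the nearest center; squared distance is enough for comparison.
def assign(point, centers):
    d2 = [(point[0] - cx) ** 2 + (point[1] - cy) ** 2 for cx, cy in centers]
    return d2.index(min(d2))  # on a tie, the earlier center wins

print(assign((2, 2), [(1, 1), (3, 3)]))  # 0 -- squared distances 2 and 2, tie -> first center
```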
Example (center update)
Points $(1, 2)$, $(3, 4)$ in cluster 1 → new center $x = (1 + 3)/2 = 2$, $y = (2 + 4)/2 = 3$. Answer: (2, 3)
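The update step as code (the example points are arbitrary illustration values):

```python
# Update: new center = coordinate-wise mean of the cluster's points.
def update_center(cluster):
    n = len(cluster)
    return (sum(x for x, _ in cluster) / n, sum(y for _, y in cluster) / n)

print(update_center([(1, 2), (3, 4)]))  # (2.0, 3.0)
```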
Formula guide
① Distance
What it means — This formula, $d(x, c) = \sqrt{\sum_i (x_i - c_i)^2}$, measures the straight-line distance (ruler length) between one data point $x$ and one center $c$.
Reading the symbols — $x_i$ is the point’s $i$-th feature value, $c_i$ is the center’s $i$-th coordinate. For example, with two features (e.g. height, weight), $x_1$ and $x_2$ are those values; you take $\sum_i (x_i - c_i)^2$, then the square root, which in 2D is the same as the Pythagorean “straight line” length.
Numeric example — Point $(1, 2)$, center $(4, 6)$. Differences: $3$, $4$. Squared distance $3^2 + 4^2 = 25$, distance $\sqrt{25} = 5$.
In K-Means — We only need to compare which center is closer, so we can skip the square root and use squared distance $d^2$; the smaller value is the closer center.
One line: Formula for (squared) straight-line distance between a point and a center.
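The distance formula as code, with a classic 3-4-5 check (example coordinates chosen here):

```python
import math

# Euclidean distance between a point and a center (any number of features).
def distance(p, c):
    return math.sqrt(sum((pi - ci) ** 2 for pi, ci in zip(p, c)))

print(distance((1, 2), (4, 6)))  # 5.0 -- differences 3 and 4, sqrt(9 + 16)
```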
② Objective (SSE)
What it means — The sum over all clusters of “squared distance from each point to its cluster center”: $J = \sum_{k=1}^{K} \sum_{x \in C_k} \lVert x - \mu_k \rVert^2$. This total is called SSE (or distortion).
Reading the symbols — $k$ = cluster index (1, 2, …). $C_k$ = set of points in cluster $k$. $x_i$ = coordinates of the $i$-th point. $\mu_k$ = center of cluster $k$. $\lVert x_i - \mu_k \rVert^2$ is the squared distance from point to that center.
Why squared? — Squaring makes “far” points count more and keeps everything positive, so $J$ clearly measures how spread out the clusters are. Smaller $J$ means points are tighter around their centers; the algorithm moves centers to reduce $J$.
Example — If cluster 1 has three points whose squared distances to its center are 1, 4, and 9, that cluster’s contribution is $1 + 4 + 9 = 14$. $J$ is the sum of such contributions over all clusters.
One line: A number that says how tight the clusters are; smaller is better.
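A quick numeric check of one cluster's SSE (the points are chosen here so the squared distances come out to 1, 4, 9):

```python
center = (0, 0)
cluster = [(1, 0), (0, 2), (3, 0)]  # squared distances to center: 1, 4, 9
sse = sum((x - center[0]) ** 2 + (y - center[1]) ** 2 for x, y in cluster)
print(sse)  # 14
```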
③ Centroid update
What it means — Take the average of the coordinates of all points in cluster $k$ and use that as the new center: $\mu_k = \frac{1}{|C_k|} \sum_{x \in C_k} x$.
Reading the symbols — $|C_k|$ = number of points in cluster $k$. $\sum x$, $\sum y$ = sums of the $x$ and $y$ coordinates of those points. Multiplying by $\frac{1}{|C_k|}$ gives the mean.
Numeric example — Cluster 1 has points $(1, 1)$, $(3, 2)$, $(5, 3)$, so $|C_1| = 3$. Sum of $x$: $1 + 3 + 5 = 9$, sum of $y$: $1 + 2 + 3 = 6$. New center $x = 9/3 = 3$, $y = 6/3 = 2$ → (3, 2).
Why the mean? — Moving the center to the mean of its points reduces the sum of squared distances (SSE) for that cluster; that’s why K-Means uses this update.
One line: New center = (mean $x$, mean $y$) of the points in that cluster.
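The centroid update in a few lines of Python (example cluster chosen so the mean comes out to whole numbers):

```python
cluster = [(1, 1), (3, 2), (5, 3)]
n = len(cluster)
center = (sum(x for x, _ in cluster) / n, sum(y for _, y in cluster) / n)
print(center)  # (3.0, 2.0)
```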