Ch.08

K-Means Clustering: Grouping Without Labels

K-Means is a classic unsupervised learning algorithm that groups data into K clusters using distance—no labels. You will see how the 'unsupervised' idea from Ch01 works in practice: concept → intuition → math → application. It reuses the distance formula from Ch02 (KNN) and shows how repeating 'assign to nearest center' and 'update centers' yields clear clusters.


Assign each point to the nearest center, then move centers to the mean of assigned points; repeat.

① Data — unlabeled points in feature space


What is K-Means? — With no labels $y$, only data $\mathbf{x}_1, \mathbf{x}_2, \ldots$, K-Means partitions points into K groups by nearest centroid. Distance is Euclidean, $d(\mathbf{x}, \boldsymbol{\mu}) = \sqrt{\sum_j (x_j - \mu_j)^2}$ (as in Ch02). Each group has one centroid $\boldsymbol{\mu}_k$. The algorithm alternates: assign each point to the nearest center → set each center to the mean of its assigned points, until convergence.
K is the number of clusters — The user chooses K (e.g. K=2 → two groups). There are no 'correct' labels, only a partition. In practice, K is chosen by domain knowledge, the elbow method, or silhouette scores.
Objective: minimize SSE (distortion) — K-Means minimizes $J = \sum_{k=1}^K \sum_{i \in C_k} \lVert\mathbf{x}_i - \boldsymbol{\mu}_k\rVert^2$. The update $\boldsymbol{\mu}_k = \frac{1}{|C_k|}\sum_{i \in C_k} \mathbf{x}_i$ (mean of assigned points) reduces each cluster's SSE.
If the formulas feel heavy — The distance formula is just 'length between a point and a center.' SSE $J$ is a single number for 'how tightly points sit around their center'; the algorithm moves centers to make $J$ smaller. The centroid update is literally 'average of the coordinates of points in that cluster.' The Formula guide below spells out each symbol step by step.
Ch01 unsupervised learning in action — K-Means is the go-to when you have no labels and want structure (e.g. customer segmentation, clustering documents or images, preprocessing for anomaly detection).
Customer segmentation — With only purchase history and no segment labels, K-Means groups similar customers; people then attach meaning (e.g. VIP, churn risk) to each cluster and use it for downstream tasks (Ch09, Ch12).
Simple and interpretable — Assign (nearest center) and update (mean) are easy to implement and visualize in 2D.
Clustering — Customer segmentation, topic/document grouping, image color compression, gene expression groups.
Preprocessing — Use cluster index as a new feature for supervised models, or keep only centroids to reduce data size.
Choosing K — The user sets K; compare SSE or silhouette across K to pick a value (e.g. elbow).
Summary
(1) Input: Unlabeled points and cluster count $K$.
(2) Initialize: Place $K$ centroids at random or by heuristic.
(3) Assign: Each point goes to the nearest center's cluster.
(4) Update: Set each center to the mean of its assigned points.
(5) Repeat: Steps 3–4 until assignments and centers no longer change.
Objective: Minimize SSE (distortion) $J = \sum_{k}\sum_{i \in C_k} \lVert\mathbf{x}_i - \boldsymbol{\mu}_k\rVert^2$.
Centroid update: $\boldsymbol{\mu}_k = \frac{1}{|C_k|}\sum_{i \in C_k} \mathbf{x}_i$.
See the table and examples below.
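Steps (1)–(5) can be sketched in plain Python. This is a minimal 2D version under stated assumptions: the toy dataset is made up, and initialization uses the first $K$ points as centers (one simple deterministic heuristic; random initialization is also common):

```python
def sq_dist(p, c):
    # Squared Euclidean distance (no square root needed for comparisons)
    return (p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2

def kmeans(points, k, max_iter=100):
    # (2) Initialize: here, the first k points serve as initial centers
    centers = [list(points[i]) for i in range(k)]
    assign = None
    for _ in range(max_iter):
        # (3) Assign: each point goes to the nearest center's cluster
        new_assign = [min(range(k), key=lambda j: sq_dist(p, centers[j]))
                      for p in points]
        if new_assign == assign:  # (5) Stop once assignments no longer change
            break
        assign = new_assign
        # (4) Update: each center moves to the mean of its assigned points
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:  # keep the old center if a cluster goes empty
                centers[j] = [sum(c) / len(members) for c in zip(*members)]
    # Final SSE (distortion) J for the resulting partition
    sse = sum(sq_dist(p, centers[a]) for p, a in zip(points, assign))
    return centers, assign, sse

pts = [(1, 2), (2, 1), (1, 1), (8, 9), (9, 8), (8, 8)]
centers, assign, sse = kmeans(pts, k=2)
print(assign)   # [0, 0, 0, 1, 1, 1]
print(centers)  # two centers, one inside each blob
```

On this tiny dataset the loop converges in a few iterations; the two blobs around (1, 1) and (8, 8) each get their own center.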
Terminology
  • Distance squared — For two points $(x_1,y_1)$, $(x_2,y_2)$: $(x_2-x_1)^2+(y_2-y_1)^2$. No need for the square root when comparing.
  • Assign — For a point and $K$ centers, compute the distance (or distance²) to each; the index (1-based) of the closest center is that point's cluster.
  • Center update — New center = (mean of $x$, mean of $y$) of the points in that cluster; round if needed.
  • SSE — Within one cluster $k$: $\sum_{i \in C_k} \lVert\mathbf{x}_i - \boldsymbol{\mu}_k\rVert^2$ (sum of squared distances to the center; that cluster's contribution to $J$).

Example (assign)
Centers $\boldsymbol{\mu}_1=(0,0)$, $\boldsymbol{\mu}_2=(4,0)$; point $(2,0)$. Distance²: $d_1^2=4$, $d_2^2=4$; tie → cluster 1 (ties break toward the lower index). Answer: 1
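Checked in a few lines of Python (a sketch of just the assign step; Python's `min` returns the first index on a tie, which reproduces the tie → cluster 1 behavior):

```python
centers = [(0, 0), (4, 0)]
point = (2, 0)

# Squared distance from the point to each center (square root not needed for comparison)
d2 = [(point[0] - cx) ** 2 + (point[1] - cy) ** 2 for cx, cy in centers]
print(d2)  # [4, 4] (a tie)

# min over indices returns the lowest index on a tie; +1 makes it 1-based
cluster = min(range(len(centers)), key=lambda j: d2[j]) + 1
print(cluster)  # 1
```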

Example (center update)
Points $(1,2)$, $(3,4)$ in cluster 1 → new center $\bar{x}=2$, $\bar{y}=3$. Answer: (2, 3)
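The same update in code (a quick sketch; `zip(*members)` groups the x-coordinates and the y-coordinates so each can be averaged):

```python
members = [(1, 2), (3, 4)]  # points currently assigned to cluster 1

# New center = coordinate-wise mean of the member points
center = tuple(sum(coord) / len(members) for coord in zip(*members))
print(center)  # (2.0, 3.0)
```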
Formula guide
① Distance $d(\mathbf{x}, \boldsymbol{\mu}) = \sqrt{\sum_j (x_j - \mu_j)^2}$
What it means — This formula measures the straight-line distance (ruler length) between one data point $\mathbf{x}$ and one center $\boldsymbol{\mu}$.
Reading the symbols — $x_j$ is the point's $j$-th feature value, $\mu_j$ is the center's $j$-th coordinate. For example, with two features (e.g. height, weight), $x_1$ and $x_2$ are those values; you take $(x_1-\mu_1)^2 + (x_2-\mu_2)^2$, then the square root, which in 2D is the same as the Pythagorean "straight line" length.
Numeric example — Point $(1,2)$, center $(4,6)$. Differences: $4-1=3$, $6-2=4$. Squared distance $= 3^2+4^2 = 25$, distance $= \sqrt{25}=5$.
In K-Means — We only need to compare which center is closer, so we can skip the square root and use squared distance $(x_1-\mu_1)^2+(x_2-\mu_2)^2$; the smaller value is the closer center.
One line: Formula for (squared) straight-line distance between a point and a center.
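The 3-4-5 example above, verified in Python (squared distance is enough for comparisons; take the square root only when the true length is needed):

```python
import math

point, center = (1, 2), (4, 6)

# Squared distance: sum of squared coordinate differences
sq = sum((x - m) ** 2 for x, m in zip(point, center))
print(sq)             # 25
print(math.sqrt(sq))  # 5.0
```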

② Objective (SSE) $J = \sum_{k}\sum_{i \in C_k} \lVert\mathbf{x}_i - \boldsymbol{\mu}_k\rVert^2$
What it means — The sum over all clusters of "squared distance from each point to its cluster center." This total is called SSE (or distortion).
Reading the symbols — $k$ = cluster index (1, 2, …). $C_k$ = set of points in cluster $k$. $\mathbf{x}_i$ = coordinates of the $i$-th point. $\boldsymbol{\mu}_k$ = center of cluster $k$. $\lVert \mathbf{x}_i - \boldsymbol{\mu}_k \rVert^2$ is the squared distance from point $i$ to that center.
Why squared? — Squaring makes "far" points count more and keeps everything positive, so $J$ clearly measures how spread out the clusters are. Smaller $J$ means points sit tighter around their centers; the algorithm moves centers to reduce $J$.
Example — If cluster 1 has three points whose squared distances to its center are 1, 4, and 9, that cluster's contribution is $1+4+9=14$. $J$ is the sum of such contributions over all clusters.
One line: A number that says how tight the clusters are; smaller is better.
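The contribution arithmetic as a sketch (the three points and the center here are made up so that the squared distances come out to 1, 4, and 9, matching the example):

```python
center = (0, 0)                      # hypothetical cluster center
cluster = [(1, 0), (0, 2), (3, 0)]   # squared distances to the center: 1, 4, 9

# This cluster's contribution to J: sum of squared distances to its center
contribution = sum((x - center[0]) ** 2 + (y - center[1]) ** 2 for x, y in cluster)
print(contribution)  # 14
```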

③ Centroid update $\boldsymbol{\mu}_k = \frac{1}{|C_k|}\sum_{i \in C_k} \mathbf{x}_i$
What it means — Take the average of the coordinates of all points in cluster $k$ and use that as the new center.
Reading the symbols — $|C_k|$ = number of points in cluster $k$. $\sum_{i \in C_k} \mathbf{x}_i$ = sum of the $x$ and $y$ coordinates of those points. Multiplying by $\frac{1}{|C_k|}$ gives the mean.
Numeric example — Cluster 1 has points $(1,2)$, $(3,4)$, $(5,0)$, so $|C_k|=3$. Sum of $x$: $1+3+5=9$; sum of $y$: $2+4+0=6$. New center $\bar{x}=9/3=3$, $\bar{y}=6/3=2$ → (3, 2).
Why the mean? — Moving the center to the mean of its points reduces the sum of squared distances (SSE) for that cluster; that's why K-Means uses this update.
One line: New center = (mean $x$, mean $y$) of the points in that cluster.
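The "why the mean?" claim can be spot-checked numerically: for the three points in the numeric example, no candidate center on a small grid beats the mean (3, 2) on SSE (a brute-force sketch, not a proof):

```python
points = [(1, 2), (3, 4), (5, 0)]

def sse(center):
    # Sum of squared distances from each point to the candidate center
    return sum((x - center[0]) ** 2 + (y - center[1]) ** 2 for x, y in points)

mean = tuple(sum(coord) / len(points) for coord in zip(*points))
print(mean, sse(mean))  # (3.0, 2.0) 16.0

# Every center on a 0.5-spaced grid does at least as badly as the mean
grid = [(x / 2, y / 2) for x in range(13) for y in range(9)]
best = min(grid, key=sse)
print(best)  # (3.0, 2.0)
```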