Everyone's AI
Ch.10

K-Means Clustering: Grouping Without Labels

K-Means is a classic unsupervised learning algorithm that groups data into K clusters using only distances; no labels required. You will see how the 'unsupervised' idea from Ch01 works in practice: concept → intuition → math → application. It reuses the distance formula from Ch02 (KNN) and shows how alternately repeating 'assign each point to the nearest center' and 'move each center to the mean of its points' yields clear clusters.

[Diagram: the K-Means loop — assign each point to the nearest center, then move centers to the mean of assigned points; repeat. Panel ① Data: unlabeled points in feature space.]

What is K-Means? — With no labels $y$, only data $\mathbf{x}_1, \mathbf{x}_2, \ldots$, K-Means partitions points into K groups by nearest centroid. Distance is Euclidean $d(\mathbf{x}, \boldsymbol{\mu}) = \sqrt{\sum_j (x_j - \mu_j)^2}$ (as in Ch02). Each group has one centroid $\boldsymbol{\mu}_k$. The algorithm alternates: assign each point to the nearest center → set each center to the mean of its assigned points, until convergence.
K is the number of clusters — The user chooses K (e.g. K=2 → two groups). There are no 'correct' labels, only a partition. In practice, K is chosen by domain knowledge, the elbow method, or silhouette scores.
Objective: minimize SSE (distortion) — K-Means minimizes $J = \sum_{k=1}^K \sum_{i \in C_k} \|\mathbf{x}_i - \boldsymbol{\mu}_k\|^2$. The update $\boldsymbol{\mu}_k = \frac{1}{|C_k|}\sum_{i \in C_k} \mathbf{x}_i$ (mean of assigned points) reduces each cluster's SSE.
If the formulas feel heavy — The distance formula is just 'length between a point and a center.' SSE $J$ is a single number for 'how tightly points sit around their center'; the algorithm moves centers to make $J$ smaller. The centroid update is literally 'average of the coordinates of points in that cluster.' The Formula guide below spells out each symbol step by step.
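The assign/update loop above can be sketched in a few lines of NumPy. This is a minimal illustration, not the chapter's official code; the function name `kmeans` and its parameters (including the optional `init` for hand-picked starting centers) are ours:

```python
import numpy as np

def kmeans(X, k, n_iter=100, init=None, seed=0):
    """Plain K-Means: alternate nearest-center assignment and mean update."""
    rng = np.random.default_rng(seed)
    # Initialize centroids: user-supplied, or k distinct random data points.
    centers = (np.asarray(init, dtype=float) if init is not None
               else X[rng.choice(len(X), size=k, replace=False)].astype(float))
    for _ in range(n_iter):
        # Assignment step: each point takes the index of its nearest centroid.
        labels = np.linalg.norm(X[:, None] - centers[None], axis=2).argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its old centroid).
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):  # centers stopped moving: converged
            break
        centers = new_centers
    # Final assignment and the objective J (sum of squared errors).
    labels = np.linalg.norm(X[:, None] - centers[None], axis=2).argmin(axis=1)
    sse = float(((X - centers[labels]) ** 2).sum())
    return labels, centers, sse
```

Note that each iteration can only lower $J$ (assignment picks the nearest center; the mean minimizes within-cluster SSE), which is why the loop settles.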

Why it matters

Ch01 unsupervised learning in action — K-Means is the go-to when you have no labels and want structure (e.g. customer segmentation, clustering documents or images, preprocessing for anomaly detection).
Customer segmentation — With only purchase history and no segment labels, K-Means groups similar customers; people then attach meaning (e.g. VIP, churn risk) to each cluster and use it for downstream tasks (Ch09, Ch12).
Simple and interpretable — Assign (nearest center) and update (mean) are easy to implement and visualize in 2D.

How it is used

Clustering — Customer segmentation, topic/document grouping, image color compression, gene expression groups.
Preprocessing — Use cluster index as a new feature for supervised models, or keep only centroids to reduce data size.
Choosing K — The user sets K; compare SSE or silhouette across K to pick a value (e.g. elbow).
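The elbow comparison above can be sketched by recording the final SSE for several candidate K values. A minimal sketch assuming NumPy; `sse_for_k` is an illustrative helper (in scikit-learn the same quantity is exposed as `KMeans.inertia_`):

```python
import numpy as np

def sse_for_k(X, k, init=None, n_iter=50, seed=0):
    """Run plain K-Means for a given K and return the final SSE."""
    rng = np.random.default_rng(seed)
    centers = (np.asarray(init, dtype=float) if init is not None
               else X[rng.choice(len(X), size=k, replace=False)].astype(float))
    for _ in range(n_iter):
        # Assign each point to its nearest center, then re-average the centers.
        labels = np.linalg.norm(X[:, None] - centers[None], axis=2).argmin(axis=1)
        centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    labels = np.linalg.norm(X[:, None] - centers[None], axis=2).argmin(axis=1)
    return float(((X - centers[labels]) ** 2).sum())

# Elbow idea: SSE always shrinks as K grows, so do not pick the smallest SSE;
# pick the K where the curve bends and extra clusters stop paying off, e.g.:
# sse_curve = {k: sse_for_k(X, k) for k in range(1, 8)}
```

In practice K-Means is run with several random restarts per K and the best (lowest-SSE) run is kept, since a single random initialization can land in a poor local optimum.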