Ch.10

Classification Metrics: The Model's Detailed Report Card

Learn the 'detailed report card' that a classification AI model receives after its test. Beyond "how many did you get right?" (accuracy), we look at confusion matrix concepts that ask "which questions did you get wrong, and how?" In business settings where *how* the model is wrong can be critical—spam filters, cancer diagnosis AI—we explain how precision, recall, and F1 prove the model's real capability, with intuitive analogies.


Fill the 2×2 confusion matrix with actual labels as rows and predicted labels as columns, then compute accuracy, precision, recall, and F1.

[Diagram: 2×2 confusion matrix. Rows are actual positive/negative; columns are predicted positive/negative. Cells: TP (actual positive, predicted positive ✓), FN (actual positive, predicted negative), FP (actual negative, predicted positive), TN (actual negative, predicted negative ✓).]

Read the model's report card via the confusion matrix and choose metrics that match your goal.

Classification metrics: confusion matrix and the model's report card

What is the confusion matrix? The AI's detailed report card — Just as a raw exam score doesn't tell you whether a student is strong in math or English, "how many correct" alone isn't enough to judge a classifier. The confusion matrix is a 2×2 table that compares the model's predictions (columns) with the actual answers (rows). By reading its four cells, you can see exactly what the model gets right and where it gets confused and stumbles.
The four cells: TP, TN, FP, FN — Think of the famous "boy who cried wolf." Here 'positive' means the boy cries wolf; 'negative' means peace.
* TP (True Positive): Wolf really came (1), boy cried wolf (1). Best outcome—village saved.
* TN (True Negative): No wolf (0), boy stayed quiet (0). Peace.
* FP (False Positive): No wolf (0), boy cried wolf (1). Villagers run out with pitchforks for nothing (false alarm).
* FN (False Negative): Wolf came (1), boy was asleep (0). Sheep get eaten—worst outcome (miss).
* Total count: n = TP + TN + FP + FN.
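The four counts can be tallied directly from a pair of label lists. A minimal sketch in plain Python (the eight days of wolf-story labels below are made up for illustration):

```python
def confusion_counts(actual, predicted):
    """Count the four confusion-matrix cells for binary labels (1 = positive)."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    return tp, tn, fp, fn

# 8 days: 1 = wolf came / boy cried wolf, 0 = peace / silence
actual    = [1, 0, 0, 1, 0, 0, 1, 0]
predicted = [1, 0, 1, 0, 0, 0, 1, 0]
tp, tn, fp, fn = confusion_counts(actual, predicted)
print(tp, tn, fp, fn)  # 2 4 1 1  (and n = 8)
```

Note that the four counts always sum to n, the total number of samples.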
Accuracy's dangerous trap — It is the fraction of correct answers: Accuracy = (TP + TN) / n. Intuitive but treacherous. Suppose 99 out of 100 days are peaceful and the wolf comes only once. A robot that closes its eyes and always says "No wolf!" still gets 99% accuracy. When positive cases are rare (imbalanced data), you must not trust accuracy alone.
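The trap is easy to demonstrate. A toy sketch of the "lazy" robot on the 99-peaceful-days scenario above:

```python
# 100 days, the wolf comes exactly once; the lazy model always predicts 0
actual = [1] + [0] * 99
lazy   = [0] * 100

accuracy = sum(a == p for a, p in zip(actual, lazy)) / len(actual)
print(accuracy)  # 0.99 -- looks great, yet the one real wolf was missed
```

Despite 99% accuracy, this model catches zero wolves: its recall is 0.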
Precision and recall: two rabbits to chase
* Precision (caution): "When I cried wolf, how often was it really the wolf?" The share of predicted positives that are truly positive. Precision = TP / (TP + FP). It goes up when you avoid false alarms (FP).
* Recall (sensitivity): "Of all the times the wolf actually came, how often did I notice and warn?" The share of actual positives that the model got right. Recall = TP / (TP + FN). It goes up when you miss fewer true wolves (FN).
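Both formulas reduce to one-line functions. A sketch with illustrative counts (8 true alarms, 2 false alarms, 4 missed wolves):

```python
def precision(tp, fp):
    """Of everything predicted positive, what fraction really was positive?"""
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    """Of everything actually positive, what fraction did the model catch?"""
    return tp / (tp + fn) if tp + fn else 0.0

print(precision(8, 2))          # 0.8    (8 of 10 alarms were real)
print(round(recall(8, 4), 3))   # 0.667  (8 of 12 wolves were caught)
```

The guard against a zero denominator matters in practice: a model that never predicts positive has no precision to speak of, and returning 0.0 is a common convention.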
F1 score: the golden balance of precision and recall — Precision and recall are like a seesaw: pushing one up often pushes the other down. F1 summarizes both in one number using the harmonic mean: F1 = 2·TP / (2·TP + FP + FN), which is equivalent to 2·Precision·Recall / (Precision + Recall). If either precision or recall is poor, F1 tanks. Use F1 when you want a model with good balance.
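The harmonic mean's punishing behavior is visible with two made-up pairs that share the same arithmetic average:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(f1(0.5, 0.5))  # 0.5  -- balanced pair: F1 equals the average
print(f1(0.9, 0.1))  # 0.18 -- same average, but the weak side drags F1 down
```

An arithmetic mean would score both models at 0.5; the harmonic mean exposes the lopsided one.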
AUC (Area Under the ROC Curve): the model's ranker — When the model outputs a probability (e.g. "90% chance of wolf") rather than a bare yes/no, AUC measures how well true positives get higher scores than true negatives (discriminative power), on a 0–1 scale. 1 = perfect ranking; 0.5 = coin flip. Very useful to compare models before choosing a threshold.
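AUC has a handy pairwise reading: pick one random positive and one random negative, and AUC is the probability the positive gets the higher score (ties count as half). A brute-force sketch under that interpretation, with illustrative scores:

```python
def auc(scores_pos, scores_neg):
    """Fraction of (positive, negative) pairs ranked correctly; ties = 0.5.
    Numerically equal to the area under the ROC curve."""
    wins = 0.0
    for sp in scores_pos:
        for sn in scores_neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# 3 positives, 3 negatives: only the (0.6, 0.7) pair is ranked wrong
print(round(auc([0.9, 0.8, 0.6], [0.7, 0.3, 0.2]), 3))  # 0.889 (8 of 9 pairs)
```

The O(P·N) double loop is fine for intuition; production code uses a rank-based formula instead.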
Don't fall for 99% accuracy — Imagine a credit-card fraud detector: 1 fraudulent transaction in 100,000. A model that does nothing and always says "all normal" still has 99.999% accuracy—but 0% recall (catches no fraud). You must open the confusion matrix and inspect precision and recall to see if the model is doing its job or gaming the numbers.
In practice, it's a fierce trade-off: which mistake can you live with? — The metric you bet on depends on the business.
* Recall (don't miss) is life: Cancer screening. Better to have healthy people get extra tests (FP) than to miss a real case (FN) and delay treatment.
* Precision (fewer false alarms) is life: Spam filter. Missing a few spams (FN) is fine—delete and move on. Misclassifying the boss's email as spam (FP) can be career-threatening.
Final pass/fail for AI services (binary classification) — COVID-19 positive/negative, YouTube harmful-video block/allow, bank loan approve/reject: before deployment, real-world projects draw the confusion matrix and review precision, recall, and F1.
Tuning alarm sensitivity (threshold tuning) — Models usually output a probability. "At what % do we sound the alarm?" Adjusting this threshold tailors the model to the business: e.g. lower threshold for maximum recall (security-critical), higher for maximum precision (when too many false alarms annoy users).
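Threshold tuning can be sketched by sweeping a cut-off over the model's probabilities and recomputing precision and recall at each setting (the probabilities and labels below are made up for illustration):

```python
def metrics_at_threshold(probs, actual, threshold):
    """Binarize probabilities at the threshold, then compute (precision, recall)."""
    pred = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for a, q in zip(actual, pred) if a == 1 and q == 1)
    fp = sum(1 for a, q in zip(actual, pred) if a == 0 and q == 1)
    fn = sum(1 for a, q in zip(actual, pred) if a == 1 and q == 0)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return prec, rec

probs  = [0.95, 0.80, 0.65, 0.40, 0.30, 0.10]
actual = [1,    1,    0,    1,    0,    0]

for t in (0.3, 0.5, 0.7):
    prec, rec = metrics_at_threshold(probs, actual, t)
    print(f"threshold {t}: precision {prec:.2f}, recall {rec:.2f}")
# threshold 0.3: precision 0.60, recall 1.00  (catch everything, many false alarms)
# threshold 0.7: precision 1.00, recall 0.67  (no false alarms, but wolves slip by)
```

Sweeping the threshold traces out the seesaw: lowering it buys recall at the cost of precision, and raising it does the reverse.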
Don't judge a classification model by correct count alone. Fill a confusion matrix (actual rows, predicted columns) with TP, TN, FP, FN. Accuracy = (TP+TN)/n. Precision = TP/(TP+FP). Recall = TP/(TP+FN). For imbalanced data, emphasize precision (fewer false alarms) or recall (fewer misses) by goal; F1 for balance. In practice, combine these for spam, diagnosis, fraud, and threshold choice.
Exact meaning of each (in words)
* TP: count where actual positive and predicted positive.
* TN: actual negative, predicted negative.
* FP: actual negative, predicted positive (false alarm).
* FN: actual positive, predicted negative (miss).
* Accuracy: fraction of all samples that are correct.
* Precision: of predicted positives, the fraction that are truly positive.
* Recall: of actual positives, the fraction the model caught.
* F1: harmonic mean of precision and recall.
* AUC: how well positives are ranked above negatives (0–1), independent of threshold.