Ch.10

Classification Metrics: The Model's Detailed Report Card

Learn the 'detailed report card' that a classification AI model receives after its test. Beyond "how many did you get right?" (accuracy), we look at confusion matrix concepts that ask "which questions did you get wrong, and how?" In business settings where *how* the model is wrong can be critical—spam filters, cancer diagnosis AI—we explain how precision, recall, and F1 prove the model's real capability, with intuitive analogies.


Fill the 2×2 confusion matrix with actual labels as rows and predicted labels as columns, then compute accuracy, precision, recall, and F1.

[Diagram: 2×2 confusion matrix. Rows are actual positive/negative; columns are predicted positive/negative. Cells: TP (actual positive, predicted positive ✓), FN (actual positive, predicted negative), FP (actual negative, predicted positive), TN (actual negative, predicted negative ✓).]

Read the model's report card via the confusion matrix and choose metrics that match your goal.

Classification metrics: confusion matrix and the model's report card

What is the confusion matrix? The AI's detailed report card — Just as a raw exam score doesn't tell you whether a student is strong in math or English, "how many correct" alone isn't enough to judge a classifier. The confusion matrix is a 2×2 table that compares the model's predictions (columns) with the actual answers (rows). By reading its four cells, you can see exactly what the model gets right and where it gets confused and stumbles.
The four cells: TP, TN, FP, FN — Think of the famous "boy who cried wolf." Here 'positive' means the boy cries wolf; 'negative' means peace.
* TP (True Positive): Wolf really came (1), boy cried wolf (1). Best outcome—village saved.
* TN (True Negative): No wolf (0), boy stayed quiet (0). Peace.
* FP (False Positive): No wolf (0), boy cried wolf (1). Villagers run out with pitchforks for nothing (false alarm).
* FN (False Negative): Wolf came (1), boy was asleep (0). Sheep get eaten—worst outcome (miss).
* Total count: n = TP + TN + FP + FN.
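The four counts can be tallied directly from a pair of label lists. A minimal sketch in plain Python (the eight days of wolf-story labels below are made up for illustration):

```python
def confusion_counts(actual, predicted):
    """Count the four confusion-matrix cells for binary labels (1 = positive)."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    return tp, tn, fp, fn

# 8 days: 1 = wolf came / boy cried wolf, 0 = peace / silence
actual    = [1, 0, 0, 1, 0, 0, 1, 0]
predicted = [1, 0, 1, 0, 0, 0, 1, 0]
tp, tn, fp, fn = confusion_counts(actual, predicted)
print(tp, tn, fp, fn)  # 2 4 1 1  (and n = 8)
```

Note that the four counts always sum to n, the total number of samples.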
Accuracy's dangerous trap — It is the fraction of correct answers: Accuracy = (TP + TN) / n. Intuitive but treacherous. Suppose 99 out of 100 days are peaceful and the wolf comes only once. A robot that closes its eyes and always says "No wolf!" still gets 99% accuracy. When positive cases are rare (imbalanced data), you must not trust accuracy alone.
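The trap is easy to demonstrate. A toy sketch of the "lazy" robot on the 99-peaceful-days scenario above:

```python
# 100 days, the wolf comes exactly once; the lazy model always predicts 0
actual = [1] + [0] * 99
lazy   = [0] * 100

accuracy = sum(a == p for a, p in zip(actual, lazy)) / len(actual)
print(accuracy)  # 0.99 -- looks great, yet the one real wolf was missed
```

Despite 99% accuracy, this model catches zero wolves: its recall is 0.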
Precision and recall: two rabbits to chase
* Precision (caution): "When I cried wolf, how often was it really the wolf?" The share of predicted positives that are truly positive. Precision = TP / (TP + FP). It goes up when you avoid false alarms (FP).
* Recall (sensitivity): "Of all the times the wolf actually came, how often did I notice and warn?" The share of actual positives that the model got right. Recall = TP / (TP + FN). It goes up when you miss fewer true wolves (FN).
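Both formulas reduce to one-line functions. A sketch with illustrative counts (8 true alarms, 2 false alarms, 4 missed wolves):

```python
def precision(tp, fp):
    """Of everything predicted positive, what fraction really was positive?"""
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    """Of everything actually positive, what fraction did the model catch?"""
    return tp / (tp + fn) if tp + fn else 0.0

print(precision(8, 2))          # 0.8    (8 of 10 alarms were real)
print(round(recall(8, 4), 3))   # 0.667  (8 of 12 wolves were caught)
```

The guard against a zero denominator matters in practice: a model that never predicts positive has no precision to speak of, and returning 0.0 is a common convention.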
F1 score: the golden balance of precision and recall — Precision and recall are like a seesaw: pushing one up often pushes the other down. F1 summarizes both in one number using the harmonic mean: F1 = 2·TP / (2·TP + FP + FN), which is equivalent to 2·Precision·Recall / (Precision + Recall). If either precision or recall is poor, F1 tanks. Use F1 when you want a model with good balance.
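The harmonic mean's punishing behavior is visible with two made-up pairs that share the same arithmetic average:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(f1(0.5, 0.5))  # 0.5  -- balanced pair: F1 equals the average
print(f1(0.9, 0.1))  # 0.18 -- same average, but the weak side drags F1 down
```

An arithmetic mean would score both models at 0.5; the harmonic mean exposes the lopsided one.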
AUC (Area Under the ROC Curve): the model's ranker — When the model outputs a probability (e.g. "90% chance of wolf") rather than a bare yes/no, AUC measures how well true positives get higher scores than true negatives (discriminative power), on a 0–1 scale. 1 = perfect ranking; 0.5 = coin flip. Very useful to compare models before choosing a threshold.
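AUC has a handy pairwise reading: pick one random positive and one random negative, and AUC is the probability the positive gets the higher score (ties count as half). A brute-force sketch under that interpretation, with illustrative scores:

```python
def auc(scores_pos, scores_neg):
    """Fraction of (positive, negative) pairs ranked correctly; ties = 0.5.
    Numerically equal to the area under the ROC curve."""
    wins = 0.0
    for sp in scores_pos:
        for sn in scores_neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# 3 positives, 3 negatives: only the (0.6, 0.7) pair is ranked wrong
print(round(auc([0.9, 0.8, 0.6], [0.7, 0.3, 0.2]), 3))  # 0.889 (8 of 9 pairs)
```

The O(P·N) double loop is fine for intuition; production code uses a rank-based formula instead.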
Don't fall for 99% accuracy — Imagine a credit-card fraud detector: 1 fraudulent transaction in 100,000. A model that does nothing and always says "all normal" still has 99.999% accuracy—but 0% recall (catches no fraud). You must open the confusion matrix and inspect precision and recall to see if the model is doing its job or gaming the numbers.
In practice, it's a fierce trade-off: which mistake can you live with? — The metric you bet on depends on the business.
* Recall (don't miss) is life: Cancer screening. Better to have healthy people get extra tests (FP) than to miss a real case (FN) and delay treatment.
* Precision (fewer false alarms) is life: Spam filter. Missing a few spams (FN) is fine—delete and move on. Misclassifying the boss's email as spam (FP) can be career-threatening.
Final pass/fail for AI services (binary classification) — COVID-19 positive/negative, YouTube harmful-video block/allow, bank loan approve/reject: before deployment, real-world projects draw the confusion matrix and review precision, recall, and F1.
Tuning alarm sensitivity (threshold tuning) — Models usually output a probability. "At what % do we sound the alarm?" Adjusting this threshold tailors the model to the business: e.g. lower threshold for maximum recall (security-critical), higher for maximum precision (when too many false alarms annoy users).
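Threshold tuning can be sketched by sweeping a cut-off over the model's probabilities and recomputing precision and recall at each setting (the probabilities and labels below are made up for illustration):

```python
def metrics_at_threshold(probs, actual, threshold):
    """Binarize probabilities at the threshold, then compute (precision, recall)."""
    pred = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for a, q in zip(actual, pred) if a == 1 and q == 1)
    fp = sum(1 for a, q in zip(actual, pred) if a == 0 and q == 1)
    fn = sum(1 for a, q in zip(actual, pred) if a == 1 and q == 0)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return prec, rec

probs  = [0.95, 0.80, 0.65, 0.40, 0.30, 0.10]
actual = [1,    1,    0,    1,    0,    0]

for t in (0.3, 0.5, 0.7):
    prec, rec = metrics_at_threshold(probs, actual, t)
    print(f"threshold {t}: precision {prec:.2f}, recall {rec:.2f}")
# threshold 0.3: precision 0.60, recall 1.00  (catch everything, many false alarms)
# threshold 0.7: precision 1.00, recall 0.67  (no false alarms, but wolves slip by)
```

Sweeping the threshold traces out the seesaw: lowering it buys recall at the cost of precision, and raising it does the reverse.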
Don't judge a classification model by correct count alone. Fill a confusion matrix (actual rows, predicted columns) with TP, TN, FP, FN. Accuracy = (TP+TN)/n. Precision = TP/(TP+FP). Recall = TP/(TP+FN). For imbalanced data, emphasize precision (fewer false alarms) or recall (fewer misses) by goal; F1 for balance. In practice, combine these for spam, diagnosis, fraud, and threshold choice.
Exact meaning of each (in words)
* TP: count where actual positive and predicted positive.
* TN: actual negative, predicted negative.
* FP: actual negative, predicted positive (false alarm).
* FN: actual positive, predicted negative (miss).
* Accuracy: fraction of all samples that are correct.
* Precision: of predicted positives, the fraction that are truly positive.
* Recall: of actual positives, the fraction the model caught.
* F1: harmonic mean of precision and recall.
* AUC: how well positives are ranked above negatives (0–1), independent of threshold.