Ch.01
Missing Value Handling: Strategies to Fill Data Gaps
Real-world data often has missing values—empty cells like in a spreadsheet. Ignoring them can halt training or yield biased results. This chapter walks through filling those gaps, screening extreme values (outliers), and correcting skewed class ratios (class imbalance)—a practical data quality pipeline that underpins reliable machine learning.
Missing Value Handling: preprocessing that reduces gaps and raises trust
What is a missing value? An empty cell in a data table—like a puzzle with a missing piece. In practice, missing values come from skipped survey answers, sensor failures, data transfer loss, and more.
Missingness mechanisms (MCAR/MAR/MNAR) ask *why* the blank appeared. MCAR (*Missing Completely at Random*) is like coffee spilled on a form—chance alone. MAR (*Missing at Random*) is like male respondents leaving “cosmetics spend” empty—linked to *other* observed variables. MNAR (*Missing Not at Random*) is like low-income people leaving “income” blank—the missingness itself carries meaning.
Handling strategies fall into three broad types: listwise deletion, single imputation (fill with one value), and multiple imputation (fill several times and pool). Each trades off how much data you keep, speed, and statistical rigor—pick to fit the situation.
Single vs multiple imputation: Single imputation fills each gap once with e.g. the mean or mode—fast but risky. Multiple imputation builds several plausible completed datasets (parallel “worlds”) and pools results for a more careful conclusion.
Two views on outliers: Univariate detection (box plot) flags extreme values in one variable; multivariate detection (Mahalanobis / Isolation Forest / SVDD) flags odd *combinations* across variables. They answer different questions—in practice you often check both.
Class imbalance correction: When one class dominates, models may behave as if the rare class barely exists. Practitioners combine Tomek Links (boundary cleaning), SMOTE/ADASYN (synthetic minority samples), and SMOTE+Tomek (synthesize then clean).
Core message: Missing-value handling is not a standalone trick—it is one pipeline design problem tied to outlier checks and imbalance correction.
Common Single-Imputation Values/Methods
A compact table of common single-imputation methods with definitions and formulas.
| Value/Method | Definition (short formula) |
|---|---|
| Mean | Impute with sample mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ |
| Median | Impute with median: $\tilde{x} = \mathrm{median}(x_1, \dots, x_n)$ |
| Mode | Impute with most frequent value: $\mathrm{mode}(x) = \arg\max_{v} \#\{i : x_i = v\}$ |
| Regression · KNN · Hot-deck | Regression: $\hat{x} = \hat{\beta}_0 + \hat{\boldsymbol{\beta}}^{\top}\mathbf{z}$; KNN: $\hat{x} = \frac{1}{k}\sum_{j \in N_k} x_j$; Hot-deck: copy the value from a similar (donor) record |
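A minimal sketch of the first three single-imputation methods with NumPy; the feature values here are made up so that mean, median, and mode fills differ:

```python
import numpy as np

x = np.array([2.0, 4.0, np.nan, 10.0, np.nan, 4.0])  # one feature with gaps

# Mean imputation: fill gaps with the mean of the observed values
mean_filled = np.where(np.isnan(x), np.nanmean(x), x)

# Median imputation: fill gaps with the median of the observed values
median_filled = np.where(np.isnan(x), np.nanmedian(x), x)

# Mode imputation: fill gaps with the most frequent observed value
vals, counts = np.unique(x[~np.isnan(x)], return_counts=True)
mode = vals[np.argmax(counts)]
mode_filled = np.where(np.isnan(x), mode, x)
```

On this toy feature the observed values are {2, 4, 10, 4}, so the mean fill (5.0) differs from the median and mode fills (both 4.0)—a small illustration of how the choice of statistic changes the completed data.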
Intuition
Systems hate blanks. If you leave gaps, the pipeline may error—like an OMR sheet that cannot be scored without marks.
Bad fills mislead. Filling everything with 0 or the mean breaks the true distribution; the model may treat imputed values as real and become overconfident.
Preprocessing is a set menu. Filling missing values is not the end—you should plan outlier screening and imbalance handling in the same breath so the model behaves in production.
Fairness and safety: If missingness differs by group (MAR/MNAR), careless imputation can widen performance gaps between groups—check bias signals early.
Preprocessing often matters more than model choice: with the same algorithm, better preprocessing can change outcomes more than swapping models; in practice, "good data flow" frequently wins over "good model name."
Deployment stability: If you define rules for missingness, outliers, and imbalance up front, new data can be handled consistently—retraining and monitoring get easier.
Math
End-to-end flow: EDA → hypothesize why values are missing → choose imputation → catch extremes (outlier detection, e.g. box plot) → adjust class mix (imbalance correction, e.g. SMOTE) → then train and evaluate.
Single-imputation formulas: Mean fill replaces each missing $x_{ij}$ with the column mean $\bar{x}_j = \frac{1}{n_j}\sum_{i \in \text{obs}} x_{ij}$; median fill uses $\tilde{x}_j = \mathrm{median}\{x_{ij} : i \in \text{obs}\}$.
Multiple imputation: Build $m$ completed datasets ("parallel worlds"), estimate $\hat{\theta}^{(k)}$ in each, then pool: $\bar{\theta} = \frac{1}{m}\sum_{k=1}^{m} \hat{\theta}^{(k)}$ (Rubin's rules also combine within- and between-imputation variance).
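A toy version of the pooling idea: fill each gap several times by drawing from the observed values, estimate the mean in each completed dataset, then average the estimates. (Real multiple imputation draws from a fitted model rather than resampling observed values; this sketch only shows the pooling step.)

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.array([1.0, 2.0, np.nan, 4.0, np.nan])
observed = x[~np.isnan(x)]

m = 100  # number of "parallel world" completions
estimates = []
for _ in range(m):
    completed = x.copy()
    # Crude draw: sample fills from the observed values
    completed[np.isnan(x)] = rng.choice(observed, size=np.isnan(x).sum())
    estimates.append(completed.mean())  # per-world estimate of the mean

pooled = float(np.mean(estimates))  # pooled estimate across the m worlds
```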
Box plot (IQR) rule: With $\mathrm{IQR} = Q_3 - Q_1$, the fences run from $Q_1 - 1.5\,\mathrm{IQR}$ to $Q_3 + 1.5\,\mathrm{IQR}$; points outside are outlier *candidates*.
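The fence rule in a few lines of NumPy, on made-up data with one obvious extreme value:

```python
import numpy as np

x = np.array([1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 100.0])

q1, q3 = np.percentile(x, [25, 75])   # quartiles
iqr = q3 - q1                          # interquartile range
lower = q1 - 1.5 * iqr                 # lower fence
upper = q3 + 1.5 * iqr                 # upper fence

# Points outside the fences are outlier *candidates*, not verdicts
candidates = x[(x < lower) | (x > upper)]
```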
Covariance: Measures how two variables move together—e.g. do taller people tend to weigh more? $\mathrm{Cov}(X, Y) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$. Stacking the pairwise covariances yields the covariance matrix $\Sigma$, which sets the orientation and stretch of the multivariate "cloud" (ellipses).
Mahalanobis distance: Not plain Euclidean distance—it uses $\Sigma^{-1}$ to weight directions by spread: $D_M(\mathbf{x}) = \sqrt{(\mathbf{x} - \boldsymbol{\mu})^{\top}\Sigma^{-1}(\mathbf{x} - \boldsymbol{\mu})}$ (covariance is central).
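A small sketch of why the covariance matters, on synthetic correlated 2-D data: a point that follows the correlation pattern gets a smaller Mahalanobis distance than one equally far away in Euclidean terms but *against* the correlation.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic 2-D data with strong positive correlation (0.8)
X = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.8], [0.8, 1.0]], size=500)

mu = X.mean(axis=0)
cov = np.cov(X, rowvar=False)
cov_inv = np.linalg.inv(cov)

def mahalanobis(x):
    """Distance of x from the cloud, weighted by the inverse covariance."""
    d = x - mu
    return float(np.sqrt(d @ cov_inv @ d))

# Same Euclidean distance from the center, very different Mahalanobis distance:
on_axis = mahalanobis(np.array([2.0, 2.0]))    # follows the correlation
off_axis = mahalanobis(np.array([2.0, -2.0]))  # violates the correlation
```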
Isolation Forest: Outliers are points that become isolated quickly under random splits—few splits needed to separate them (short path length), often in high dimensions with weak distributional assumptions.
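The "few splits isolate an outlier" intuition can be shown with a toy 1-D version (not the real Isolation Forest algorithm, which builds an ensemble of trees over subsamples): repeatedly pick a random cut point and count how many cuts it takes until a target value sits alone in its interval.

```python
import numpy as np

rng = np.random.default_rng(0)

def path_length(data, target):
    """Random splits until `target` is alone in its interval (toy isolation)."""
    pts = data.copy()
    splits = 0
    while len(pts) > 1:
        cut = rng.uniform(pts.min(), pts.max())
        # Keep whichever side of the cut contains the target
        pts = pts[pts <= cut] if target <= cut else pts[pts > cut]
        splits += 1
    return splits

data = np.append(rng.normal(0.0, 1.0, 200), 15.0)  # 200 inliers + one outlier

def mean_path(value, reps=50):
    return float(np.mean([path_length(data, value) for _ in range(reps)]))

outlier_len = mean_path(15.0)           # isolates quickly (short path)
inlier_len = mean_path(float(data[0]))  # buried in the bulk (long path)
```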
SVDD (one-class): Learn a boundary around normal data (minimum-volume sphere or kernel-shaped region) and flag points outside as outliers—common in one-class anomaly detection.
Class imbalance: With a very rare positive class, accuracy can look high while the model ignores positives—use Recall, Precision, F1, PR-AUC together and resample when needed.
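A tiny worked example of the accuracy trap, with made-up labels: 5% positives, and a degenerate model that always predicts negative.

```python
y_true = [1] * 5 + [0] * 95   # 5% positive class
y_pred = [0] * 100             # model that always predicts "negative"

# Accuracy looks great...
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# ...but recall on the positive class is zero
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
recall = tp / (tp + fn)
```

Accuracy comes out at 0.95 while recall is 0.0—exactly the failure mode that Recall, Precision, F1, and PR-AUC are meant to expose.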
Tomek Links: Pairs of opposite-class mutual nearest neighbors near the boundary—often remove the majority point (or both) to clean overlap (undersampling-based cleaning).
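A minimal Tomek-link finder on made-up 1-D data: a pair is a link when the two points belong to opposite classes and each is the other's nearest neighbor.

```python
import numpy as np

X = np.array([[0.0], [0.1], [1.0], [1.05], [3.0]])  # toy feature values
y = np.array([0, 0, 0, 1, 1])                        # toy class labels

def nearest(i):
    """Index of the nearest neighbor of point i (excluding itself)."""
    d = np.abs(X - X[i]).sum(axis=1)
    d[i] = np.inf
    return int(np.argmin(d))

# Opposite-class mutual nearest neighbors = Tomek links
tomek_links = [(i, j) for i in range(len(X))
               for j in range(i + 1, len(X))
               if y[i] != y[j] and nearest(i) == j and nearest(j) == i]
```

Here points 2 (class 0, at 1.0) and 3 (class 1, at 1.05) sit right at the class overlap and form the only link; the cleaning step would then remove the majority-class point (or both).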
SMOTE: Interpolate between a minority point $x_i$ and a minority-class neighbor $x_{nn}$: $x_{\text{new}} = x_i + \lambda\,(x_{nn} - x_i)$, $\lambda \sim U(0, 1)$—richer than copy-paste but can add bad samples if the boundary is noisy.
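One SMOTE interpolation step in NumPy (the two points here are made up; a full implementation would first find the k nearest minority-class neighbors):

```python
import numpy as np

rng = np.random.default_rng(42)

x_i = np.array([1.0, 2.0])   # a minority-class point
x_nn = np.array([3.0, 4.0])  # one of its minority-class neighbors

lam = rng.uniform(0.0, 1.0)          # lambda ~ U(0, 1)
x_new = x_i + lam * (x_nn - x_i)     # synthetic sample on the segment
```

The synthetic point always lands on the line segment between the two parents, which is why a noisy boundary can place it on the wrong side of the true class border.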
Hybrid resampling (e.g. SMOTE+Tomek): Oversample the minority with SMOTE, then clean ambiguous boundary pairs with Tomek—think oversample → clean.