Intermediate ML: Real-World Data Limits and Model Optimization

Building on basic ML (data, features, training, and evaluation), this chapter introduces working with messy real-world tables and refining models in practice.

What you learn in Ch01–Ch20

Intermediate ML connects real-world preprocessing with model and hyperparameter tuning. You handle scaling, encoding, missing data, and imbalance, then SVM, PCA, boosting, and clustering, and finally pipelines with grid, random, and Bayesian (Optuna) search.

Ch.01 Data Scaling and Distribution Transformation
Ch.02 Categorical Encoding
Ch.03 Missing Data and Imputation
Ch.04 Imbalanced Data Basics
Ch.05 Advanced Cross Validation
Ch.06 Multiclass Evaluation and ROC-AUC
Ch.07 SVM Basics: Decision Boundary and Margin
Ch.08 Kernel Trick: Nonlinear SVM
Ch.09 Dimensionality Reduction 1: PCA
Ch.10 Ensemble: Bagging and Pasting
Ch.11 Boosting Basics: AdaBoost
Ch.12 Gradient Boosting Machine (GBM)
Ch.13 Density-Based Clustering (DBSCAN)
Ch.14 Hierarchical Clustering and Dendrogram
Ch.15 Gaussian Mixture Model (GMM)
Ch.16 Anomaly Detection Basics
Ch.17 Pipeline: Modeling Automation
Ch.18 Hyperparameter Tuning 1: Grid and Random Search
Ch.19 Hyperparameter Tuning 2: Bayesian Optimization (Optuna)
Ch.20 Intermediate ML Summary

How it is used

Order matters in practice: Explore the data, split it into train, validation, and test sets, fit preprocessors on the training set only, train the model, tune hyperparameters against the validation set, and report on the held-out test set at the end. That sequence keeps evaluation closer to real generalization.

How this course is organized: Early chapters cover scaling, encoding, missing data, imbalance, cross-validation, and multiclass metrics. The middle adds SVM, PCA, ensembles, clustering, and anomaly detection. Later chapters cover pipelines and grid, random, and Bayesian search. Preview each title in the chapter roadmap.

It extends basic ML: If you already studied data and features, missing values, and cross-validation, intermediate ML applies the same ideas to one realistic table. The goal is not a formula list but a calm understanding of why cleaning matters, where metrics mislead, and how to run sound experiments.
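The split-then-fit order described under "Order matters in practice" can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the course's own code: the `incomes` column, the helper names (`split_indices`, `fit_standardizer`, `standardize`), and the 60/20/20 split are all assumptions made for the example.

```python
import random
import statistics


def split_indices(n, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle row indices once, then cut them into train/validation/test."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = idx[:n_test]
    val = idx[n_test:n_test + n_val]
    train = idx[n_test + n_val:]
    return train, val, test


def fit_standardizer(values):
    """Learn mean and standard deviation from the given (training) values."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values) or 1.0  # avoid dividing by zero on a constant column
    return mean, std


def standardize(values, mean, std):
    """Apply statistics that were learned elsewhere; nothing is re-fitted here."""
    return [(v - mean) / std for v in values]


# Hypothetical single numeric feature with one large outlier.
incomes = [31.0, 45.0, 52.0, 38.0, 120.0, 47.0, 41.0, 60.0, 35.0, 55.0]

train_idx, val_idx, test_idx = split_indices(len(incomes))
train = [incomes[i] for i in train_idx]
val = [incomes[i] for i in val_idx]
test = [incomes[i] for i in test_idx]

# Fit on the training split only, then reuse the same statistics everywhere.
mean, std = fit_standardizer(train)
train_s = standardize(train, mean, std)
val_s = standardize(val, mean, std)
test_s = standardize(test, mean, std)
```

Because `mean` and `std` never see the validation or test rows, scores measured on those splits are not influenced by information leaked from them.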

Real-world data, preprocessing, and tuning

Real-world data is not a practice CSV: Tables in basic courses are often tidy. In production you see missing cells, text categories such as region or gender, and numeric features on different scales. Labels can be rare, as in fraud detection. Models still consume matrices $\mathbf{X}$ and labels $\mathbf{y}$, so the first job is to turn messy tables into feature vectors.

Preprocessing prepares data for the model: Scaling aligns units, encoding turns text into numbers, and imputation fills gaps. Resampling can rebalance skewed classes. What Ch.00 of the basic course called "choosing good features" becomes a repeatable set of steps in real projects.
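As a rough illustration of two of these steps, the sketch below imputes a missing number with the training median and one-hot encodes a text category, using only the standard library. The rows, the column names (`age`, `region`, `fraud`), and the helper names are hypothetical; scaling and resampling are left out to keep the example short.

```python
import statistics

# Hypothetical training rows; None marks a missing cell.
train_rows = [
    {"age": 34, "region": "north", "fraud": 0},
    {"age": None, "region": "south", "fraud": 0},
    {"age": 51, "region": "north", "fraud": 1},
    {"age": 29, "region": "east", "fraud": 0},
]


def fit_preprocessor(rows):
    """Learn the imputation value and the category list from training rows."""
    ages = [r["age"] for r in rows if r["age"] is not None]
    return {
        "age_median": statistics.median(ages),
        "regions": sorted({r["region"] for r in rows}),
    }


def transform(rows, state):
    """Turn rows into numeric feature vectors X and a label list y."""
    X, y = [], []
    for r in rows:
        # Imputation: replace a missing age with the training median.
        age = r["age"] if r["age"] is not None else state["age_median"]
        # One-hot encoding: a category unseen in training becomes all zeros.
        one_hot = [1.0 if r["region"] == c else 0.0 for c in state["regions"]]
        X.append([float(age)] + one_hot)
        y.append(r["fraud"])
    return X, y


state = fit_preprocessor(train_rows)
X_train, y_train = transform(train_rows, state)

# A new row is handled with the state learned from training data.
new_rows = [{"age": None, "region": "west", "fraud": 0}]
X_new, y_new = transform(new_rows, state)
```

Each row of `X_train` has one age column followed by one column per region seen in training, so every row has the same length regardless of how messy the input was.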

Tuning and pipelines stabilize experiments: Values that change during training (weights, tree splits) differ from values you set in advance (tree depth, SVM $C$, etc.). The latter are hyperparameters. A pipeline chains preprocessing and training so new data is handled in the same order every time.
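To make the distinction concrete, here is a small standard-library sketch of a pipeline that chains a scaler and a k-nearest-neighbors classifier, then picks the hyperparameter `k` on a validation set. The classes `Standardizer`, `KNearest`, and `Pipeline` are hypothetical stand-ins written for this example, and the (age, income) data is made up; libraries such as scikit-learn provide production versions of the same idea.

```python
import math
import statistics


class Standardizer:
    def fit(self, X):
        # Learned from data: one mean and one standard deviation per column.
        cols = list(zip(*X))
        self.means = [statistics.fmean(c) for c in cols]
        self.stds = [statistics.pstdev(c) or 1.0 for c in cols]
        return self

    def transform(self, X):
        return [[(v - m) / s for v, m, s in zip(row, self.means, self.stds)]
                for row in X]


class KNearest:
    def __init__(self, k):
        self.k = k  # hyperparameter: set before training, not learned

    def fit(self, X, y):
        self.X, self.y = X, y
        return self

    def predict(self, X):
        preds = []
        for q in X:
            order = sorted(range(len(self.X)),
                           key=lambda i: math.dist(q, self.X[i]))
            votes = [self.y[i] for i in order[:self.k]]
            preds.append(max(set(votes), key=votes.count))  # majority vote
        return preds


class Pipeline:
    """Apply the same preprocessing, in the same order, at fit and predict time."""

    def __init__(self, scaler, model):
        self.scaler, self.model = scaler, model

    def fit(self, X, y):
        self.model.fit(self.scaler.fit(X).transform(X), y)
        return self

    def predict(self, X):
        return self.model.predict(self.scaler.transform(X))


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical (age, income) rows on very different scales.
X_train = [[25, 20000], [30, 22000], [35, 25000], [28, 21000],
           [50, 80000], [55, 90000], [60, 85000], [52, 95000]]
y_train = [0, 0, 0, 0, 1, 1, 1, 1]
X_val = [[27, 23000], [58, 88000], [33, 24000], [51, 82000]]
y_val = [0, 1, 0, 1]

# Try each candidate k on the validation set and keep the best one.
scores = {}
for k in (1, 3, 5):
    pipe = Pipeline(Standardizer(), KNearest(k)).fit(X_train, y_train)
    scores[k] = accuracy(y_val, pipe.predict(X_val))
best_k = max(scores, key=scores.get)
```

The scaler's means and standard deviations are recomputed inside every `fit` call, while `k` stays fixed for each run. That is the split between learned values and hyperparameters, and the loop over candidate values is a tiny version of grid search.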