Intermediate ML: Real-World Data Limits and Model Optimization
Building on basic ML—data, features, training, and evaluation—this chapter introduces working with messy real-world tables and refining models in practice.
What you learn in Ch01–Ch20
Intermediate ML connects real-world preprocessing with model and hyperparameter tuning. You handle scaling, encoding, missing data, and imbalance, then SVM, PCA, boosting, and clustering, and finally pipelines with grid, random, and Bayesian (Optuna) search.
- Ch.01 Data Scaling and Distribution Transformation
- Ch.02 Categorical Encoding
- Ch.03 Missing Data and Imputation
- Ch.04 Imbalanced Data Basics
- Ch.05 Advanced Cross Validation
- Ch.06 Multiclass Evaluation and ROC-AUC
- Ch.07 SVM Basics: Decision Boundary and Margin
- Ch.08 Kernel Trick: Nonlinear SVM
- Ch.09 Dimensionality Reduction 1: PCA
- Ch.10 Ensemble: Bagging and Pasting
- Ch.11 Boosting Basics: AdaBoost
- Ch.12 Gradient Boosting Machine (GBM)
- Ch.13 Density-Based Clustering (DBSCAN)
- Ch.14 Hierarchical Clustering and Dendrogram
- Ch.15 Gaussian Mixture Model (GMM)
- Ch.16 Anomaly Detection Basics
- Ch.17 Pipeline: Modeling Automation
- Ch.18 Hyperparameter Tuning 1: Grid and Random Search
- Ch.19 Hyperparameter Tuning 2: Bayesian Optimization (Optuna)
- Ch.20 Intermediate ML Summary
Real-world data, preprocessing, and tuning
Real-world data is not a practice CSV — Tables in basic courses are often tidy. In production you see missing cells, text categories such as region or gender, and numeric features on different scales. Positive labels can be rare, as in fraud detection. Models still consume numeric matrices and labels, so the first job is to turn messy tables into feature vectors.
Preprocessing prepares data for the model — Scaling aligns units, encoding turns text into numbers, and imputation fills gaps. Resampling can rebalance skewed classes. What basic Ch.00 called "choosing good features" becomes a repeatable set of steps in real projects.
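The three steps named above can be sketched on a tiny table using only the Python standard library. The column names, values, and the choice of median imputation are assumptions made for illustration. In a real project the fill value and scaling statistics would be computed on the training split only.

```python
from statistics import mean, median, pstdev

# Hypothetical toy table: "age" has a missing cell, "region" is a text category.
rows = [
    {"age": 25.0, "region": "north"},
    {"age": None, "region": "south"},
    {"age": 35.0, "region": "north"},
    {"age": 45.0, "region": "east"},
]

# Imputation: fill the missing age with the median of the observed ages.
observed = [r["age"] for r in rows if r["age"] is not None]
fill_value = median(observed)
ages = [r["age"] if r["age"] is not None else fill_value for r in rows]

# Scaling: standardize to zero mean and unit (population) standard deviation.
mu, sigma = mean(ages), pstdev(ages)
scaled_ages = [(a - mu) / sigma for a in ages]

# Encoding: one-hot encode region using a fixed, sorted category order.
categories = sorted({r["region"] for r in rows})
one_hot = [[1.0 if r["region"] == c else 0.0 for c in categories] for r in rows]

# Feature matrix: one all-numeric vector per row, ready for a model.
X = [[s] + oh for s, oh in zip(scaled_ages, one_hot)]

for vector in X:
    print([round(v, 3) for v in vector])
```

Each row ends up as four numbers: the scaled age followed by one indicator per region.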
Tuning and pipelines stabilize experiments — Values learned during training (weights, tree splits) differ from values you set in advance (tree depth, SVM settings, etc.). The latter are hyperparameters. A pipeline chains preprocessing and training so new data is handled in the same order every time.
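A minimal sketch of that distinction, assuming scikit-learn is installed. The text does not name a library, so scikit-learn is a choice made here, and the synthetic dataset, the step names "scale" and "svm", and the value C=1.0 are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data stands in for a real table (assumption for illustration).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Hyperparameters are set before training: here C and the kernel type.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("svm", SVC(C=1.0, kernel="rbf")),
])

# fit() learns the scaler statistics and the SVM support vectors from the data.
pipe.fit(X, y)

chosen_C = pipe.get_params()["svm__C"]                 # set in advance
learned_means = pipe.named_steps["scale"].mean_        # learned during fit
n_support = pipe.named_steps["svm"].support_vectors_.shape[0]  # learned during fit

# predict() replays scaling and classification in the same order on new rows.
preds = pipe.predict(X[:5])

print("C (hyperparameter):", chosen_C)
print("feature means (learned):", learned_means.round(3))
print("support vectors (learned):", n_support)
```

Because the scaler lives inside the pipeline, any later call to `predict` reuses the statistics learned at fit time rather than recomputing them on the new rows.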
Why it matters
Data quality and scale shape performance — Biased data or one dominant feature scale can make a model look strong in validation yet fail in production. Distance-based models such as KNN and SVM change their notion of "close" when feature scales change. Normalization from basic KNN becomes a daily habit here.
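The effect of scale on "closeness" can be seen with three made-up rows. The numbers and the per-feature spread used for rescaling are assumptions chosen to make the effect visible, not values from any dataset.

```python
import math

# Hypothetical rows: (annual income in dollars, age in years).
a = (50_000.0, 25.0)
b = (50_500.0, 60.0)  # close to a in income, far in age
c = (52_000.0, 26.0)  # close to a in age, farther in income


def euclidean(p, q):
    """Plain Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))


# Raw units: the income column dominates the distance.
raw_ab, raw_ac = euclidean(a, b), euclidean(a, c)

# Assumed typical spread per feature, used only to make the columns comparable.
spread = (10_000.0, 10.0)


def rescale(p):
    """Divide each feature by its assumed spread."""
    return tuple(pi / si for pi, si in zip(p, spread))


scaled_ab = euclidean(rescale(a), rescale(b))
scaled_ac = euclidean(rescale(a), rescale(c))

print(f"raw:    d(a,b)={raw_ab:.1f}  d(a,c)={raw_ac:.1f}")
print(f"scaled: d(a,b)={scaled_ab:.2f}  d(a,c)={scaled_ac:.2f}")
```

In raw units b is the nearer neighbor of a, because a 35-year age gap is tiny next to income differences in dollars. After rescaling, c becomes the nearer neighbor.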
Leakage inflates scores — If test information enters training or preprocessing, validation looks great while live performance drops. Fitting a scaler on all data before cross-validation is the same trap. Split first, fit statistics on training only, then transform validation and test with those statistics.
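The split-first rule can be illustrated with a single numeric feature and the standard library. The values and the 4/2 split are made up for the sketch. The "leaky" statistics are computed only to show how far they differ from the training-only ones.

```python
from statistics import mean, pstdev

# Hypothetical feature values; the last one is extreme and lands in the test split.
values = [1.0, 2.0, 3.0, 4.0, 5.0, 100.0]

# Split first, before computing any statistics.
train, test = values[:4], values[4:]

# Correct: fit the scaling statistics on the training split only...
mu_train, sd_train = mean(train), pstdev(train)

# ...then transform both splits with those same statistics.
train_scaled = [(v - mu_train) / sd_train for v in train]
test_scaled = [(v - mu_train) / sd_train for v in test]

# Leaky: statistics computed on all rows absorb information from the test split.
mu_all, sd_all = mean(values), pstdev(values)

print(f"train-only stats: mean={mu_train:.2f}, std={sd_train:.2f}")
print(f"all-data stats:   mean={mu_all:.2f}, std={sd_all:.2f}")
print("scaled test values:", [round(v, 2) for v in test_scaled])
```

The extreme test value shifts the all-data mean and standard deviation substantially, so a model trained on features scaled that way has already seen something about the test set.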
Imbalance and metrics go together — Accuracy alone can stay high when the model always predicts the majority class. For rare events you also need precision, recall, and ROC-AUC. Hyperparameter tuning is also about balancing overfitting and underfitting for better generalization.
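A small worked example of the majority-class trap, with made-up labels (2 positives out of 100) and a stand-in "model" that always predicts the majority class. Setting precision and recall to 0.0 when their denominator is zero is a convention assumed here.

```python
# Hypothetical labels: 1 = fraud (rare), 0 = normal.
y_true = [1, 1] + [0] * 98

# A stand-in "model" that always predicts the majority class.
y_pred = [0] * 100

# Confusion-matrix counts.
pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)
fp = sum(1 for t, p in pairs if t == 0 and p == 1)
fn = sum(1 for t, p in pairs if t == 1 and p == 0)
tn = sum(1 for t, p in pairs if t == 0 and p == 0)

accuracy = (tp + tn) / len(y_true)
# Convention assumed here: report 0.0 when the denominator is zero.
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

Accuracy comes out at 0.98 even though the model catches none of the fraud cases, which recall exposes immediately.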
How it is used
Order matters in practice — Explore the data, split into train, validation, and test, fit preprocessors on training only, train the model, tune hyperparameters against validation, and report on the held-out test set at the end. That sequence keeps evaluation closer to real generalization.
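The sequence above can be sketched end to end, assuming scikit-learn is installed. The synthetic data, the 60/20/20 split, the SVM model, and the three candidate values of C are illustrative assumptions, not a recommended recipe.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data stands in for an explored, cleaned table (assumption).
X, y = make_classification(n_samples=600, n_features=8, random_state=0)

# 1. Split into train (60%), validation (20%), and test (20%).
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.25, stratify=y_tmp, random_state=0
)

# 2-4. For each candidate C, fit scaler and model on training rows only,
#      then score on the validation rows and keep the best candidate.
best_C, best_val_acc, best_model = None, -1.0, None
for C in (0.1, 1.0, 10.0):
    model = make_pipeline(StandardScaler(), SVC(C=C))
    model.fit(X_train, y_train)
    val_acc = model.score(X_val, y_val)
    if val_acc > best_val_acc:
        best_C, best_val_acc, best_model = C, val_acc, model

# 5. Touch the held-out test set once, at the very end.
test_acc = best_model.score(X_test, y_test)

print(f"best C={best_C}, validation accuracy={best_val_acc:.3f}")
print(f"test accuracy={test_acc:.3f}")
```

Keeping the scaler inside the pipeline means its statistics are refit on the training rows for every candidate, so neither validation nor test rows influence preprocessing.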
How this course is organized — Early chapters cover scaling, encoding, missing data, imbalance, cross-validation, and multiclass metrics. The middle adds SVM, PCA, ensembles, clustering, and anomaly detection. Later chapters cover pipelines and grid, random, and Bayesian search. Preview each title in the roadmap below.
It extends basic ML — If you already studied data and features, missing values, and cross-validation, intermediate ML applies the same ideas to one realistic table. The goal is not a formula list but a calm understanding of why cleaning matters, where metrics mislead, and how to run sound experiments.