Ch.07
XGBoost, LightGBM, CatBoost: Tabular ML Powerhouses
When you work with spreadsheet-like tabular data, a family of models often beats even heavy deep learning: gradient boosting. Boosting lines up many "average students" (weak learners) in order; each one studies the mistakes the previous models still make, until the team acts like a single strong predictor.
This chapter dissects XGBoost, LightGBM, and CatBoost—the trio behind countless production systems and Kaggle solutions—and gives you clear rules for which tool fits your dataset.
[Diagram: CH07 — The boosting trio: mastering residuals one tree at a time. The three libraries grow trees differently: XGBoost level-wise, LightGBM leaf-wise, and CatBoost symmetric (oblivious).]
1. Core idea: sequential error notebooks
Concept: Boosting chains decision trees in sequence. Each new tree focuses on the residuals (errors) left by the ensemble so far.
Intuition: Picture a study group before an exam. Student 1 takes a practice test and writes an error notebook of mistakes. Student 2 drills only those questions. Student 3 fixes what student 2 still misses. Repeat many rounds and the group’s combined score skyrockets.
Key update: `F_m(x) = F_{m-1}(x) + η · h_m(x)`
- `F_m(x)`: prediction after stage `m`
- `F_{m-1}(x)`: prediction before adding the latest tree
- `h_m(x)`: new tree trained to reduce the remaining error
- `η`: learning rate—how aggressively you trust the new tree (smaller often means you need more trees but can be more stable)
Practice: Loan default, churn, CTR, and many other row-and-column tasks still treat boosting as a top-tier baseline.
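The error-notebook loop above can be sketched from scratch. This is a hypothetical toy, not any library's implementation: depth-1 "stumps" as weak learners, plain squared error, and no regularization; real libraries add histograms, sampling, and second-order statistics on top of this exact loop.

```python
# Toy gradient boosting for regression with depth-1 "stumps".
def fit_stump(x, residuals):
    """Find the single threshold split that best reduces squared error."""
    best = None
    for t in sorted(set(x)):
        left = [r for xi, r in zip(x, residuals) if xi <= t]
        right = [r for xi, r in zip(x, residuals) if xi > t]
        if not left or not right:
            continue
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, lmean, rmean)
    _, t, lmean, rmean = best
    return lambda xi: lmean if xi <= t else rmean

def boost(x, y, n_rounds=20, learning_rate=0.3):
    pred = [0.0] * len(x)
    trees = []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]  # the "error notebook"
        tree = fit_stump(x, residuals)                    # weak learner studies it
        pred = [pi + learning_rate * tree(xi) for pi, xi in zip(pred, x)]
        trees.append(tree)
    return trees, pred

# Made-up 1-D dataset: y roughly tracks x
x = [1, 2, 3, 4, 5, 6]
y = [1.2, 1.9, 3.1, 3.9, 5.2, 5.8]
_, pred = boost(x, y)
mse = sum((yi - pi) ** 2 for yi, pi in zip(y, pred)) / len(y)
```

Each round only sees what is *still* wrong, which is why the ensemble's error keeps shrinking even though every individual stump is weak.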
2. XGBoost: stable, regularized workhorse
Concept: The library that popularized modern gradient boosting. It optimizes loss while penalizing overly complex trees through built-in regularization terms, which tends to make training predictable and robust.
Intuition: A strict teacher who cares about progress and about stopping you from "memorizing" the textbook—penalties kick in when the model gets too wiggly (overfitting).
3. LightGBM: speed for huge datasets
Concept: Built for scale when millions of rows made classic boosting slow. It uses histogram-based binning to cut computation and usually grows trees leaf-wise—splitting the leaf that most reduces loss—instead of expanding an entire level at a time (level-wise).
Intuition: Like skipping chapters you already know and camping on the one chapter most likely to be on the exam: maximum efficiency, but you can over-drill one corner of the space.
Caveat: Leaf-wise trees overfit more easily on small data. Tune `max_depth`, `min_data_in_leaf`, and related knobs.
4. CatBoost: categoricals without the headache
Concept: From Yandex—the name merges Category and Boost. It is strong on high-cardinality categorical features (city, job title, product ID) with less manual encoding drama.
Intuition: In tabular ML you must avoid target leakage (when an encoding accidentally peeks at the target it is supposed to predict). CatBoost’s design (including ordered / permutation-based encoding ideas) aims to process categories while reducing leakage risk. Defaults often work surprisingly well out of the box.
5. Reading the formulas easily (symbols + mini examples)
These are the 3 equations you see most in boosting/XGBoost. For each one, read in this order: (a) what each symbol means, then (b) a tiny numeric example.
(1) Additive prediction update: `F_m(x) = F_{m-1}(x) + η · h_m(x)`
- `F_{m-1}(x)`: prediction before adding the new tree
- `h_m(x)`: correction suggested by the newly added tree
- `η`: learning rate (how strongly we trust that correction)
Interpretation: We keep the old prediction and add a scaled correction, so error shrinks stage by stage.
Mini example: if the old prediction is 10, the new tree's output is +4, and `η = 0.5`, then
`F_m(x) = 10 + 0.5 × 4 = 12`.
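The mini example above as code, assuming `η = 0.5` for illustration:

```python
# One boosting stage: new = old + learning_rate * tree_output
old_prediction = 10.0
tree_output = 4.0        # correction proposed by the newly added tree
learning_rate = 0.5      # assumed value for this illustration
new_prediction = old_prediction + learning_rate * tree_output
```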
(2) Objective = data fit + complexity penalty: `Obj = Σ_i l(y_i, ŷ_i) + Σ_k Ω(f_k)`
- `Σ_i l(y_i, ŷ_i)`: total prediction error over samples
- `Σ_k Ω(f_k)`: regularization term that penalizes an overly complex model
Interpretation: We optimize accuracy, but we also penalize complexity to reduce overfitting.
Mini example: if fit loss is 18 and regularization is 3,
`Obj = 18 + 3 = 21`.
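For concreteness, XGBoost's documented per-tree penalty is `Ω(f) = γT + ½λ Σ_j w_j²`, where `T` is the number of leaves and `w_j` the leaf weights. A tiny sketch that reproduces the mini example's numbers (the leaf weights here are chosen purely for illustration):

```python
# XGBoost-style complexity penalty for one tree.
def omega(leaf_weights, gamma=1.0, lam=1.0):
    T = len(leaf_weights)                                  # number of leaves
    return gamma * T + 0.5 * lam * sum(w * w for w in leaf_weights)

fit_loss = 18.0                 # data-fit term from the mini example
penalty = omega([1.0, -1.0])    # 2 leaves: 1*2 + 0.5*(1 + 1) = 3.0
objective = fit_loss + penalty  # matches the mini example: 21.0
```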
(3) Derivatives used for split gain
`g_i = ∂ l(y_i, ŷ_i) / ∂ŷ_i`, `h_i = ∂² l(y_i, ŷ_i) / ∂ŷ_i²`
- `g_i`: first derivative (gradient) - direction/strength to reduce error
- `h_i`: second derivative (curvature) - how sharply loss changes, used for stabilization
Interpretation: XGBoost-style methods use both `g_i` and `h_i` to compute split gain more stably than using only first-order information.
Intuition:
- a large `|g_i|` often means that sample is still poorly predicted,
- `h_i` acts like a damping signal that helps avoid overly aggressive updates.
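Concretely, XGBoost's split gain compares the parent's second-order score against the two children's scores, using the sums of `g_i` and `h_i` on each side (`λ` and `γ` are the regularization knobs). A sketch with made-up gradient/hessian sums:

```python
def split_gain(gl, hl, gr, hr, lam=1.0, gamma=0.0):
    """XGBoost-style gain: improvement in the second-order objective when a
    node with gradient/hessian sums (gl+gr, hl+hr) is split into two children."""
    def score(g, h):
        return g * g / (h + lam)
    return 0.5 * (score(gl, hl) + score(gr, hr) - score(gl + gr, hl + hr)) - gamma

# Hypothetical sums: left child's gradients total -6, right child's +4
gain = split_gain(gl=-6.0, hl=3.0, gr=4.0, hr=2.0)
```

A positive gain means the split improves the regularized objective; a larger `λ` shrinks every score and makes marginal splits less attractive.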
🔵 Shared parameters: volume knob and magnifier
① `learning_rate`: think of this as a volume knob. It controls how strongly a new tree's correction is applied. Lower values are steadier but usually need more rounds (`n_estimators`). Higher values can learn faster but may become unstable or overfit.
② `n_estimators` / `iterations`: how many correction rounds (trees) to stack.
③ `max_depth` / `depth`: the magnification level of the tree. Deeper trees can capture fine patterns, but also memorize noise more easily. A practical start is `learning_rate` in `0.03–0.1` and depth around `4–8`.
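The volume-knob trade-off shows up even in an idealized model: if every tree fit the remaining residual perfectly, the leftover error after `n` rounds would shrink like `(1 - learning_rate) ** n`. A small sketch (the tolerance and rates are illustrative):

```python
# Idealized boosting: each round removes a learning_rate fraction of the error.
def rounds_needed(learning_rate, tolerance=0.01):
    residual, n = 1.0, 0
    while residual > tolerance:
        residual *= (1.0 - learning_rate)  # one boosting round
        n += 1
    return n

fast = rounds_needed(0.1)    # bigger steps -> fewer rounds needed
slow = rounds_needed(0.03)   # smaller, steadier steps -> more rounds needed
```

Real training is noisier than this, but the direction holds: halving `learning_rate` roughly doubles the `n_estimators` you need.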
🟣 XGBoost: balancing accuracy and conservativeness
① `subsample`: use only part of rows per tree to reduce overfitting.
② `colsample_bytree`: use only part of features per tree to avoid over-reliance on a few columns.
③ `min_child_weight`: blocks weak splits with too little evidence.
④ `reg_lambda` / `reg_alpha`: strong regularization brakes to keep model complexity under control.
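As a sketch, these knobs can be collected into a parameter dictionary for XGBoost's native `xgb.train` API. The names follow the official docs (`eta`, `lambda`, and `alpha` are the native spellings of `learning_rate`, `reg_lambda`, and `reg_alpha`); the values are illustrative starting points, not recommendations.

```python
# Hypothetical XGBoost starting configuration (values are illustrative).
xgb_params = {
    "objective": "binary:logistic",
    "eta": 0.05,              # learning rate
    "max_depth": 6,
    "subsample": 0.8,         # row sampling per tree
    "colsample_bytree": 0.8,  # feature sampling per tree
    "min_child_weight": 5,    # block splits with too little evidence
    "lambda": 1.0,            # L2 regularization (reg_lambda)
    "alpha": 0.0,             # L1 regularization (reg_alpha)
}
```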
🟢 LightGBM: leaf-wise growth, so leaf control is everything
① `num_leaves`: max number of leaves. Larger values increase modeling power but also overfitting risk (often set below `2^max_depth`).
② `min_data_in_leaf`: minimum samples per leaf; prevents tiny, unstable leaves.
③ `feature_fraction` / `bagging_fraction`: sampling controls analogous to XGBoost's `colsample_bytree` and `subsample`; lowering from 1.0 often helps when overfitting appears.
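The same sketch for LightGBM's `lgb.train`-style parameter dictionary; names follow the official docs, values are illustrative starting points.

```python
# Hypothetical LightGBM starting configuration (values are illustrative).
lgb_params = {
    "objective": "binary",
    "learning_rate": 0.05,
    "num_leaves": 31,          # keep below 2 ** max_depth
    "max_depth": 7,
    "min_data_in_leaf": 50,    # prevent tiny, unstable leaves
    "feature_fraction": 0.8,   # like XGBoost's colsample_bytree
    "bagging_fraction": 0.8,   # like XGBoost's subsample
    "bagging_freq": 1,         # resample rows every iteration
}
```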
🟠 CatBoost: category-aware model
① `cat_features`: most important. Explicitly mark which columns are categorical so CatBoost can apply its strengths.
② `depth` and `iterations`: same idea as other boosters, but CatBoost's symmetric trees can be more sensitive to depth.
③ `l2_leaf_reg`: smooths overly extreme predictions with regularization.
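And a hedged sketch for CatBoost, as keyword arguments for `CatBoostClassifier(**cat_params)`. The parameter names come from the official API; the column names in `cat_features` are hypothetical (CatBoost also accepts column indices), and the values are illustrative.

```python
# Hypothetical CatBoost starting configuration (values are illustrative).
cat_params = {
    "iterations": 500,
    "depth": 6,                 # symmetric trees: depth is a sensitive knob
    "learning_rate": 0.05,
    "l2_leaf_reg": 3.0,         # smooth overly extreme leaf values
    # Hypothetical categorical column names -- mark them explicitly:
    "cat_features": ["city", "job_title", "product_id"],
}
```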
The go-to baseline for tabular work
For many database / CSV problems, gradient boosting is fast, accurate, and simpler to iterate than a full deep-learning stack. Teams routinely reach for it before designing exotic neural nets.
Pick the weapon to match the data
- Need stability and mature tooling on medium-sized data? XGBoost
- Need training speed and memory efficiency at very large scale? LightGBM
- Drowning in categorical columns and want sane defaults? CatBoost
Hyperparameters are the steering wheel
`learning_rate`, tree depth / leaves, `n_estimators`, early stopping—these jointly control the bias–variance trade-off and compute cost. Understanding how they interact lets you tune without guessing.
① Pipeline pattern
Clean missing values and categories → split train / validation → fit a booster → explain with SHAP or feature importance for stakeholders → ship and monitor.
② Early stopping
More trees are not always better—eventually you memorize the training set. When validation loss plateaus or worsens, stop and keep the best iteration. In production this is standard practice.
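Early stopping is just bookkeeping over the validation curve. A minimal sketch with simulated losses (the patience value and loss numbers are made up; real libraries expose this via an `early_stopping_rounds`-style option):

```python
# Generic early stopping: stop after `patience` rounds without improvement,
# and remember the best iteration seen so far.
def early_stop(val_losses, patience=3):
    best_loss, best_iter, since_improved = float("inf"), -1, 0
    for i, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_iter, since_improved = loss, i, 0
        else:
            since_improved += 1
            if since_improved >= patience:
                break  # validation has stalled or worsened
    return best_iter, best_loss

# Simulated curve: improves, plateaus, then worsens (memorization sets in)
losses = [0.60, 0.45, 0.38, 0.35, 0.36, 0.36, 0.37, 0.39]
best_iter, best_loss = early_stop(losses)
```

Training stops three rounds after the round-3 minimum, and the ensemble is truncated back to that best iteration.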
③ Align metrics with the business
- Classification (churn, fraud): look beyond accuracy—AUC, F1, precision/recall at a chosen threshold.
- Regression (demand, price): track RMSE / MAE in units stakeholders understand.