Everyone's AI
Ch.02

Optimization: Momentum and Adaptive Learning Rate

Training an AI model is like wearing a blindfold while hiking a huge mountain range toward the deepest valley (the minimum error). Optimization is the navigation that picks which direction and how large a step to take from where you stand.
After Ch.01 set the starting point, this chapter teaches skills to descend safely and quickly: walking step by step with SGD, sledding with Momentum, and self-driving with Adam that adapts its stride to the terrain. We unpack the core optimizers you will use every day—intuitively and clearly.
Blindfolded on the same loss mountain, SGD, Momentum, and Adam pick different routes — simplified valley comparison below.

[Figure: loss curves for SGD, Momentum, and Adam descending the same simplified valley]

Red (SGD) zig-zags more while descending and its sideways wiggle lasts longer. Green (Momentum) damps the oscillation but still ends slightly off the valley center. Blue (Adam) reaches the bottom center fastest—so descent speed and final position differ clearly (illustrative).

Flow: forward → loss → backward → optimizer step

Update: θ ← θ − η · (step from Adam, etc.)

  • ① SGD: step opposite the gradient each time (noisy minibatches → zig-zag).
  • ② Momentum: accumulate velocity v — smoother turns.
  • ③ Adam: adaptive per-coordinate step sizes.
  • ④ Practice: tune with logs, schedules, initialization (Ch.01).

Optimization Algorithms: Tuning Speed and Direction Wisely

1. Gradient descent & SGD: walk against the uphill gradient
Concept: The reliable way downhill is to feel the slope under your feet and take steps along the steepest descent — that is the heart of gradient descent.
Intuition: Picture descending Hallasan in thick fog. If your stride (learning rate) is too wide, you may fall off a cliff or bounce onto the opposite ridge; if it is too narrow, sunset may arrive before you reach the valley.
Core equation:
θ ← θ − η ∇L(θ)
- θ: where you stand (model weights)
- η: step size — the learning rate (often 0.01, 0.001, …)
- ∇L: the slope (gradient) at the current point
Practical tip: Scanning the full map every time is slow, so we usually follow stochastic gradient descent (SGD) — pick a minibatch, estimate ĝ, and step quickly.
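The SGD update can be sketched in a few lines of NumPy. This is a minimal illustration, not a training recipe: the toy quadratic loss, the function name `sgd_step`, and the learning rate are all assumptions chosen for clarity.

```python
import numpy as np

# One SGD step on a toy quadratic loss L(θ) = ||θ||² / 2,
# whose gradient is simply θ. All names and values are illustrative.
def sgd_step(theta, grad, lr=0.1):
    """Take one gradient-descent step: θ ← θ − η·g."""
    return theta - lr * grad

theta = np.array([2.0, -1.0])
for _ in range(50):
    grad = theta                    # ∇L(θ) = θ for this toy loss
    theta = sgd_step(theta, grad, lr=0.1)

print(theta)  # both coordinates shrink toward the minimum at [0, 0]
```

With `lr=0.1`, each step multiplies θ by 0.9, so the iterate contracts geometrically toward the minimum — the "stride too wide / too narrow" trade-off shows up if you replace 0.1 with, say, 2.5 (divergence) or 1e-5 (sunset before the valley).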
2. Momentum: a bowling ball on ice
Concept: Plain SGD only looks at the local slope, so in a bumpy narrow valley it zig-zags and wastes time. Momentum adds inertia from past moves.
Intuition: A paper cup turns at every pebble; a heavy bowling ball keeps rolling through small bumps. Momentum gives the optimizer that kind of “mass.”
Core updates:
v ← βv + (1 − β)g
θ ← θ − ηv
- v: velocity (accumulated direction)
- β: how much past motion to keep (often 0.9 — keep ~90% of the old velocity)
- g: gradient at the current point
Extra: Nesterov momentum evaluates g at a lookahead point along v.
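The two momentum equations above translate directly into code. The elongated toy loss below (steep in one coordinate, shallow in the other) is an illustrative assumption to mimic the "bumpy narrow valley" from the text.

```python
import numpy as np

# Momentum update v ← βv + (1−β)g, θ ← θ − ηv on a toy elongated
# quadratic L(θ) = (10·x² + y²) / 2. Hyperparameters are illustrative.
def momentum_step(theta, v, grad, lr=0.1, beta=0.9):
    v = beta * v + (1 - beta) * grad   # accumulate velocity (inertia)
    return theta - lr * v, v

theta = np.array([1.0, 1.0])
v = np.zeros_like(theta)
curvature = np.array([10.0, 1.0])      # steep vs. shallow direction
for _ in range(200):
    grad = curvature * theta           # ∇L(θ) = diag(10, 1) · θ
    theta, v = momentum_step(theta, v, grad)

print(theta)  # both coordinates approach 0
```

Plain SGD with the same learning rate would oscillate along the steep coordinate; the velocity term averages out those sign flips, which is the "bowling ball" effect in code.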
3. Adaptive optimizers (AdaGrad, RMSProp, Adam): brake each wheel separately
Concept: Some parameters are almost there; others still have far to go. Instead of one global η, adaptive methods rescale each coordinate from gradient statistics.
How they evolved:
- AdaGrad: “Paths we walked a lot — shrink the step there.” It accumulates squared gradients so busy coordinates slow down.
- RMSProp: Fixes AdaGrad’s issue (steps can shrink to ~0 forever) by forgetting very old history with an EMA.
- Adam: Combines momentum (direction) and RMSProp-like scaling — a default choice in modern deep learning.
Practical tip: Papers often use AdamW, which decouples weight decay from the loss for better regularization.
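Adam's "momentum plus RMSProp-like scaling" idea fits in a short sketch. This is a didactic implementation with the commonly cited default hyperparameters, not a substitute for a library optimizer; the toy loss and loose tolerances are assumptions.

```python
import numpy as np

# Adam sketch: EMAs of the gradient (m, direction) and squared gradient
# (s, per-coordinate scale), with bias correction for early steps.
def adam_step(theta, m, s, grad, t, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad           # 1st moment (momentum)
    s = b2 * s + (1 - b2) * grad**2        # 2nd moment (RMSProp-like)
    m_hat = m / (1 - b1**t)                # bias correction (t starts at 1)
    s_hat = s / (1 - b2**t)
    theta = theta - lr * m_hat / (np.sqrt(s_hat) + eps)
    # AdamW would additionally apply decoupled decay here:
    # theta = theta - lr * weight_decay * theta
    return theta, m, s

theta = np.array([2.0, -1.0])
m = np.zeros_like(theta)
s = np.zeros_like(theta)
for t in range(1, 501):
    grad = theta                           # toy loss L(θ) = ||θ||² / 2
    theta, m, s = adam_step(theta, m, s, grad, t)

print(theta)
```

Note how the effective step is roughly `lr` times a per-coordinate ratio of recent gradient mean to recent gradient RMS — that is the "brake each wheel separately" behavior.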
4. Three goals: stability, speed, generalization
Concept: Choosing an optimizer is not only about reaching the bottom fast. Which valley you land in changes test performance.
Intuition: A bullet train (Adam) may arrive first; a local train (SGD+momentum) can discover quieter minima with better generalization — both stories appear in practice.
Practical tip: Pair optimizers with warmup (gentle strides early) and learning-rate schedulers (smaller steps near the end).
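A warmup + cosine schedule like the one recommended above can be written as a single function. The step counts and peak learning rate here are illustrative placeholders, not recommendations.

```python
import math

# Linear warmup followed by cosine decay. All values are illustrative.
def lr_at(step, total_steps=1000, warmup_steps=100, peak_lr=1e-3):
    if step < warmup_steps:
        # gentle strides early: ramp linearly from ~0 up to peak_lr
        return peak_lr * (step + 1) / warmup_steps
    # smaller steps near the end: cosine decay from peak_lr toward 0
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * peak_lr * (1 + math.cos(math.pi * progress))

print(lr_at(0))     # small first stride during warmup
print(lr_at(100))   # peak right after warmup ends
print(lr_at(999))   # nearly zero at the end of training
```

In practice you would query this once per optimizer step and set the optimizer's learning rate accordingly; frameworks ship equivalent schedulers, but the shape is exactly this.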

Formulas in plain words

SGD step θ ← θ − η ĝ — ĝ is a minibatch estimate; η is the step size.
Momentum v ← βv + (1 − β)g, θ ← θ − ηv — past directions accumulate in v to smooth zig-zags.
Adam (idea) — EMA of gradients and squared gradients per coordinate; bias correction in early steps.
Adaptive intuition — large historical gradients → smaller effective steps per coordinate.
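The adaptive intuition is easy to verify numerically with AdaGrad, the simplest accumulator. In this assumed setup one coordinate has a gradient 100× larger than the other, yet both move the same distance, because each coordinate's step is divided by the square root of its own accumulated history.

```python
import numpy as np

# AdaGrad sketch: accumulate squared gradients per coordinate and divide.
def adagrad_step(theta, acc, grad, lr=0.1, eps=1e-8):
    acc = acc + grad**2                          # history never forgotten
    theta = theta - lr * grad / (np.sqrt(acc) + eps)
    return theta, acc

theta = np.array([0.0, 0.0])
acc = np.zeros_like(theta)
grads = np.array([10.0, 0.1])    # one "busy" coordinate, one quiet one
for _ in range(3):
    theta, acc = adagrad_step(theta, acc, grads)

# With constant gradients, each step is lr · g / (|g|·√t) = lr / √t —
# identical for both coordinates despite the 100× gradient gap.
print(theta)
```

This also shows AdaGrad's weakness: the `lr / √t` factor keeps shrinking forever, which is exactly what RMSProp's exponential forgetting fixes.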

Why it matters

Time and money
If the learning rate is too large, optimization may diverge; if too small, a one-hour run can stretch to a week. Good optimizer + LR settings are the “magic” that saves GPU bills and late nights.
Generalization — your “test score”
With the same data, different optimizers can yield different quality. Which minimum you settle in changes test accuracy. Strong engineers match the tool to the problem.
First thermometer when the model “gets sick”
If loss won’t drop or NaNs appear, suspect learning rate and optimizer first. Knowing this lets you debug calmly instead of panicking.

How it is used

① Keep a lab notebook — change one knob at a time
APIs differ by library, but the workflow is similar: record learning rate, batch size, optimizer, and random seed. When training misbehaves, change one setting at a time to isolate the cause. Jittery loss → revisit batch, LR, and momentum; updates that fade away after many epochs → consider moving from AdaGrad-style accumulators to RMSProp / Adam. Practice pairing symptoms with levers.
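One low-tech way to keep that lab notebook is a config file per run. The field names and values below are illustrative assumptions — the point is recording every knob before launch so later comparisons change exactly one of them.

```python
import json
import random

# One "lab notebook" entry: every knob recorded before the run starts.
# Field names and values are illustrative, not a fixed standard.
run_config = {
    "optimizer": "AdamW",
    "lr": 3e-4,
    "batch_size": 64,
    "weight_decay": 0.01,
    "seed": 42,
    "scheduler": "cosine+warmup",
}

random.seed(run_config["seed"])         # fix randomness for reproducibility

with open("run_0001.json", "w") as f:   # one file per experiment
    json.dump(run_config, f, indent=2)

print(json.dumps(run_config, sort_keys=True))
```

When a run misbehaves, diffing two such files immediately shows which single knob changed — far more reliable than memory.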
② Optimizer cheat sheet
| Situation | Pick | Why |
| --- | --- | --- |
| Need a quick baseline | `Adam` or `AdamW` | Adaptive steps — less sensitive to initial LR |
| NLP / transformers | `AdamW` | Often very stable on sparse, structured objectives |
| Push CNN accuracy to the limit | `SGD + Momentum` | Harder to tune but can generalize better at the sweet spot |
③ Monitoring — don’t look away
Launch isn’t the end of the flight. Watch the loss curve live (TensorBoard, Weights & Biases). If it saws like a gear, it may be time to lower the learning rate.

Summary

Optimization converts gradient information into update steps to reduce the loss L(θ).
SGD updates with the minibatch gradient ĝ, Momentum smooths zig-zag via velocity v, and Adam/AdamW adapt per-coordinate step sizes using first/second moments.
Practical debugging summary (symptom → first checks)
- Loss oscillation: `lr`, momentum, batch size
- Early divergence/NaN: initialization, `lr`, `grad_norm`, clipping
- Slow/plateaued learning: scheduler (with warmup), optimizer switch (SGD↔AdamW)
- Validation stagnation: weight decay, augmentation, early stopping
Tuning order (quick decision)
1) Validate logs → 2) tune `lr` first → 3) choose optimizer → 4) combine with scheduler → 5) add stabilizers → 6) pick by mean performance + variance + reproducibility
Operating rule: change one variable at a time, and record `optimizer/lr/batch_size/weight_decay/seed/scheduler` for comparison.

How to approach problems

Optimization is the process of deciding how to update parameters θ using gradients from backpropagation to reduce the loss L(θ). Basic SGD takes a step θ ← θ − η ĝ with a minibatch gradient ĝ, and the learning rate η sets the step size. Momentum accumulates velocity v to reduce zig-zag in narrow valleys, while Adam/AdamW adapt per-coordinate steps using first and second moments. When loss oscillates or diverges, check the learning rate, batch size, and LR scheduler together—not only the optimizer name.
Example (definition)
"What is the core role of Momentum?
① It sets LR to 0
② It accumulates past directions to reduce oscillation
③ It skips backprop"
Momentum keeps directional inertia through velocity v. → Answer 2

Example (scenario)
"When training loss oscillates heavily, what should be checked first?
① Learning rate, momentum, batch size
② Zero training data
③ Delete all layers"
Oscillation is tied to step size and gradient noise, so check ① first. → Answer 1

Example (calculation)
If η = 0.001 and g = 20, what is the SGD update magnitude η·g?
0.001 × 20 = 0.02. → Answer 0.02
Definition example — "Which does Adam use together?
① 1st and 2nd moments
② Batch index only
③ Dropout mask only" → Adam uses first and second moments. Answer 1

True/False example — "RMSProp uses an EMA of squared gradients." → True. Answer 1

Application example — "If early training is unstable, what to check first?
① warmup + LR schedule
② disable backprop
③ delete data" → Check warmup and schedule first. Answer 1

Choice example — "A defining trait of Nesterov is?
① gradient at lookahead point
② current point only
③ no gradient" → Nesterov uses lookahead. Answer 1

Concept example — "In AdaGrad, the effective step size of frequently updated coordinates tends to?
① decrease
② stay constant
③ increase" → It tends to decrease due to accumulation. Answer 1

Calculation example — "If sample count is 64 and batch size is 16, how many steps per epoch?" → 64 / 16 = 4. Answer 4