Everyone's AI
Ch.02

Optimization: Momentum and Adaptive Learning Rate

Training an AI model is like wearing a blindfold while hiking a huge mountain range toward the deepest valley (the minimum error). Optimization is the navigation that picks which direction and how large a step to take from where you stand.
After Ch.01 set the starting point, this chapter teaches skills to descend safely and quickly: walking step by step with SGD, sledding with Momentum, and self-driving with Adam that adapts its stride to the terrain. We unpack the core optimizers you will use every day—intuitively and clearly.
Blindfolded on the same loss mountain, SGD, Momentum, and Adam pick different routes — simplified valley comparison below.

[Figure: loss curves for SGD, Momentum, and Adam descending the same simplified valley]

Red (SGD) zig-zags more while descending and its sideways wiggle lasts longer. Green (Momentum) damps the oscillation but still ends slightly off the valley center. Blue (Adam) reaches the bottom center fastest—so descent speed and final position differ clearly (illustrative).

Flow: forward → loss → backward → optimizer step

Update: θ ← θ − η · (step from Adam, etc.)

  • ① SGD: step opposite the gradient each time (noisy minibatches → zig-zag).
  • ② Momentum: accumulate velocity v — smoother turns.
  • ③ Adam: adaptive per-coordinate step sizes.
  • ④ Practice: tune with logs, schedules, initialization (Ch.01).

Optimization Algorithms: Tuning Speed and Direction Wisely

1. Gradient descent & SGD: walk against the uphill gradient
Concept: The reliable way downhill is to feel the slope under your feet and take steps along the steepest descent — that is the heart of gradient descent.
Intuition: Picture descending Hallasan in thick fog. If your stride (learning rate) is too wide, you may fall off a cliff or bounce onto the opposite ridge; if it is too narrow, sunset may arrive before you reach the valley.
Core equation:
θ ← θ − η ∇L(θ)
- θ: where you stand (model weights)
- η: step size — the learning rate (often 0.01, 0.001, …)
- ∇L: the slope (gradient) at the current point
Practical tip: Scanning the full map every time is slow, so we usually follow stochastic gradient descent (SGD) — pick a minibatch, estimate ĝ, and step quickly.
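The SGD update can be sketched in a few lines of NumPy. This is a minimal illustration, not a training recipe: the toy quadratic loss, the function name `sgd_step`, and the learning rate are all assumptions chosen for clarity.

```python
import numpy as np

# One SGD step on a toy quadratic loss L(θ) = ||θ||² / 2,
# whose gradient is simply θ. All names and values are illustrative.
def sgd_step(theta, grad, lr=0.1):
    """Take one gradient-descent step: θ ← θ − η·g."""
    return theta - lr * grad

theta = np.array([2.0, -1.0])
for _ in range(50):
    grad = theta                    # ∇L(θ) = θ for this toy loss
    theta = sgd_step(theta, grad, lr=0.1)

print(theta)  # both coordinates shrink toward the minimum at [0, 0]
```

With `lr=0.1`, each step multiplies θ by 0.9, so the iterate contracts geometrically toward the minimum — the "stride too wide / too narrow" trade-off shows up if you replace 0.1 with, say, 2.5 (divergence) or 1e-5 (sunset before the valley).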
2. Momentum: a bowling ball on ice
Concept: Plain SGD only looks at the local slope, so in a bumpy narrow valley it zig-zags and wastes time. Momentum adds inertia from past moves.
Intuition: A paper cup turns at every pebble; a heavy bowling ball keeps rolling through small bumps. Momentum gives the optimizer that kind of “mass.”
Core updates:
v ← βv + (1 − β)g
θ ← θ − ηv
- v: velocity (accumulated direction)
- β: how much past motion to keep (often 0.9 — keep ~90% of the old velocity)
- g: gradient at the current point
Extra: Nesterov momentum evaluates g at a lookahead point along v.
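The two momentum equations above translate directly into code. The elongated toy loss below (steep in one coordinate, shallow in the other) is an illustrative assumption to mimic the "bumpy narrow valley" from the text.

```python
import numpy as np

# Momentum update v ← βv + (1−β)g, θ ← θ − ηv on a toy elongated
# quadratic L(θ) = (10·x² + y²) / 2. Hyperparameters are illustrative.
def momentum_step(theta, v, grad, lr=0.1, beta=0.9):
    v = beta * v + (1 - beta) * grad   # accumulate velocity (inertia)
    return theta - lr * v, v

theta = np.array([1.0, 1.0])
v = np.zeros_like(theta)
curvature = np.array([10.0, 1.0])      # steep vs. shallow direction
for _ in range(200):
    grad = curvature * theta           # ∇L(θ) = diag(10, 1) · θ
    theta, v = momentum_step(theta, v, grad)

print(theta)  # both coordinates approach 0
```

Plain SGD with the same learning rate would oscillate along the steep coordinate; the velocity term averages out those sign flips, which is the "bowling ball" effect in code.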
3. Adaptive optimizers (AdaGrad, RMSProp, Adam): brake each wheel separately
Concept: Some parameters are almost there; others still have far to go. Instead of one global η, adaptive methods rescale each coordinate from gradient statistics.
How they evolved:
- AdaGrad: “Paths we walked a lot — shrink the step there.” It accumulates squared gradients so busy coordinates slow down.
- RMSProp: Fixes AdaGrad’s issue (steps can shrink to ~0 forever) by forgetting very old history with an EMA.
- Adam: Combines momentum (direction) and RMSProp-like scaling — a default choice in modern deep learning.
Practical tip: Papers often use AdamW, which decouples weight decay from the loss for better regularization.
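Adam's "momentum plus RMSProp-like scaling" idea fits in a short sketch. This is a didactic implementation with the commonly cited default hyperparameters, not a substitute for a library optimizer; the toy loss and loose tolerances are assumptions.

```python
import numpy as np

# Adam sketch: EMAs of the gradient (m, direction) and squared gradient
# (s, per-coordinate scale), with bias correction for early steps.
def adam_step(theta, m, s, grad, t, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad           # 1st moment (momentum)
    s = b2 * s + (1 - b2) * grad**2        # 2nd moment (RMSProp-like)
    m_hat = m / (1 - b1**t)                # bias correction (t starts at 1)
    s_hat = s / (1 - b2**t)
    theta = theta - lr * m_hat / (np.sqrt(s_hat) + eps)
    # AdamW would additionally apply decoupled decay here:
    # theta = theta - lr * weight_decay * theta
    return theta, m, s

theta = np.array([2.0, -1.0])
m = np.zeros_like(theta)
s = np.zeros_like(theta)
for t in range(1, 501):
    grad = theta                           # toy loss L(θ) = ||θ||² / 2
    theta, m, s = adam_step(theta, m, s, grad, t)

print(theta)
```

Note how the effective step is roughly `lr` times a per-coordinate ratio of recent gradient mean to recent gradient RMS — that is the "brake each wheel separately" behavior.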
4. Three goals: stability, speed, generalization
Concept: Choosing an optimizer is not only about reaching the bottom fast. Which valley you land in changes test performance.
Intuition: A bullet train (Adam) may arrive first; a local train (SGD+momentum) can discover quieter minima with better generalization — both stories appear in practice.
Practical tip: Pair optimizers with warmup (gentle strides early) and learning-rate schedulers (smaller steps near the end).
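A warmup + cosine schedule like the one recommended above can be written as a single function. The step counts and peak learning rate here are illustrative placeholders, not recommendations.

```python
import math

# Linear warmup followed by cosine decay. All values are illustrative.
def lr_at(step, total_steps=1000, warmup_steps=100, peak_lr=1e-3):
    if step < warmup_steps:
        # gentle strides early: ramp linearly from ~0 up to peak_lr
        return peak_lr * (step + 1) / warmup_steps
    # smaller steps near the end: cosine decay from peak_lr toward 0
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * peak_lr * (1 + math.cos(math.pi * progress))

print(lr_at(0))     # small first stride during warmup
print(lr_at(100))   # peak right after warmup ends
print(lr_at(999))   # nearly zero at the end of training
```

In practice you would query this once per optimizer step and set the optimizer's learning rate accordingly; frameworks ship equivalent schedulers, but the shape is exactly this.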

Formulas in plain words

SGD step θ ← θ − η ĝ — ĝ is a minibatch estimate; η is the step size.
Momentum v ← βv + (1 − β)g, θ ← θ − ηv — past directions accumulate in v to smooth zig-zags.
Adam (idea) — EMA of gradients and squared gradients per coordinate; bias correction in early steps.
Adaptive intuition — large historical gradients → smaller effective steps per coordinate.
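The adaptive intuition is easy to verify numerically with AdaGrad, the simplest accumulator. In this assumed setup one coordinate has a gradient 100× larger than the other, yet both move the same distance, because each coordinate's step is divided by the square root of its own accumulated history.

```python
import numpy as np

# AdaGrad sketch: accumulate squared gradients per coordinate and divide.
def adagrad_step(theta, acc, grad, lr=0.1, eps=1e-8):
    acc = acc + grad**2                          # history never forgotten
    theta = theta - lr * grad / (np.sqrt(acc) + eps)
    return theta, acc

theta = np.array([0.0, 0.0])
acc = np.zeros_like(theta)
grads = np.array([10.0, 0.1])    # one "busy" coordinate, one quiet one
for _ in range(3):
    theta, acc = adagrad_step(theta, acc, grads)

# With constant gradients, each step is lr · g / (|g|·√t) = lr / √t —
# identical for both coordinates despite the 100× gradient gap.
print(theta)
```

This also shows AdaGrad's weakness: the `lr / √t` factor keeps shrinking forever, which is exactly what RMSProp's exponential forgetting fixes.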

Why it matters

Time and money
If the learning rate is too large, optimization may diverge; if too small, a one-hour run can stretch to a week. Good optimizer + LR settings are the “magic” that saves GPU bills and late nights.
Generalization — your “test score”
With the same data, different optimizers can yield different quality. Which minimum you settle in changes test accuracy. Strong engineers match the tool to the problem.
First thermometer when the model “gets sick”
If loss won’t drop or NaNs appear, suspect learning rate and optimizer first. Knowing this lets you debug calmly instead of panicking.

How it is used

① Keep a lab notebook — change one knob at a time
APIs differ by library, but the workflow is similar: record learning rate, batch size, optimizer, and random seed. When training misbehaves, change one setting at a time to isolate the cause. Jittery loss → revisit batch, LR, and momentum; updates that fade away after many epochs → consider moving from AdaGrad-style accumulators to RMSProp / Adam. Practice pairing symptoms with levers.
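One low-tech way to keep that lab notebook is a config file per run. The field names and values below are illustrative assumptions — the point is recording every knob before launch so later comparisons change exactly one of them.

```python
import json
import random

# One "lab notebook" entry: every knob recorded before the run starts.
# Field names and values are illustrative, not a fixed standard.
run_config = {
    "optimizer": "AdamW",
    "lr": 3e-4,
    "batch_size": 64,
    "weight_decay": 0.01,
    "seed": 42,
    "scheduler": "cosine+warmup",
}

random.seed(run_config["seed"])         # fix randomness for reproducibility

with open("run_0001.json", "w") as f:   # one file per experiment
    json.dump(run_config, f, indent=2)

print(json.dumps(run_config, sort_keys=True))
```

When a run misbehaves, diffing two such files immediately shows which single knob changed — far more reliable than memory.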
② Optimizer cheat sheet
| Situation | Pick | Why |
| --- | --- | --- |
| Need a quick baseline | `Adam` or `AdamW` | Adaptive steps — less sensitive to initial LR |
| NLP / transformers | `AdamW` | Often very stable on sparse, structured objectives |
| Push CNN accuracy to the limit | `SGD + Momentum` | Harder to tune but can generalize better at the sweet spot |
③ Monitoring — don’t look away
Launch isn’t the end of the flight. Watch the loss curve live (TensorBoard, Weights & Biases). If it saws like a gear, it may be time to lower the learning rate.

Summary

Optimization converts gradient information into update steps to reduce the loss L(θ).
SGD updates with the minibatch gradient ĝ, Momentum smooths zig-zag via velocity v, and Adam/AdamW adapt per-coordinate step sizes using first/second moments.
Practical debugging summary (symptom → first checks)
- Loss oscillation: `lr`, momentum, batch size
- Early divergence/NaN: initialization, `lr`, `grad_norm`, clipping
- Slow/plateaued learning: scheduler (with warmup), optimizer switch (SGD↔AdamW)
- Validation stagnation: weight decay, augmentation, early stopping
Tuning order (quick decision)
1) Validate logs → 2) tune `lr` first → 3) choose optimizer → 4) combine with scheduler → 5) add stabilizers → 6) pick by mean performance + variance + reproducibility
Operating rule: change one variable at a time, and record `optimizer/lr/batch_size/weight_decay/seed/scheduler` for comparison.

How to approach problems

Optimization is the process of deciding how to update parameters θ using gradients from backpropagation to reduce the loss L(θ). Basic SGD takes a step θ ← θ − η ĝ with a minibatch gradient ĝ, and the learning rate η sets the step size. Momentum accumulates velocity v to reduce zig-zag in narrow valleys, while Adam/AdamW adapt per-coordinate steps using first and second moments. When loss oscillates or diverges, check the learning rate, batch size, and LR scheduler together—not only the optimizer name.
Example (definition)
"What is the core role of Momentum?
① It sets LR to 0
② It accumulates past directions to reduce oscillation
③ It skips backprop"
Momentum keeps directional inertia through velocity v. → Answer 2

Example (scenario)
"When training loss oscillates heavily, what should be checked first?
① Learning rate, momentum, batch size
② Zero training data
③ Delete all layers"
Oscillation is tied to step size and gradient noise, so check ① first. → Answer 1

Example (calculation)
If η = 0.001 and g = 20, what is the SGD update magnitude η·g?
0.001 × 20 = 0.02. → Answer 0.02
Definition example — "Which does Adam use together?
① 1st and 2nd moments
② Batch index only
③ Dropout mask only" → Adam uses first and second moments. Answer 1

True/False example — "RMSProp uses an EMA of squared gradients." → True. Answer 1

Application example — "If early training is unstable, what to check first?
① warmup + LR schedule
② disable backprop
③ delete data" → Check warmup and schedule first. Answer 1

Choice example — "A defining trait of Nesterov is?
① gradient at lookahead point
② current point only
③ no gradient" → Nesterov uses lookahead. Answer 1

Concept example — "In AdaGrad, the effective step size of frequently updated coordinates tends to?
① decrease
② stay constant
③ increase" → It tends to decrease due to accumulation. Answer 1

Calculation example — "If sample count is 64 and batch size is 16, how many steps per epoch?" → 64 / 16 = 4. Answer 4