Everyone's AI
Machine learningAI papers
Loading...

Learn

🏅My achievements

Ch.08

Directional Derivative and Gradient: Steepest Ascent in Multidimensional Space

Math diagram by chapter

Select a chapter to see its diagram below. View the flow of intermediate math at a glance.

Directional Derivative and Gradient: Finding the Way Up

On a contour map, the gradient ∇f\nabla f∇f is at right angles to the level curves and points straight up the steepest hill. The directional derivative DufD_{\mathbf{u}} fDu​f asks “how much do I climb if I step in direction u\mathbf{u}u?”—one dot product: ∇f⋅u\nabla f \cdot \mathbf{u}∇f⋅u.

How to read the figure: red/yellow = big error (peaks); blue/purple = small error (valleys). The black line is the path downhill to reduce error (gradient descent). The purple arrow is the steepest uphill direction (∇f\nabla f∇f) at that spot.
Imagine an explorer in thick fog who cannot see a step ahead and must find the deepest valley of a rugged mountain range—the place where error is smallest. With no view of the terrain, they can only rely on the slope felt underfoot to descend. An AI learning from vast data is exactly this precarious climb. Each mistaken prediction builds a loss that forms a huge three-dimensional mountain range.
At that spot, the measure that answers “if I go east from here, how steep is it?” is the directional derivative. The miraculous compass that, among all directions around you, points without error to the steepest uphill is the gradient ∇f\nabla f∇f. This chapter explains in rich, terrain-map metaphors how the gradient—often called the flower of calculus—guides AI safely down complex error mountains, without hiding behind a wall of symbols.

Directional Derivative and Gradient: Finding the Way Up

1. Multivariable functions and contours: reading 3D terrain in 2D
A flat 2D map becomes informative when winding contour lines show peaks and sunken valleys. Lines packed tight mean a cliff you would sweat to climb; lines spread wide mean gentle, easy ground. The loss AI computes while learning forms the same kind of vast, rugged multidimensional range. Through math we read those invisible contours and see intuitively whether error is rising sharply or settling lower.
2. Partial derivatives: measuring slope along only one axis
Stop on a rugged hillside. Ask: “If I ignore every other direction and take one step due east (xxx-axis only), what is the slope?” Or: “If I walk due north (yyy-axis only), uphill or downhill?” Measuring slope along one axis only is a partial derivative, written ∂f∂x\frac{\partial f}{\partial x}∂x∂f​ with the round ∂\partial∂ symbol. It is narrow but foundational for every calculation.
3. Directional derivative: slope along the path you are actually facing
An explorer need not walk only the four cardinal directions. You may choose northeast at 30°, or a slanting southwest path—any heading in 360°. The instantaneous rate at which height changes when you take a tiny step in that chosen direction is the directional derivative DufD_{\mathbf{u}} fDu​f: the felt slope of the trail you are looking at right now.
4. Gradient: the miraculous compass to the steepest uphill
Among all directions around you, exactly one points up the most brutally steep climb toward the summit. Combine the xxx- and yyy-slopes into one arrow (vector): the gradient ∇f\nabla f∇f. This arrow always shoots perpendicular to contours—the shortest way across level curves. Its direction is steepest uphill nearby; its length is how steep that climb is (maximum slope).
5. The dot product’s “magic” link to the gradient
You need not redo heavy calculations for every direction. Dot the gradient (your best compass) with your direction vector and the slope along that direction appears: Duf=∇f⋅uD_{\mathbf{u}} f = \nabla f \cdot \mathbf{u}Du​f=∇f⋅u. Walk exactly where the gradient points and you face the steepest uphill on earth at that spot.
In one line: read the loss range like a contour map; any-direction slope Duf=∇f⋅uD_{\mathbf{u}} f = \nabla f \cdot \mathbf{u}Du​f=∇f⋅u; steepest uphill ∇f\nabla f∇f; one descent step wk+1=wk−η∇L\mathbf{w}_{k+1}=\mathbf{w}_k-\eta\nabla Lwk+1​=wk​−η∇L.
Training (optimization) to make an AI smarter is a hard journey to the calm floor of the deepest valley—where error is smallest—in a vast mountain range. With hundreds of thousands of weights, the terrain becomes millions of dimensions beyond imagination. Blind walking in such fog can wander forever without finding the valley.
Then the gradient ∇L\nabla L∇L acts like a miraculous navigator: it points precisely to where error would explode uphill fastest. The AI only turns opposite that finger and takes quiet steps downhill. Without this mathematical compass, deep learning cannot train; countless weights would wander with no idea which way to change.
1. The beating heart of deep learning: gradient descent
All these ideas converge in one great algorithm: gradient descent. The core update wk+1=wk−η∇L(wk)\mathbf{w}_{k+1} = \mathbf{w}_k - \eta \nabla L(\mathbf{w}_k)wk+1​=wk​−η∇L(wk​) shows how AI moves. Here ∇L\nabla L∇L is the “steepest uphill” direction; the minus sign means “I will descend carefully opposite that uphill.”
Learning rate η\etaη is the explorer’s stride. Too large a stride leaps over the target valley onto the opposite peak; too small an ant-like stride never reaches the bottom before training ends. In practice, tuning stride for the environment often decides success.
2. Visualizing loss surfaces to check model health
Papers and data-science work often show colorful 3D mountain plots or contour heatmaps that darken and lighten. Researchers compress billions of unknown weights into 2–3 visible dimensions and sketch the error surface to see whether training is smooth. They watch whether the descent path zigzags nervously on the heatmap or glides down like a sled, diagnosing model health and learning structure by eye.
Hold three formulas:
① Duf=∇f⋅uD_{\mathbf{u}} f = \nabla f \cdot \mathbf{u}Du​f=∇f⋅u (unit u\mathbf{u}u).
② ∇f\nabla f∇f ⊥ contours, steepest uphill.
③ wk+1=wk−η∇L\mathbf{w}_{k+1}=\mathbf{w}_k-\eta\nabla Lwk+1​=wk​−η∇L (check minus and η\etaη). Workflow: ∇f\nabla f∇f or ∇L\nabla L∇L → normalize u\mathbf{u}u → dot product.
  • In wordsDirectional derivative
  • MeaningSlope along direction u\mathbf{u}u
  • In wordsGradient
  • MeaningSteepest uphill ∇f\nabla f∇f
  • In wordsKey formula
  • MeaningDuf=∇f⋅uD_{\mathbf{u}} f = \nabla f \cdot \mathbf{u}Du​f=∇f⋅u
  • In wordsContours
  • Meaning∇f\nabla f∇f ⊥ contours; along a contour, slope 0
  • In wordsGradient descent
  • Meaningwk+1=wk−η∇L\mathbf{w}_{k+1}=\mathbf{w}_k-\eta\nabla Lwk+1​=wk​−η∇L
  • In wordsFlat spot
  • Meaning∇f≈0\nabla f\approx\mathbf{0}∇f≈0 → critical-point candidate
In wordsMeaning
Directional derivativeSlope along direction u\mathbf{u}u
GradientSteepest uphill ∇f\nabla f∇f
Key formulaDuf=∇f⋅uD_{\mathbf{u}} f = \nabla f \cdot \mathbf{u}Du​f=∇f⋅u
Contours∇f\nabla f∇f ⊥ contours; along a contour, slope 0
Gradient descentwk+1=wk−η∇L\mathbf{w}_{k+1}=\mathbf{w}_k-\eta\nabla Lwk+1​=wk​−η∇L
Flat spot∇f≈0\nabla f\approx\mathbf{0}∇f≈0 → critical-point candidate
① ∇f\nabla f∇f first.
② Unit u\mathbf{u}u.
③ Duf>0D_{\mathbf{u}} f>0Du​f>0 = uphill.
④ η\etaη too big jumps; too small crawls.

Worked examples

Example 1 — Definition
Question: For f(x,y)=x2+y2f(x,y)=x^2+y^2f(x,y)=x2+y2 at (1,1)(1,1)(1,1), what are the gradient and its length?
Solution: ∇f=(2,2)\nabla f=(2,2)∇f=(2,2), ∥∇f∥=22\|\nabla f\|=2\sqrt{2}∥∇f∥=22​.

Example 2 — Directional derivative
Question: At the same point, with unit u=(1/2,0)\mathbf{u}=(1/\sqrt{2},0)u=(1/2​,0), what is DufD_{\mathbf{u}} fDu​f?
Solution: Duf=(2,2)⋅(1/2,0)=2D_{\mathbf{u}} f=(2,2)\cdot(1/\sqrt{2},0)=\sqrt{2}Du​f=(2,2)⋅(1/2​,0)=2​ (dot-product formula).

Example 3 — Maximal ascent
Question: Among unit directions u\mathbf{u}u, which direction maximizes DufD_{\mathbf{u}} fDu​f, and what is the maximum?
Solution: Direction same as ∇f\nabla f∇f; maximum value ∥∇f∥\|\nabla f\|∥∇f∥.

Example 4 — Contours
Question: Walking along a contour, what is DufD_{\mathbf{u}} fDu​f? How does ∇f\nabla f∇f relate to the contour?
Solution: Duf=0D_{\mathbf{u}} f=0Du​f=0 along a tangent; ∇f\nabla f∇f is perpendicular to the contour.

Example 5 — Gradient descent
Question: L=w12+w22L=w_1^2+w_2^2L=w12​+w22​, w=(2,1)\mathbf{w}=(2,1)w=(2,1), η=0.25\eta=0.25η=0.25 — what is w\mathbf{w}w after one step?
Solution: ∇L=(4,2)\nabla L=(4,2)∇L=(4,2), so wnew=(2,1)−0.25(4,2)=(1,0.5)\mathbf{w}_{\text{new}}=(2,1)-0.25(4,2)=(1,0.5)wnew​=(2,1)−0.25(4,2)=(1,0.5).

Example 6 — Stuck training
Question: If loss stops decreasing, what might the gradient suggest?
Solution: You may be near ∇L≈0\nabla L\approx\mathbf{0}∇L≈0 on flat ground (a critical-point candidate).

Practice

At ∇f(x)=0\nabla f(\mathbf{x})=0∇f(x)=0, which holds?
1 / 10