Chapter 12
Gradient (Backpropagation)
The direction and rate of change of the loss with respect to parameters.
Deep learning diagram by chapter
As you complete each chapter, the diagram below fills in. This is the structure so far.
Y → H → X
Gradient in deep learning
The gradient tells you 'if you change a weight (parameter) slightly, how much and in which direction does the loss (error) change.' Think of it as a compass pointing toward 'which way to go to reduce error.'
Analogy: Imagine walking down a mountain blindfolded. You feel the slope (gradient) under your feet and step toward the downhill direction. Walking opposite to the gradient leads you to the valley (minimum loss). This is gradient descent.
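The blindfolded walk can be sketched in a few lines of Python. This is a toy example (not from the book): we minimize f(x) = x², whose gradient 2x points uphill, by repeatedly stepping the opposite way.

```python
# Minimal gradient descent sketch: minimize f(x) = x^2.
# The gradient f'(x) = 2x points uphill, so we step against it.

def grad(x):
    return 2 * x  # derivative of x^2

x = 5.0   # start somewhere on the "mountain"
lr = 0.1  # step size (learning rate)
for _ in range(100):
    x -= lr * grad(x)  # walk opposite to the gradient

print(x)  # very close to 0, the bottom of the valley
```

Each step shrinks x by a constant factor (1 − 2·lr), so the walker settles into the valley at x = 0.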
Backpropagation passes gradients from the output back toward the input, one layer at a time. Using the chain rule from calculus, it efficiently computes the gradient for every weight in every layer in one pass.
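The chain rule can be seen concretely in a tiny two-layer example (a toy sketch with made-up numbers, not the book's notation): the forward pass saves intermediate values, and the backward pass multiplies local derivatives layer by layer, from the loss back to the first weight.

```python
# Chain-rule backprop through a tiny two-layer function:
# x -> h = w1*x -> y = w2*h, loss = (y - t)^2

x, t = 2.0, 1.0
w1, w2 = 0.5, 0.3

# forward pass (keep intermediates for the backward pass)
h = w1 * x
y = w2 * h
loss = (y - t) ** 2

# backward pass: apply the chain rule one layer at a time
dy = 2 * (y - t)   # dloss/dy
dw2 = dy * h       # dloss/dw2 = dloss/dy * dy/dw2
dh = dy * w2       # dloss/dh  = dloss/dy * dy/dh
dw1 = dh * x       # dloss/dw1 = dloss/dh * dh/dw1

# sanity check against a numerical derivative of loss w.r.t. w1
eps = 1e-6
loss_shifted = (w2 * ((w1 + eps) * x) - t) ** 2
numeric = (loss_shifted - loss) / eps
```

One backward sweep yields the gradient for every weight (dw1 and dw2 here), which is exactly why backpropagation is efficient.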
AI training = looking at gradients and updating weights. Without gradients, there's no way to know 'which direction to adjust,' making learning impossible. The gradient is the heart of deep learning training.
Learning rate controls 'how far to step each time.' Too large → overshoot the valley (diverge); too small → takes forever to arrive. Optimizers like Adam automatically adjust the step size based on gradient magnitude.
If gradients get too large (gradient explosion), training becomes unstable; if they get too small (gradient vanishing), early layers barely learn. Techniques like gradient clipping, batch normalization, and skip connections are used to prevent this.
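Gradient clipping by global norm, one of the techniques mentioned above, can be sketched as follows (a minimal hand-rolled version, not a library API): if the gradient vector's norm exceeds a threshold, rescale it so the norm equals the threshold, keeping the direction.

```python
# Gradient clipping by global norm (sketch): rescale the gradient
# vector if its norm exceeds max_norm, preserving its direction.
import math

def clip_by_norm(grads, max_norm):
    norm = math.sqrt(sum(g * g for g in grads))
    if norm > max_norm:
        scale = max_norm / norm
        grads = [g * scale for g in grads]
    return grads

clipped = clip_by_norm([3.0, 4.0], max_norm=1.0)  # ||[3, 4]|| = 5, so rescale
```

The direction (the compass) is kept; only the step length is capped, which is what stabilizes training when gradients spike.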
Every trained AI model: ChatGPT, image recognition, recommendation systems—every model computes gradients to update weights. Forward pass → compute loss → backward pass for gradients → update weights. Repeating these 4 steps millions of times is training.
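The four steps can be written as a loop. This is a toy sketch with hypothetical data (a single weight fit to y = 2x), not any model's actual training code:

```python
# The four training steps as a loop: fit y = w*x to toy data
# whose true relationship is y = 2x.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w, lr = 0.0, 0.02

for _ in range(200):
    for x, y in data:
        pred = w * x             # 1. forward pass
        loss = (pred - y) ** 2   # 2. compute loss
        dw = 2 * (pred - y) * x  # 3. backward pass (gradient)
        w -= lr * dw             # 4. update weight
```

Real models run exactly this loop, just with billions of weights and automatic differentiation computing step 3.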
Forward and backward: the forward pass computes Z = W·X from input to output; the backward pass propagates the gradients dW and dX in the reverse direction. They always work as a pair.
Fine-tuning: When adapting ChatGPT for a specific use case, new data is used to compute gradients and slightly adjust weights. Thanks to gradients, a pre-trained model can quickly adapt to new purposes.
Problem format: The equation is either forward Z = W·X or backward dZ = dW·X. The blank (?) is one entry of X or one entry of Z (or dZ). W and dW are always fully given.
Forward (Z = W·X): Each entry of Z = dot product of one row of W with X. If the blank is in Z, multiply that row of W by X and sum. If the blank is in X, use the other Z entries and rows of W to set up an equation and solve for that X entry.
Backward (dZ = dW·X): Same computation as forward. Each entry of dZ = dot product of one row of dW with X. If the blank is in dZ, dot that row of dW with X. If the blank is in X, solve from the equation.
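The procedure above can be sketched with made-up numbers (not from the book's problem set). Each entry of Z is one row of W dotted with X; a blank in X is solved from the equation of a row whose Z entry is known.

```python
# Worked example of the problem format with made-up numbers.
# Forward: each entry of Z is one row of W dotted with X.
W = [[1, 2],
     [3, 4]]
X = [5, 6]

Z = [sum(w * x for w, x in zip(row, X)) for row in W]
# Z[0] = 1*5 + 2*6 = 17, Z[1] = 3*5 + 4*6 = 39

# If the blank were X[1] and Z[0] = 17 were given:
# 1*X[0] + 2*X[1] = 17  ->  X[1] = (17 - 1*5) / 2 = 6
x1 = (Z[0] - W[0][0] * X[0]) / W[0][1]
```

The backward case (dZ = dW·X) is the identical computation with the entries of dW in place of W.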
The gradient is a vector that shows the direction and rate of change of a function. To reduce loss, we update parameters in the opposite direction. Forward: Z = W·X; backward: dZ = dW·X.
Forward→Backward
Forward: Z = W·X (each row of W dotted with X)
Backward: dZ = dW·X (same structure, gradient values)
The same idea applies to linear layers, hidden layers, and so on.
In the problems, the blank (?) is one entry of X or one entry of Z (forward) / dZ (backward).
Problem
Fill in the blank (?) in Z = W·X or dZ = dW·X.
dZ = dW · X