Chapter 07
Chain Rule
When you differentiate a function nested inside another, multiply the outer derivative by the inner derivative. That's the core of backprop.
A nested function is a chain: x → inner → outer → y. Multiply the outer derivative by the inner derivative to get the total derivative.
Example: calculation order, step by step
1. Example: write the nested function as y = f(g(x)), with an inner function u = g(x) and an outer function y = f(u). Differentiate y with respect to x.
2. ① Inner derivative: u = g(x) → derivative w.r.t. x is du/dx = g'(x)
3. ② Outer derivative: y = f(u) → derivative w.r.t. u is dy/du = f'(u)
4. ③ Multiply: dy/dx = (dy/du) × (du/dx) → answer f'(g(x)) × g'(x)
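The three steps above can be checked numerically. The inner and outer functions here are illustrative choices of mine (g(x) = x² + 1, f(u) = u³), not the chapter's original graphed example:

```python
# Chain rule check with an illustrative function pair (my choice, not the
# chapter's original example): inner g(x) = x**2 + 1, outer f(u) = u**3.

def g(x):           # inner function
    return x**2 + 1

def f(u):           # outer function
    return u**3

def dg(x):          # step 1: inner derivative g'(x) = 2x
    return 2 * x

def df(u):          # step 2: outer derivative f'(u) = 3u^2
    return 3 * u**2

def chain_derivative(x):
    # step 3: multiply along the chain, dy/dx = f'(g(x)) * g'(x)
    return df(g(x)) * dg(x)

def numeric_derivative(x, h=1e-6):
    # finite-difference derivative of the composite, for comparison
    return (f(g(x + h)) - f(g(x - h))) / (2 * h)

x = 2.0
print(chain_derivative(x))   # f'(5) * g'(2) = 75 * 4 = 300.0
print(numeric_derivative(x)) # should agree closely
```

The finite-difference line is a handy sanity check: if your multiplied derivatives disagree with it, one of the local derivatives is wrong.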
Moving along the chain, the rates multiply at each link. Backprop is the same: multiply at each step.
What is the chain rule?
In one sentence: a composite function is one function nested inside another; you put x into the inner function, then put that result into the outer function to get the final output. The chain rule is how we differentiate such nested functions.
In deep learning, many layers are stacked. Input goes through layer 1, then 2, … and finally the loss. We need to differentiate the loss with respect to each weight.
Backpropagation sends derivatives backward from the loss to the input, step by step. At each step we multiply the incoming derivative by that step’s local derivative. That multiplication is the chain rule.
So the chain rule is the backbone of backprop. If you know derivatives (Ch06), here you just learn to apply them to nested functions.
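The backward pass described above can be sketched in a few lines of plain Python. The two-step "network" and all numbers here are hypothetical, chosen only to show one local-derivative multiplication per step:

```python
# Tiny two-step chain: the loss L depends on y, and y depends on the weight w.
# Forward:  y = w * x        (one "layer")
#           L = (y - t)**2   (squared-error loss)
# Backward: multiply local derivatives, moving from the loss toward the weight.

x, t = 3.0, 12.0   # illustrative input and target
w = 2.0            # the weight we want to differentiate with respect to

# forward pass
y = w * x          # y = 6.0
L = (y - t) ** 2   # L = 36.0

# backward pass: chain rule, one multiplication per step
dL_dy = 2 * (y - t)     # local derivative of the loss w.r.t. y
dy_dw = x               # local derivative of the layer w.r.t. w
dL_dw = dL_dy * dy_dw   # chain rule: dL/dw = dL/dy * dy/dw

print(dL_dw)  # 2 * (6 - 12) * 3 = -36.0
```

With more layers, the backward pass simply keeps multiplying: each layer multiplies the incoming derivative by its own local derivative and passes the product further back.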
In general whenever one thing depends on another in a chain, the total rate of change is found by multiplying the rates along the chain. The table below shows examples from various areas.
| Situation | What we find | Chain rule (total rate) |
|---|---|---|
| Cost depends on output, output on time | How fast cost changes w.r.t. time | (cost/output) × (output/time) |
| Balloon radius changes with time | How fast volume changes w.r.t. time | (volume/radius) × (radius/time) |
| Velocity depends on position, position on time | Link to acceleration | (velocity/position) × (position/time) |
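The balloon row of the table can be turned into a short calculation. The radius schedule r(t) = 0.5·t is an assumption of mine for illustration; only the chain-rule structure comes from the table:

```python
import math

# Balloon example: volume depends on radius, radius depends on time.
# V = (4/3) * pi * r**3, and (illustrative assumption) r(t) = 0.5 * t.

def dV_dr(r):
    return 4 * math.pi * r**2   # derivative of V w.r.t. r

def dr_dt(t):
    return 0.5                  # derivative of r(t) = 0.5 * t w.r.t. t

def dV_dt(t):
    r = 0.5 * t
    # chain rule: (volume/radius) x (radius/time)
    return dV_dr(r) * dr_dt(t)

print(dV_dt(4.0))  # at t = 4, r = 2: 4*pi*4 * 0.5 = 8*pi ≈ 25.13
```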
In AI training, the loss goes through many layers, so we use the chain rule at each layer when differentiating with respect to each weight. After this chapter you can move on to Ch08 partial derivatives and gradient.
For a nested function, treat the inner part as one block, then multiply the derivative of the outer function (with respect to that block) by the derivative of the inner. If the inner is itself nested, repeat. Tip: set the inner part equal to u, differentiate the outer function with respect to u only, then multiply by du/dx, the derivative of the inner with respect to x.
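When the inner part is itself nested, the same multiply step simply repeats once per block. A sketch with a three-deep chain (the functions are illustrative choices of mine):

```python
import math

# Three-deep chain: y = sin((2x + 1)**2)
# Break into blocks:  a = 2x + 1,  b = a**2,  y = sin(b)
# Chain rule, repeated: dy/dx = (dy/db) * (db/da) * (da/dx)

def dy_dx(x):
    a = 2 * x + 1                      # innermost block, da/dx = 2
    b = a ** 2                         # middle block,    db/da = 2a
    return math.cos(b) * (2 * a) * 2   # outermost block, dy/db = cos(b)

def numeric(x, h=1e-6):
    # finite-difference check of the whole composite
    f = lambda x: math.sin((2 * x + 1) ** 2)
    return (f(x + h) - f(x - h)) / (2 * h)

x = 0.3
print(abs(dy_dx(x) - numeric(x)) < 1e-5)  # the two derivatives agree
```

Each extra level of nesting adds exactly one more factor to the product, which is why backprop through many layers is just a long multiplication.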