Chapter 08
Partial Derivatives and Gradient: A World of Many Variables, the Direction of Gradient Descent
When a function has several variables, a partial derivative is the derivative with respect to one variable while the others are held fixed. The gradient is the vector of those partial derivatives, and it is the basis of gradient descent.
The slope when only x moves and the slope when only y moves are the partial derivatives; the gradient is their combined direction, the direction of steepest increase.
- Horizontal arrow: slope when only x changes (with y fixed) → partial derivative ∂f/∂x
- Vertical arrow: slope when only y changes (with x fixed) → partial derivative ∂f/∂y
- Diagonal arrow: combined direction of the two partials → gradient ∇f (direction of steepest increase)
What are partial derivatives and the gradient?
For a function of several variables, the partial derivative is the derivative w.r.t. one variable with the others held constant. The gradient is the vector of all partial derivatives. One key formula: ∇f = (∂f/∂x, ∂f/∂y).
Intuition: Climbing a hill (height f, coordinates (x, y)): the slope when you step east (the x direction) and when you step north (the y direction) can differ. Partial derivatives are those slopes. The gradient is the vector that points in the steepest uphill direction.
Example: for f(x, y) = 3x + 2y, differentiating in x only (with y constant) gives 3; in y only gives 2. So ∇f = (3, 2).
The gradient points in the direction of steepest increase. Moving opposite to it decreases the function fastest. So gradient descent updates parameters in the opposite direction of the gradient.
Example with numbers: For f(x, y) = x + 2y, when x increases by 1, f increases by 1 (the partial w.r.t. x is 1); when y increases by 1, f increases by 2 (the partial w.r.t. y is 2). So ∇f = (1, 2). On the hill, the slope in the y direction is twice that in the x direction; the steepest climb is in the direction (1, 2).
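The numbers above can be checked with a small finite-difference sketch in plain Python (the helper name `partials` is illustrative, not from any library):

```python
def f(x, y):
    # f(x, y) = x + 2y from the example above
    return x + 2 * y

def partials(f, x, y, h=1e-6):
    # Central differences: nudge one variable, hold the other fixed
    dfdx = (f(x + h, y) - f(x - h, y)) / (2 * h)
    dfdy = (f(x, y + h) - f(x, y - h)) / (2 * h)
    return dfdx, dfdy

print(partials(f, 0.0, 0.0))  # approximately (1.0, 2.0), i.e. the gradient (1, 2)
```

Because f is linear, the same (1, 2) comes out at every point you try.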
Because it is the learning principle of deep learning models. AI models have tens of millions or hundreds of millions of parameters (weights). We need to know "which weight to adjust, and by how much, so that the error goes down," but thinking about hundreds of millions of variables at once is too hard. Partial derivatives let us break this down: "hold all other weights fixed and change only this one weight; what is the effect?"
The gradient is the instruction manual for all weights. Once we have the vector that encodes "increase this weight a bit, decrease that one a lot," and so on, the AI can update hundreds of millions of weights in the right direction in a single computation.
Partial derivatives and the gradient are the basic language of multi-variable optimization. Finding the gradient of the loss and moving one step at a time in the opposite direction is gradient descent; that is how AI finds a path toward the answer even in complex data.
They are the engine of gradient descent. It is like walking downhill with your eyes closed, feeling the slope under your feet and stepping in the direction that drops the most. The gradient points in the direction of fastest increase, so to reduce error we must go in the opposite direction (minus). The update rule is: new parameter = previous parameter − (learning rate × gradient). The minus sign is there because we want to move toward lower error.
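Here is a minimal sketch of that update rule in plain Python, using f(x, y) = x² + y² as the "error valley" (its minimum is at (0, 0); the function names are illustrative):

```python
def grad_f(x, y):
    # Gradient of f(x, y) = x**2 + y**2 is (2x, 2y)
    return 2 * x, 2 * y

def gradient_descent(x, y, lr=0.1, steps=50):
    for _ in range(steps):
        gx, gy = grad_f(x, y)
        # new parameter = previous parameter - learning_rate * gradient
        x, y = x - lr * gx, y - lr * gy
    return x, y

print(gradient_descent(3.0, 4.0))  # both coordinates end up very close to 0
```

Each step multiplies both coordinates by (1 − 2·lr) = 0.8, so after 50 steps the point has slid almost all the way to the bottom of the valley.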
| Situation | What we use |
|---|---|
| Reducing error | Take the partial derivative of the loss w.r.t. each weight (∂L/∂w) to see whether that weight is a main cause of the error. |
| Finding the best direction | Form the gradient (all partial derivatives) and move in the opposite direction to find the bottom of the "error valley." |
| Efficient large-scale training | SGD (stochastic gradient descent) uses a minibatch instead of the full data to get an approximate gradient and move quickly. |
| Multi-variable effect | In economics, when demand depends on both price and income, we use partial derivatives to ask "if we hold income fixed and raise price only, what happens?" |
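The SGD row can be illustrated with a toy one-weight model in plain Python (the data and helper names are made up for this sketch): the minibatch gradient is only an estimate, but it points downhill the same way the full gradient does.

```python
import random

# Toy data for a one-weight model y_hat = w * x; the true weight is 3
xs = list(range(1, 101))
ys = [3.0 * x for x in xs]

def full_gradient(w):
    # Exact gradient of mean squared error over ALL the data
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def minibatch_gradient(w, batch_size=10):
    # SGD idea: approximate the same gradient from a small random sample
    batch = random.sample(list(zip(xs, ys)), batch_size)
    return sum(2 * (w * x - y) * x for x, y in batch) / batch_size

random.seed(0)
# Starting from w = 0, both gradients are negative: move w up, toward 3
print(full_gradient(0.0) < 0, minibatch_gradient(0.0) < 0)
```

The minibatch version touches 10 points instead of 100, which is why SGD scales to datasets too large to process in one pass.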
AI auto-training: When we call `loss.backward()` in PyTorch or TensorFlow, the system computes partial derivatives for all weights and gives us the gradient vector. Only with this gradient can the optimizer update the weights. From large language models like ChatGPT to image recognition, all modern AI gets smarter by following this gradient.
For a partial derivative, treat only the variable you differentiate with respect to as a variable; treat the rest as constants. The gradient is the vector of the partial derivatives in order: ∇f = (∂f/∂x, ∂f/∂y). Tip: ∂f/∂x means differentiate in x with y fixed.
Simplest example: f(x, y) = 3x + 2y. When differentiating in x only, treat 2y as a constant → 3. When differentiating in y only, treat 3x as a constant → 2. So ∇f = (3, 2). At the point (1, 1) the gradient is still (3, 2), because the gradient of a linear function is constant.
The table below moves from easy to more varied examples. Since you differentiate one variable at a time, the same derivative rules from Ch06 apply.
| Problem | Solution |
|---|---|
| f(x, y) = 3x + 2y, find ∂f/∂x | Treat 2y as a constant → 3 |
| f(x, y) = 3x + 2y, find ∂f/∂y | Treat 3x as a constant → 2 |
| f(x, y) = x² + y², find ∂f/∂x | Treat y² as a constant → 2x |
| f(x, y) = x² + y², find ∇f | ∇f = (2x, 2y) |
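Each row of the table can be double-checked numerically with a small finite-difference helper in plain Python (`num_partial` is an illustrative name, not a library function):

```python
def num_partial(f, x, y, wrt, h=1e-6):
    # Numerical partial derivative: nudge only the chosen variable
    if wrt == "x":
        return (f(x + h, y) - f(x - h, y)) / (2 * h)
    return (f(x, y + h) - f(x, y - h)) / (2 * h)

f1 = lambda x, y: 3 * x + 2 * y   # linear function from the table
f2 = lambda x, y: x**2 + y**2     # quadratic function from the table

print(round(num_partial(f1, 1.0, 1.0, "x")))  # 3
print(round(num_partial(f1, 1.0, 1.0, "y")))  # 2
print(round(num_partial(f2, 1.0, 1.0, "x")))  # 2, i.e. 2x evaluated at x = 1
```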
Problem types and how to solve
| Type | Description | How to get the answer |
|---|---|---|
| Partial in x | ∂f/∂x | Treat y as a constant, differentiate in x. A linear term → the coefficient of x; x² → 2x. |
| Partial in y | ∂f/∂y | Treat x as a constant, differentiate in y. |
| Gradient | ∇f = (∂f/∂x, ∂f/∂y) | Vector of the two partials in order. To evaluate at a point, substitute the x and y values. |
Example (partial in x)
For f(x, y) = 3x + 2y, find ∂f/∂x and its value at (1, 1).
Solution
With y constant, ∂f/∂x = 3. At (1, 1) it is still 3. → Answer: 3
Example (partial in y)
For f(x, y) = 3x + 2y, find ∂f/∂y and its value at (1, 1).
Solution
With x constant, ∂f/∂y = 2. At (1, 1) it is still 2. → Answer: 2
Example (gradient)
For f(x, y) = x² + y², find ∇f and the gradient at (1, 2).
Solution
∂f/∂x = 2x and ∂f/∂y = 2y, so ∇f = (2x, 2y). At (1, 2) → (2·1, 2·2) = (2, 4). → Answer: (2, 4), i.e., components 2 and 4
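The last worked example is a one-liner in plain Python (the function name is illustrative):

```python
def grad_f(x, y):
    # ∇f = (2x, 2y) for f(x, y) = x**2 + y**2
    return (2 * x, 2 * y)

print(grad_f(1, 2))  # (2, 4)
```

Unlike the linear examples, this gradient changes from point to point, so the substitution step matters.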