Everyone's AI
Chapter 08

Partial Derivatives and Gradient: A World of Many Variables, the Direction of Gradient Descent

When a function has several variables, a partial derivative is the derivative with respect to one variable while the others are held fixed. The gradient is the vector of those partial derivatives, and it is the basis of gradient descent.


The slope when only x moves and the slope when only y moves are the partial derivatives. The gradient is the combined direction of those two.

[Diagram: at the point (1, 1), one arrow for the x-only slope, one for the y-only slope, and a diagonal arrow for the combined gradient ∇f]

Horizontal arrow = slope when only x changes; vertical = slope when only y changes. The diagonal is the gradient (combined) — the direction of steepest increase.

  • Horizontal arrow: slope when only x moves (with y fixed) → partial derivative ∂f/∂x
  • Vertical arrow: slope when only y moves (with x fixed) → partial derivative ∂f/∂y
  • Diagonal arrow: combined direction of the two partials → gradient ∇f (direction of steepest increase)

What are partial derivatives and the gradient?

For a function of several variables, the partial derivative is the derivative with respect to one variable with the others held constant. The gradient is the vector of all the partial derivatives. One key formula: ∇f = (∂f/∂x, ∂f/∂y).
Intuition: climbing a hill (height z, coordinates x, y), the slope when you step east (x) and the slope when you step north (y) can differ. Partial derivatives are those slopes. The gradient ∇f = (∂f/∂x, ∂f/∂y) is the vector that points in the steepest uphill direction.
Example: for f(x, y) = x² + y², differentiating in x only (with y constant) gives 2x; differentiating in y only gives 2y. So ∇f = (2x, 2y).
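The example above can be checked numerically: vary one variable at a time with the other held fixed, which is exactly the definition of a partial derivative. A minimal sketch using central finite differences (the step size h is an assumption chosen for illustration):

```python
def f(x, y):
    return x**2 + y**2

def numerical_gradient(f, x, y, h=1e-6):
    # Nudge one variable at a time, holding the other constant --
    # the definition of a partial derivative.
    df_dx = (f(x + h, y) - f(x - h, y)) / (2 * h)
    df_dy = (f(x, y + h) - f(x, y - h)) / (2 * h)
    return df_dx, df_dy

gx, gy = numerical_gradient(f, 1.0, 1.0)
print(gx, gy)  # close to (2.0, 2.0), matching ∇f = (2x, 2y) at (1, 1)
```
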
The gradient points in the direction of steepest increase, so moving opposite to it decreases the function fastest. Gradient descent therefore updates parameters in the direction opposite the gradient.
Example with numbers: for f(x, y) = x + 2y, when x increases by 1, f increases by 1 (the partial w.r.t. x is 1); when y increases by 1, f increases by 2 (the partial w.r.t. y is 2). So ∇f = (1, 2). On the hill, the slope in the y direction is twice that in the x direction, and the steepest climb is in the direction (1, 2).
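The "steepest climb" claim can be tested directly: for a unit step from the origin, the direction (1, 2)/√5 should raise f more than stepping along x alone (+1) or y alone (+2). A small sketch (the helper `rise` is an illustrative name, not from the text):

```python
import math

def f(x, y):
    return x + 2 * y

def rise(dx, dy):
    # Increase in f for one unit step in direction (dx, dy) from (0, 0).
    n = math.hypot(dx, dy)
    return f(dx / n, dy / n) - f(0.0, 0.0)

print(rise(1, 0))  # 1.0 -- x direction only
print(rise(0, 1))  # 2.0 -- y direction only
print(rise(1, 2))  # √5 ≈ 2.236 -- the gradient direction, the steepest
```
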
Why does this matter for AI? Because it is the learning principle of deep learning models. AI models have tens of millions or hundreds of millions of parameters (weights w₁, w₂, ...). We need to know which weight to adjust, and by how much, so that the error goes down, but reasoning about hundreds of millions of variables at once is too hard. Partial derivatives break the problem down: hold all other weights fixed and change only w₁; what is the effect?
The gradient is the instruction manual for all the weights. Once we have the vector ∇L that encodes "increase w₁ a bit, decrease w₂ a lot," and so on, the model can update hundreds of millions of weights in the right direction in one computation.
Partial derivatives and the gradient are the basic language of multi-variable optimization. Computing the gradient of the loss and moving one step at a time in the opposite direction is gradient descent; that is how AI finds a path toward the answer even in complex data.
They are the engine of gradient descent. It is like walking downhill with your eyes closed: you feel the slope under your feet and step in the direction that goes down most. The gradient points in the direction of fastest increase, so to reduce error we must move in the opposite direction. The update is new parameter = previous parameter − (learning rate × gradient); the minus sign is there because we want to move toward lower error.
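The update rule above can be sketched in a few lines, here applied to f(x, y) = x² + y² with ∇f = (2x, 2y). The starting point and learning rate are assumptions chosen for illustration:

```python
def grad_f(x, y):
    return 2 * x, 2 * y  # ∇f for f = x^2 + y^2

x, y = 3.0, 4.0   # start somewhere on the "hill"
lr = 0.1          # learning rate

for _ in range(100):
    gx, gy = grad_f(x, y)
    x -= lr * gx  # minus sign: step against the gradient,
    y -= lr * gy  # i.e. downhill toward lower values of f

print(x, y)  # both near 0 -- the minimum of x^2 + y^2
```

Each step multiplies the distance from the minimum by (1 − 0.1 × 2) = 0.8, so the point slides steadily down to (0, 0).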
| Situation | What we use |
| --- | --- |
| Reducing error | Take the partial derivative of the loss L w.r.t. each weight w (∂L/∂w) to see whether that weight is a main cause of error or not. |
| Finding the best direction | Form the gradient (all the partial derivatives) and move in the opposite direction to find the bottom of the "error valley." |
| Efficient large-scale training | SGD (stochastic gradient descent) uses a minibatch instead of the full data to get an approximate gradient and move quickly. |
| Multi-variable effect | In economics, when demand depends on both price and income, partial derivatives answer "if we hold income fixed and raise price only, what happens?" |
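The SGD row above can be sketched concretely: each step samples a small random minibatch instead of the full dataset, so the gradient is only approximate but cheap. The data, model (y = w·x with true slope 2), batch size, and learning rate below are all illustrative assumptions, not from the text:

```python
import random

random.seed(0)
data = [(x, 2.0 * x) for x in range(1, 21)]  # points on the line y = 2x

w = 0.0     # single weight to learn
lr = 0.001  # learning rate
for step in range(500):
    batch = random.sample(data, 4)  # a minibatch, not the full data
    # dL/dw for L = mean of (w*x - y)^2 over the batch: mean of 2*(w*x - y)*x
    grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
    w -= lr * grad  # step against the (approximate) gradient

print(w)  # close to the true slope 2.0
```

Each minibatch gives a slightly different gradient, but on average the steps point downhill, which is why SGD scales to datasets far too large to process at once.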
AI auto-training: When we call `loss.backward()` in PyTorch or TensorFlow, the system computes partial derivatives for all weights and gives us the gradient vector. Only with this gradient can the optimizer update the weights. From large language models like ChatGPT to image recognition, all modern AI gets smarter by following this gradient.
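Conceptually, `loss.backward()` plus an optimizer step compute the partial of the loss w.r.t. every weight, collect them into one gradient vector, and update all weights in a single pass. A toy stand-in for that idea, using central finite differences instead of a real autograd engine (the loss function here is an illustrative assumption):

```python
def loss(weights):
    # Toy loss: sum of (w_i - i)^2, minimized at weights = [0, 1, 2]
    return sum((w - i) ** 2 for i, w in enumerate(weights))

def gradient(loss, weights, h=1e-6):
    # One partial per weight: nudge w_i while all other weights stay fixed.
    grads = []
    for i in range(len(weights)):
        up = weights[:];   up[i] += h
        down = weights[:]; down[i] -= h
        grads.append((loss(up) - loss(down)) / (2 * h))
    return grads

weights = [5.0, 5.0, 5.0]
lr = 0.1
for _ in range(100):
    g = gradient(loss, weights)                           # like loss.backward()
    weights = [w - lr * gi for w, gi in zip(weights, g)]  # like optimizer.step()

print(weights)  # close to [0.0, 1.0, 2.0]
```

Real autograd computes these partials analytically via the chain rule rather than by nudging each weight, which is what makes the same idea feasible for hundreds of millions of parameters.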
For a partial derivative, treat only the variable you differentiate as the variable; treat the rest as constants. The gradient is the vector of the partial derivatives in order. Tip: ∂f/∂x means differentiate in x with y fixed.
Simplest example: f = 3x + 2y. Differentiating in x only, treat y as a constant → ∂f/∂x = 3. Differentiating in y only, treat x as a constant → ∂f/∂y = 2. So ∇f = (3, 2). At (1, 1) the gradient is still (3, 2), because both partials are constants.
The table below goes from easy to more varied examples. When differentiating one variable at a time, the same derivative rules from Ch06 apply.
| Problem | Solution |
| --- | --- |
| f = 3x + 2y, ∂f/∂x | y constant → 3 |
| f = 3x + 2y, ∂f/∂y | x constant → 2 |
| f = x²y, ∂f/∂x | y constant → 2xy |
| f = x² + y², ∇f | (2x, 2y) |
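The table's answers can be spot-checked with the same central finite-difference idea: vary one variable, hold the other constant. A sketch (step size h is an assumption):

```python
def partial_x(f, x, y, h=1e-6):
    # ∂f/∂x: vary x only, hold y constant.
    return (f(x + h, y) - f(x - h, y)) / (2 * h)

def partial_y(f, x, y, h=1e-6):
    # ∂f/∂y: vary y only, hold x constant.
    return (f(x, y + h) - f(x, y - h)) / (2 * h)

f1 = lambda x, y: 3 * x + 2 * y
f2 = lambda x, y: x**2 * y

print(partial_x(f1, 1.0, 1.0))  # ≈ 3, matching ∂f/∂x = 3
print(partial_y(f1, 1.0, 1.0))  # ≈ 2, matching ∂f/∂y = 2
print(partial_x(f2, 2.0, 3.0))  # ≈ 12, matching 2xy at (2, 3)
```
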
Problem types and how to solve
| Type | Description | How to get the answer |
| --- | --- | --- |
| Partial in x | ∂f/∂x | Treat y as constant, differentiate in x. Linear → coefficient of x; x²y → 2xy. |
| Partial in y | ∂f/∂y | Treat x as constant, differentiate in y. |
| Gradient | ∇f = (∂f/∂x, ∂f/∂y) | Vector of the two partials in order. At (a, b), substitute x = a, y = b. |

Example (partial in x)
For f = 3x + 2y, find ∂f/∂x and its value at (1, 1).
Solution
With y constant, ∂f/∂x = 3. At (1, 1) it is still 3. → Answer: 3

Example (partial in y)
For f = 3x + 2y, find ∂f/∂y and its value at (1, 1).
Solution
With x constant, ∂f/∂y = 2. At (1, 1) it is still 2. → Answer: 2

Example (gradient)
For f = x² + y², find ∇f and its value at (1, 2).
Solution
∂f/∂x = 2x and ∂f/∂y = 2y, so ∇f = (2x, 2y). At (1, 2) → (2, 4). → Answer: (2, 4)
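The worked example, checked in code: ∇f = (2x, 2y) for f = x² + y², evaluated at the point (1, 2).

```python
def grad_f(x, y):
    # Partials of f(x, y) = x^2 + y^2, in order
    return (2 * x, 2 * y)

print(grad_f(1, 2))  # (2, 4), as in the solution
```
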