Everyone's AI
Chapter 08

Partial Derivatives and Gradient: A World of Many Variables, the Direction of Gradient Descent

When a function has several variables, a partial derivative is the derivative with respect to one variable while the others are held fixed. The gradient is the vector of those partial derivatives, and it is the basis of gradient descent.


The slope when only x moves and the slope when only y moves are the partial derivatives. The gradient is the combined direction of those two.

[Diagram: at the point (1, 1), one arrow for the x-only slope, one for the y-only slope, and a diagonal arrow for the combined gradient ∇f]

Horizontal arrow = slope when only x changes; vertical = slope when only y changes. The diagonal is the gradient (combined) — the direction of steepest increase.

  • Horizontal arrow: slope when only x moves (with y fixed) → partial derivative ∂f/∂x
  • Vertical arrow: slope when only y moves (with x fixed) → partial derivative ∂f/∂y
  • Diagonal arrow: combined direction of the two partials → gradient ∇f (direction of steepest increase)

What are partial derivatives and the gradient?

For a function of several variables, the partial derivative is the derivative with respect to one variable with the others held constant. The gradient is the vector of all the partial derivatives. One key formula: ∇f = (∂f/∂x, ∂f/∂y).
Intuition: climbing a hill (height z, coordinates x, y), the slope when you step east (x) and the slope when you step north (y) can differ. Partial derivatives are those slopes. The gradient ∇f = (∂f/∂x, ∂f/∂y) is the vector that points in the steepest uphill direction.
Example: for f(x, y) = x² + y², differentiating in x only (with y constant) gives 2x; differentiating in y only gives 2y. So ∇f = (2x, 2y).
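The example above can be checked numerically: vary one variable at a time with the other held fixed, which is exactly the definition of a partial derivative. A minimal sketch using central finite differences (the step size h is an assumption chosen for illustration):

```python
def f(x, y):
    return x**2 + y**2

def numerical_gradient(f, x, y, h=1e-6):
    # Nudge one variable at a time, holding the other constant --
    # the definition of a partial derivative.
    df_dx = (f(x + h, y) - f(x - h, y)) / (2 * h)
    df_dy = (f(x, y + h) - f(x, y - h)) / (2 * h)
    return df_dx, df_dy

gx, gy = numerical_gradient(f, 1.0, 1.0)
print(gx, gy)  # close to (2.0, 2.0), matching ∇f = (2x, 2y) at (1, 1)
```
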
The gradient points in the direction of steepest increase, so moving opposite to it decreases the function fastest. Gradient descent therefore updates parameters in the direction opposite the gradient.
Example with numbers: for f(x, y) = x + 2y, when x increases by 1, f increases by 1 (the partial w.r.t. x is 1); when y increases by 1, f increases by 2 (the partial w.r.t. y is 2). So ∇f = (1, 2). On the hill, the slope in the y direction is twice that in the x direction, and the steepest climb is in the direction (1, 2).
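The "steepest climb" claim can be tested directly: for a unit step from the origin, the direction (1, 2)/√5 should raise f more than stepping along x alone (+1) or y alone (+2). A small sketch (the helper `rise` is an illustrative name, not from the text):

```python
import math

def f(x, y):
    return x + 2 * y

def rise(dx, dy):
    # Increase in f for one unit step in direction (dx, dy) from (0, 0).
    n = math.hypot(dx, dy)
    return f(dx / n, dy / n) - f(0.0, 0.0)

print(rise(1, 0))  # 1.0 -- x direction only
print(rise(0, 1))  # 2.0 -- y direction only
print(rise(1, 2))  # √5 ≈ 2.236 -- the gradient direction, the steepest
```
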
Why does this matter for AI? Because it is the learning principle of deep learning models. AI models have tens of millions or hundreds of millions of parameters (weights w₁, w₂, ...). We need to know which weight to adjust, and by how much, so that the error goes down, but reasoning about hundreds of millions of variables at once is too hard. Partial derivatives break the problem down: hold all other weights fixed and change only w₁; what is the effect?
The gradient is the instruction manual for all the weights. Once we have the vector ∇L that encodes "increase w₁ a bit, decrease w₂ a lot," and so on, the model can update hundreds of millions of weights in the right direction in one computation.
Partial derivatives and the gradient are the basic language of multi-variable optimization. Computing the gradient of the loss and moving one step at a time in the opposite direction is gradient descent; that is how AI finds a path toward the answer even in complex data.
They are the engine of gradient descent. It is like walking downhill with your eyes closed: you feel the slope under your feet and step in the direction that goes down most. The gradient points in the direction of fastest increase, so to reduce error we must move in the opposite direction. The update is new parameter = previous parameter − (learning rate × gradient); the minus sign is there because we want to move toward lower error.
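The update rule above can be sketched in a few lines, here applied to f(x, y) = x² + y² with ∇f = (2x, 2y). The starting point and learning rate are assumptions chosen for illustration:

```python
def grad_f(x, y):
    return 2 * x, 2 * y  # ∇f for f = x^2 + y^2

x, y = 3.0, 4.0   # start somewhere on the "hill"
lr = 0.1          # learning rate

for _ in range(100):
    gx, gy = grad_f(x, y)
    x -= lr * gx  # minus sign: step against the gradient,
    y -= lr * gy  # i.e. downhill toward lower values of f

print(x, y)  # both near 0 -- the minimum of x^2 + y^2
```

Each step multiplies the distance from the minimum by (1 − 0.1 × 2) = 0.8, so the point slides steadily down to (0, 0).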
| Situation | What we use |
| --- | --- |
| Reducing error | Take the partial derivative of the loss L w.r.t. each weight w (∂L/∂w) to see whether that weight is a main cause of error or not. |
| Finding the best direction | Form the gradient (all the partial derivatives) and move in the opposite direction to find the bottom of the "error valley." |
| Efficient large-scale training | SGD (stochastic gradient descent) uses a minibatch instead of the full data to get an approximate gradient and move quickly. |
| Multi-variable effect | In economics, when demand depends on both price and income, partial derivatives answer "if we hold income fixed and raise price only, what happens?" |
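The SGD row above can be sketched concretely: each step samples a small random minibatch instead of the full dataset, so the gradient is only approximate but cheap. The data, model (y = w·x with true slope 2), batch size, and learning rate below are all illustrative assumptions, not from the text:

```python
import random

random.seed(0)
data = [(x, 2.0 * x) for x in range(1, 21)]  # points on the line y = 2x

w = 0.0     # single weight to learn
lr = 0.001  # learning rate
for step in range(500):
    batch = random.sample(data, 4)  # a minibatch, not the full data
    # dL/dw for L = mean of (w*x - y)^2 over the batch: mean of 2*(w*x - y)*x
    grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
    w -= lr * grad  # step against the (approximate) gradient

print(w)  # close to the true slope 2.0
```

Each minibatch gives a slightly different gradient, but on average the steps point downhill, which is why SGD scales to datasets far too large to process at once.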
AI auto-training: When we call `loss.backward()` in PyTorch or TensorFlow, the system computes partial derivatives for all weights and gives us the gradient vector. Only with this gradient can the optimizer update the weights. From large language models like ChatGPT to image recognition, all modern AI gets smarter by following this gradient.
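Conceptually, `loss.backward()` plus an optimizer step compute the partial of the loss w.r.t. every weight, collect them into one gradient vector, and update all weights in a single pass. A toy stand-in for that idea, using central finite differences instead of a real autograd engine (the loss function here is an illustrative assumption):

```python
def loss(weights):
    # Toy loss: sum of (w_i - i)^2, minimized at weights = [0, 1, 2]
    return sum((w - i) ** 2 for i, w in enumerate(weights))

def gradient(loss, weights, h=1e-6):
    # One partial per weight: nudge w_i while all other weights stay fixed.
    grads = []
    for i in range(len(weights)):
        up = weights[:];   up[i] += h
        down = weights[:]; down[i] -= h
        grads.append((loss(up) - loss(down)) / (2 * h))
    return grads

weights = [5.0, 5.0, 5.0]
lr = 0.1
for _ in range(100):
    g = gradient(loss, weights)                           # like loss.backward()
    weights = [w - lr * gi for w, gi in zip(weights, g)]  # like optimizer.step()

print(weights)  # close to [0.0, 1.0, 2.0]
```

Real autograd computes these partials analytically via the chain rule rather than by nudging each weight, which is what makes the same idea feasible for hundreds of millions of parameters.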
For a partial derivative, treat only the variable you differentiate as the variable; treat the rest as constants. The gradient is the vector of the partial derivatives in order. Tip: ∂f/∂x means differentiate in x with y fixed.
Simplest example: f = 3x + 2y. Differentiating in x only, treat y as a constant → ∂f/∂x = 3. Differentiating in y only, treat x as a constant → ∂f/∂y = 2. So ∇f = (3, 2). At (1, 1) the gradient is still (3, 2), because both partials are constants.
The table below goes from easy to more varied examples. When differentiating one variable at a time, the same derivative rules from Ch06 apply.
| Problem | Solution |
| --- | --- |
| f = 3x + 2y, ∂f/∂x | y constant → 3 |
| f = 3x + 2y, ∂f/∂y | x constant → 2 |
| f = x²y, ∂f/∂x | y constant → 2xy |
| f = x² + y², ∇f | (2x, 2y) |
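The table's answers can be spot-checked with the same central finite-difference idea: vary one variable, hold the other constant. A sketch (step size h is an assumption):

```python
def partial_x(f, x, y, h=1e-6):
    # ∂f/∂x: vary x only, hold y constant.
    return (f(x + h, y) - f(x - h, y)) / (2 * h)

def partial_y(f, x, y, h=1e-6):
    # ∂f/∂y: vary y only, hold x constant.
    return (f(x, y + h) - f(x, y - h)) / (2 * h)

f1 = lambda x, y: 3 * x + 2 * y
f2 = lambda x, y: x**2 * y

print(partial_x(f1, 1.0, 1.0))  # ≈ 3, matching ∂f/∂x = 3
print(partial_y(f1, 1.0, 1.0))  # ≈ 2, matching ∂f/∂y = 2
print(partial_x(f2, 2.0, 3.0))  # ≈ 12, matching 2xy at (2, 3)
```
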
Problem types and how to solve
| Type | Description | How to get the answer |
| --- | --- | --- |
| Partial in x | ∂f/∂x | Treat y as constant, differentiate in x. Linear → coefficient of x; x²y → 2xy. |
| Partial in y | ∂f/∂y | Treat x as constant, differentiate in y. |
| Gradient | ∇f = (∂f/∂x, ∂f/∂y) | Vector of the two partials in order. At (a, b), substitute x = a, y = b. |

Example (partial in x)
For f = 3x + 2y, find ∂f/∂x and its value at (1, 1).
Solution
With y constant, ∂f/∂x = 3. At (1, 1) it is still 3. → Answer: 3

Example (partial in y)
For f = 3x + 2y, find ∂f/∂y and its value at (1, 1).
Solution
With x constant, ∂f/∂y = 2. At (1, 1) it is still 2. → Answer: 2

Example (gradient)
For f = x² + y², find ∇f and its value at (1, 2).
Solution
∂f/∂x = 2x and ∂f/∂y = 2y, so ∇f = (2x, 2y). At (1, 2) → (2, 4). → Answer: (2, 4)
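The worked example, checked in code: ∇f = (2x, 2y) for f = x² + y², evaluated at the point (1, 2).

```python
def grad_f(x, y):
    # Partials of f(x, y) = x^2 + y^2, in order
    return (2 * x, 2 * y)

print(grad_f(1, 2))  # (2, 4), as in the solution
```
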