Chapter 08

Partial Derivative & Gradient

When a function has several variables, a partial derivative is the derivative with respect to one variable while the others are held fixed. The gradient is the vector of those partial derivatives, and it is the basis of gradient descent.


Diagram: the horizontal arrow is the slope when only x changes and the vertical arrow is the slope when only y changes; these are the two partial derivatives. The diagonal arrow is their combined direction, the gradient, which points toward the steepest increase.

  • Horizontal arrow: slope when only x moves (with y fixed) → the partial derivative ∂f/∂x
  • Vertical arrow: slope when only y moves (with x fixed) → the partial derivative ∂f/∂y
  • Diagonal arrow: combined direction of the two partials → the gradient ∇f (the direction of steepest increase)
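The two arrows can be checked numerically with finite differences: nudge one input while holding the other fixed. This is a minimal sketch; the function f(x, y) = x² + 3y², the point (1, 2), and the nudge size h are made-up illustration choices, not values from the chapter.

```python
# Finite-difference estimate of the two partial derivatives (the gradient).
# f is a made-up example function; h is the size of the small nudge.

def f(x, y):
    return x**2 + 3 * y**2

def gradient(x, y, h=1e-6):
    # Partial w.r.t. x: nudge only x, keep y fixed.
    df_dx = (f(x + h, y) - f(x - h, y)) / (2 * h)
    # Partial w.r.t. y: nudge only y, keep x fixed.
    df_dy = (f(x, y + h) - f(x, y - h)) / (2 * h)
    return (df_dx, df_dy)

gx, gy = gradient(1.0, 2.0)
print(gx, gy)  # analytically: df/dx = 2x = 2, df/dy = 6y = 12
```

The two numbers returned are the horizontal-arrow and vertical-arrow slopes; together they form the diagonal arrow ∇f.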

What are partial derivatives and the gradient?

For a function of several variables, a partial derivative is the derivative with respect to one variable with the others held constant. The gradient is the vector of all the partial derivatives. One key formula: ∇f = (∂f/∂x, ∂f/∂y).
In deep learning, the loss is a function of many weights. Training means computing how the loss changes when we change each weight a little, then updating weights in the direction that reduces the loss.
The gradient is exactly the vector of those partial derivatives. With thousands or millions of weights, we need partial derivatives (one variable at a time), and backprop (the Ch07 chain rule) computes the whole gradient efficiently in one pass. Building on Ch06 and Ch07, extending to several variables here gives you gradient descent and SGD.
Partial derivatives and the gradient are the language of multi-variable optimization. The gradient components are the derivatives of the loss w.r.t. each weight; new parameter = previous − learning rate × gradient is how we update each step. This leads naturally into Ch09 (integral).
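The update rule "new parameter = previous − learning rate × gradient" can be run by hand. A minimal one-variable sketch, assuming a made-up loss f(w) = (w − 3)², starting point, and learning rate:

```python
# Gradient descent on the made-up loss f(w) = (w - 3)**2.
# Its derivative is df/dw = 2*(w - 3); the minimum is at w = 3.

def grad(w):
    return 2 * (w - 3)

w = 0.0    # starting parameter (arbitrary)
lr = 0.1   # learning rate (step size)
for _ in range(100):
    w = w - lr * grad(w)   # new = previous - learning rate * gradient

print(round(w, 4))  # approaches 3.0, the minimizer
```

Each step moves opposite to the gradient, so the loss shrinks; a learning rate that is too large would overshoot instead.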
We use partial derivatives when we want the rate of change of a function when only one input changes. Gradient descent moves a little opposite to the gradient to decrease the function (e.g. loss). In economics (demand depending on price and income) or physics (pressure, temperature, volume), partial derivatives tell us the effect of changing one factor at a time.
Situation → what we use:

  • Minimizing loss: "If we nudge this weight, does the loss go up or down?" is the partial derivative w.r.t. that weight. The vector of those values is the gradient.
  • One step of gradient descent: new parameter = previous − (learning rate × gradient). You move one step in the direction that decreases loss (opposite to the gradient).
  • Training with small chunks of data: instead of using the whole dataset at once, you take a small batch (minibatch), compute the gradient, then update parameters once. Repeating this speeds up training. (This is often called SGD.)
  • When the outcome depends on both x and y: "If we nudge only x, how much does it change?" is the partial derivative w.r.t. x. For y only, the partial w.r.t. y works the same way.
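The minibatch idea can be sketched in a few lines. This is a toy illustration, not a real training loop: the dataset (y = 2x with no noise), batch size, epoch count, and learning rate are all made-up choices.

```python
import random

# Minibatch SGD sketch: fit y = w * x to toy data whose true w is 2.
data = [(x, 2.0 * x) for x in range(1, 11)]   # (input, target) pairs

w = 0.0
lr = 0.005
random.seed(0)
for epoch in range(50):
    random.shuffle(data)
    for i in range(0, len(data), 2):          # minibatches of size 2
        batch = data[i:i + 2]
        # Mean squared loss over the batch: mean((w*x - y)**2);
        # its derivative w.r.t. w is mean(2 * (w*x - y) * x).
        g = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
        w = w - lr * g                         # one update per minibatch

print(round(w, 3))  # converges toward 2.0
```

Each pass over the shuffled data makes several cheap updates (one per minibatch) instead of one expensive full-dataset update.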
In AI training, PyTorch and TensorFlow compute the gradient automatically via backprop. We only need to know that the gradient is the vector of partial derivatives and that gradient descent moves opposite to it. Image classification (object recognition), language models (e.g. ChatGPT), recommendation (Netflix, YouTube), translation, and speech recognition all build on this. Even with millions of weights, we get the gradient and update one step at a time in the opposite direction.
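What those frameworks automate is the Ch07 chain rule applied backwards through the computation. A hand-rolled sketch for a single weight, using made-up numbers for x, y, and w:

```python
# Hand-rolled version of what autograd does for loss = (w*x - y)**2:
# run the forward pass, then apply the chain rule backwards for dloss/dw.

x, y = 2.0, 10.0   # one made-up training example
w = 3.0            # current weight

# Forward pass
pred = w * x       # prediction
diff = pred - y    # error
loss = diff ** 2   # squared loss

# Backward pass (chain rule, factor by factor)
dloss_ddiff = 2 * diff   # d(diff**2)/d(diff)
ddiff_dpred = 1.0        # d(pred - y)/d(pred)
dpred_dw = x             # d(w*x)/dw
dloss_dw = dloss_ddiff * ddiff_dpred * dpred_dw

print(dloss_dw)  # 2*(w*x - y)*x = 2*(6 - 10)*2 = -16.0
```

The negative sign says increasing w would decrease the loss here, so a gradient-descent step would increase w.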
For a partial derivative, treat the variable you are differentiating with respect to as the only variable, and treat the rest as constants. The gradient is the vector of the partial derivatives, in order. Tip: ∂f/∂x means differentiate in x with y fixed.
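The "treat the rest as constants" rule can be sanity-checked against a numeric nudge. For the made-up example f(x, y) = x²y, differentiating in x with y fixed gives 2xy, and differentiating in y with x fixed gives x²:

```python
# Analytic partials via "treat the rest as constants", checked numerically.
# f(x, y) = x**2 * y is a made-up example function.

def f(x, y):
    return x**2 * y

x, y, h = 2.0, 3.0, 1e-6
num_dx = (f(x + h, y) - f(x - h, y)) / (2 * h)   # numeric partial w.r.t. x
num_dy = (f(x, y + h) - f(x, y - h)) / (2 * h)   # numeric partial w.r.t. y

print(num_dx, 2 * x * y)   # both close to 12.0  (analytic: 2xy)
print(num_dy, x**2)        # both close to 4.0   (analytic: x**2)
```

The numeric and analytic values agree, which is exactly what the one-variable-at-a-time rule promises.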