Chapter 11

Softmax (Turn into Probabilities)

A function that turns a vector into a probability distribution (values in [0,1], sum 1).

Deep learning diagram by chapter

As you complete each chapter, the diagram below fills in. This is the structure so far.

Softmax: score → probability (example: e ≈ 3)

| Score | Raise to the power ($3^Z$) | Divide by sum (probability) |
|-------|----------------------------|-----------------------------|
| 3 | 3³ = 27 | 27 ÷ 31 = 27/31 |
| 1 | 3¹ = 3 | 3 ÷ 31 = 3/31 |
| 0 | 3⁰ = 1 | 1 ÷ 31 = 1/31 |

Softmax in deep learning

Softmax is a function that converts multiple scores (numbers) into probabilities. All values become between 0 and 1, and they sum to exactly 1. So you can read them as probabilities.

The formula is $\mathrm{softmax}(x_i) = \frac{e^{x_i}}{\sum_{j} e^{x_j}}$. Because it uses powers of e (≈ 2.718), the largest score is amplified significantly while the others shrink relatively. The gap between first and second place becomes more pronounced.

Example: scores [3, 1, 0] → e³ ≈ 20.1, e¹ ≈ 2.7, e⁰ = 1 → sum ≈ 23.8 → probabilities ≈ [0.84, 0.11, 0.04]. The score of 3 was only 3× larger than 1, but its probability is about 8× larger!
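The calculation above can be sketched in a few lines of Python (using the real e via `math.exp`, not the e ≈ 3 shortcut from the diagram):

```python
import math

def softmax(scores):
    """Convert a list of scores into probabilities that sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [v / total for v in exps]

probs = softmax([3, 1, 0])
print([round(p, 2) for p in probs])  # → [0.84, 0.11, 0.04]
```

Note that softmax only depends on the gaps between scores: adding the same constant to every score leaves the output unchanged.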

Softmax is used at the final layer of almost every classification model. 'This photo is 70% dog, 25% cat, 5% bird' lets you see per-class probabilities and how confident the model is.

When combined with cross-entropy loss during training, the gradients work out cleanly and stably. The model naturally learns to 'increase the correct class probability and decrease the rest.'
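A minimal sketch of why the gradients are clean: with cross-entropy loss on a one-hot target, the gradient of the loss with respect to the logits simplifies to "probabilities minus target" (the values and target here are made up for illustration):

```python
import math

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [v / total for v in exps]

logits = [3.0, 1.0, 0.0]
target = [1, 0, 0]  # one-hot: class 0 is the correct class
p = softmax(logits)

# Softmax + cross-entropy gradient w.r.t. the logits: simply (p - y).
grad = [pi - yi for pi, yi in zip(p, target)]
# Negative for the correct class (its score gets pushed up),
# positive for the others (their scores get pushed down).
```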

Softmax's property of 'all positive values that sum to 1' exactly matches the definition of a probability distribution. This makes it the most natural way to convert scores to probabilities, both statistically and theoretically.

Image classification: The model's final layer outputs scores (logits) like [5.2, 2.1, 0.8, ...]. Softmax converts them to [0.70, 0.25, 0.05, ...]—probabilities for each class. The highest probability class is the final answer.

Chatbots & translators: When ChatGPT picks the next word, it scores every word in its vocabulary (tens of thousands!), converts to probabilities via softmax, and samples a word based on those probabilities. High-probability words appear often, but occasionally low-probability words are picked for variety.
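The sampling step can be sketched like this (a toy three-word vocabulary and made-up scores stand in for the tens of thousands a real model uses):

```python
import math
import random

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [v / total for v in exps]

vocab  = ["cat", "dog", "bird"]   # toy vocabulary
scores = [2.0, 1.0, -1.0]         # hypothetical scores for the next word
probs  = softmax(scores)

# Sample one word; high-probability words are picked more often,
# but low-probability words still have a chance (variety).
word = random.choices(vocab, weights=probs, k=1)[0]
```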

Attention mechanism: In translators, relevance scores for 'which input words to focus on' are passed through softmax to become probabilities (weights). These weights create a weighted average of inputs that emphasizes the most relevant parts.
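The weighting step can be sketched as follows (the relevance scores and 1-D "values" are made up for illustration; real attention uses vectors):

```python
import math

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [v / total for v in exps]

relevance = [4.0, 1.0, 0.5]   # hypothetical scores: how relevant each input word is
values    = [10.0, 2.0, -3.0] # hypothetical per-word values to be averaged

weights = softmax(relevance)  # attention weights: positive, sum to 1
# Weighted average: the most relevant word (index 0) dominates the result.
context = sum(w * v for w, v in zip(weights, values))
```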

Computation order: ① compute $Z = W \cdot X + b$ (logits); ② compute $e^Z$ (the problem uses $e \approx 3$); ③ compute $\Sigma$ (sum) by adding all the $e^Z$ values; ④ compute $Y = \frac{e^Z}{\Sigma}$ (divide each by the sum). Follow this order.

Finding blanks: if $Y$ is blank, compute that $e^Z \div \Sigma$. If $e^Z$ is blank, compute $Y \times \Sigma$. If $Z$ is blank, work backward from $e^Z$. If $\Sigma$ is blank, just add all the $e^Z$ values.

Verification: after computing, check that all $Y$ values are between 0 and 1 and that they sum to 1. If not, there is a calculation error. Also confirm whether the problem uses $e \approx 3$ or $e \approx 2.718$.
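The verification step can be automated with a small helper (a sketch; the name `check_softmax_output` is made up here):

```python
def check_softmax_output(probs, tol=1e-6):
    """Verify the two properties every softmax output must satisfy."""
    in_range = all(0.0 <= p <= 1.0 for p in probs)
    sums_to_one = abs(sum(probs) - 1.0) < tol
    return in_range and sums_to_one

print(check_softmax_output([27/31, 3/31, 1/31]))  # → True: a valid distribution
print(check_softmax_output([0.7, 0.4]))           # → False: sums to 1.1
```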

Softmax turns numbers into values between 0 and 1 that sum to 1. Compute $Z = W \cdot X + b$, then $e^Z$, then divide each value by the sum ($\Sigma$) to get probabilities.

Score ($Z$) → $3^Z$ → divide by sum → probability ($Y$)

| Quantity | Values |
|----------|--------|
| $X$ | [1, 1] |
| $W$ | [[1, 1], [0, 1]] |
| $b$ | [1, 1] |
| $Z = W \cdot X + b$ | [3, 2] |
| $3^Z$ | [27, 9] |
| $\Sigma$ | 36 |
| $Y$ | [3/4, 1/4] |

Often used in the final layer for multi-class classification.

Example: step-by-step calculation

$Z_1 = 1 \cdot 1 + 1 \cdot 1 + 1 = 3$
$Z_2 = 0 \cdot 1 + 1 \cdot 1 + 1 = 2$
$3^{Z_1} = 3^3 = 27$
$3^{Z_2} = 3^2 = 9$
$\Sigma = 27 + 9 = 36$
$Y_1 = \frac{3^{Z_1}}{\Sigma} = \frac{27}{36} = \frac{3}{4}$
$Y_2 = \frac{3^{Z_2}}{\Sigma} = \frac{9}{36} = \frac{1}{4}$
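The same step-by-step calculation, written out in Python (using `Fraction` so the exact 3/4 and 1/4 survive, and the chapter's e ≈ 3 shortcut so $e^Z$ becomes $3^Z$):

```python
from fractions import Fraction

X = [1, 1]
W = [[1, 1], [0, 1]]
b = [1, 1]

# Step 1: Z = W · X + b (one dot product per row of W)
Z = [sum(w * x for w, x in zip(row, X)) + bi for row, bi in zip(W, b)]
# Step 2: e^Z with e ≈ 3, i.e. 3^Z
powers = [3 ** z for z in Z]
# Step 3: Σ = sum of all 3^Z values
sigma = sum(powers)
# Step 4: Y = 3^Z / Σ
Y = [Fraction(p, sigma) for p in powers]

print(Z, powers, sigma, Y)  # → [3, 2] [27, 9] 36 [Fraction(3, 4), Fraction(1, 4)]
```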

Problem

Compute $Z = W \cdot X + b$, then $e^Z$ (with $e \approx 3$), then $Y = \frac{e^Z}{\Sigma}$, and fill in the blank (?).

In this problem we use e = 3 for easy calculation, so $e^Z = 3^Z$. (e.g. Z = 1 → 3, Z = 2 → 9)

Score ($Z$) → $3^Z$ → divide by sum ($\Sigma$) → probability ($Y$)

| Quantity | Values |
|----------|--------|
| $X$ | [1, 1] |
| $W$ | [[1, 1], [1, -1]] |
| $b$ | [-1, 1] |
| $Z = W \cdot X + b$ | [1, 1] |
| $e^Z$ | [3, ?] |
| $\Sigma$ | 6 |
| $Y$ | [1/2, 1/2] |
| Prob. | [0.5, 0.5] |