Chapter 10

Random Variables and Probability Distributions: Capturing Uncertainty in Numbers

A random variable assigns numbers to outcomes of an experiment; a probability distribution summarizes how likely each value is. Both are used throughout deep learning for prediction and uncertainty estimation.


Figure 1: Normal (continuous curve centered at μ); Poisson with λ=1.5 (discrete bars, skewed right: "how many times" an event occurred); binomial with n=10, p=0.5 (discrete bars, symmetric with a peak at the center: "number of successes" in n trials).

Figure 2: Discrete (bars) vs continuous (curve).

What are random variables and probability distributions?

A random variable maps the outcome of a trial (experiment) to a number. It is usually written X. For example, the moment we agree that heads = 1 and tails = 0, the real-world act of flipping a coin becomes the mathematical variable X. A probability distribution is the rule that shows at a glance (like a map) with what probability each of those numbers appears.
① Discrete random variable — takes only finitely or countably many values. It can be shown in a table, as a function, or as a bar graph. The probability P(X=k) for each value k is the probability mass function (PMF); the essential condition is ∑_k P(X=k) = 1.
Representative discrete distributions: The binomial distribution deals with the number of heads when flipping a coin multiple times. The Poisson distribution deals with event counts such as how many customers arrive in a given time period.
② Continuous random variable — takes infinitely many values in an interval. The probability of any single exact value (e.g. exactly 170.00 cm) is 0, because the area under a curve at a single point is zero. We use a probability density function (PDF) for probabilities over intervals (e.g. 170–180 cm). It is expressed by a function and a curve, not a table.
Representative continuous distribution: The bell-shaped normal distribution is most representative, as many natural data (measurement error, score distributions, etc.) follow it.
A probability distribution is the rule for which values occur and how often. The figure shows normal (continuous), Poisson (discrete), and binomial (discrete) — knowing these covers most uses in AI.
The probability mass function (PMF) gives the probability P(X=k) for each value k of a discrete random variable. In a bar chart, the height of each bar is that probability, and the sum of all bar heights is 1. The figure above shows three common distributions.
Connecting to the figures — Figure 1 (above): the normal (left) is continuous (curve); Poisson and binomial (center, right) are discrete (bars). Figure 2 compares discrete (bars) and continuous (curve) side by side. In AI: normal for noise and regression, Poisson for event counts, binomial for success counts and binary classification.
Distribution condition (discrete) — The PMF is the probability P(X=k) of each value k. Essential: ∑_k P(X=k) = 1. (e.g. For a die, P(1)+⋯+P(6) = 1.)
In plain words: For discrete distributions, all the probabilities of the possible outcomes must add up to 1. Just like a die—the chances of 1 through 6 add up to 1.
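The die check above can be sketched in a few lines of Python (a minimal sketch; the fair-die PMF here is just an illustrative example):

```python
from fractions import Fraction

# PMF of a fair die: each face 1..6 has probability 1/6.
pmf = {k: Fraction(1, 6) for k in range(1, 7)}

# Discrete distribution condition: all probabilities sum to 1.
total = sum(pmf.values())
print(total)  # 1
```

Using `Fraction` keeps the arithmetic exact, so the sum is literally 1 rather than a float close to 1.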
Distribution condition (continuous) — The PDF f(x) gives probability over intervals: P(a ≤ X ≤ b) = ∫_a^b f(x) dx, and the total area is ∫_{−∞}^{∞} f(x) dx = 1.
In plain words: For continuous distributions, probability is the area under the curve. The probability that X falls in [a,b] is the area under the curve from a to b, and the total area under the whole curve is 1.
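"Probability is area under the curve" can be verified numerically. The sketch below (helper names `normal_pdf` and `integrate` are my own, not a library API) approximates the integral of a standard normal density with a midpoint Riemann sum:

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    # f(x) = e^(-(x-mu)^2 / (2 sigma^2)) / (sigma * sqrt(2 pi))
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def integrate(f, a, b, n=10_000):
    # Midpoint Riemann sum: area under f from a to b.
    h = (b - a) / n
    return sum(f(a + (i + 0.5) * h) for i in range(n)) * h

# P(-1 <= X <= 1) for a standard normal is about 0.6827.
print(round(integrate(normal_pdf, -1, 1), 4))
# Total area over a wide interval is (essentially) 1.
print(round(integrate(normal_pdf, -10, 10), 4))
```

The second print illustrates the total-area-equals-1 condition; the first shows how an interval probability is computed from the same curve.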
Expectation (mean) — Discrete: E[X] = ∑_k x_k·P(X=k); continuous: E[X] = ∫ x·f(x) dx. The "average weighted by probability."
In plain words: Expectation is the average value when each outcome is weighted by its probability. For a die, it's (1×1/6)+(2×1/6)+…+(6×1/6)=3.5—the "probability-weighted" average.
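The die expectation works out exactly in code (a minimal sketch of the probability-weighted average):

```python
from fractions import Fraction

# E[X] = sum of value * probability for a fair die.
p = Fraction(1, 6)
expectation = sum(k * p for k in range(1, 7))
print(expectation)  # 7/2, i.e. 3.5
```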
Variance — Var(X) = E[(X − E[X])²]. Standard deviation is σ = √Var(X). Ch11 covers this in detail.
In plain words: Variance measures how spread out the values are from the mean. You take (each value minus the mean), square it, then average by probability; the square root of variance is the standard deviation.
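The same die example makes the variance definition concrete (a sketch; exact fractions are used so the result is clean):

```python
from fractions import Fraction

pmf = {k: Fraction(1, 6) for k in range(1, 7)}

# Mean, then variance = probability-weighted average of squared deviations.
mean = sum(k * p for k, p in pmf.items())                 # 7/2
var = sum((k - mean) ** 2 * p for k, p in pmf.items())    # 35/12
std = float(var) ** 0.5                                   # standard deviation
print(mean, var, round(std, 4))
```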
Normal distribution (continuous) — Density f(x) = (1 / (σ√(2π))) · e^(−(x−μ)² / (2σ²)). μ is the mean, σ the standard deviation.
In plain words: A symmetric bell-shaped curve centered at the mean μ. The spread is controlled by σ (standard deviation)—larger σ means a wider, flatter curve. Often used for heights, measurement error, and noise.
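The effect of σ on the bell curve can be seen directly from the density formula (sketch; `normal_pdf` is a hypothetical helper, and the peak height at x = μ is 1/(σ√(2π))):

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Peak height at x = mu: larger sigma -> wider, flatter curve.
for sigma in (0.5, 1.0, 2.0):
    print(sigma, round(normal_pdf(0.0, 0.0, sigma), 4))
```

Doubling σ halves the peak, exactly as the 1/σ factor in the formula predicts.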
Poisson distribution (discrete) — P(X=k) = λ^k · e^(−λ) / k! (k = 0, 1, 2, …). λ is the average number of events in a fixed interval.
In plain words: Used when counting how many times an event happens in a fixed time or space. λ is the average count; the formula gives the probability of exactly k events. The bar chart is usually skewed to one side.
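The Poisson formula from above, with the λ = 1.5 used in the figure (a sketch; `poisson_pmf` is my own helper name):

```python
import math

def poisson_pmf(k, lam):
    # P(X = k) = lam^k * e^(-lam) / k!
    return lam ** k * math.exp(-lam) / math.factorial(k)

# Probabilities for k = 0..6 with lam = 1.5; the bars peak at k = 1
# and then trail off to the right (the skew seen in the figure).
probs = [poisson_pmf(k, 1.5) for k in range(7)]
print([round(p, 3) for p in probs])
```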
Binomial distribution (discrete) — P(X=k) = C(n,k) · p^k (1−p)^(n−k), where C(n,k) = n! / (k!(n−k)!) counts the ways to place k successes among n trials. n = number of trials, p = success probability per trial.
In plain words: You run the same trial n times and count how many successes (k). p is the chance of success on one trial. Like flipping a coin n times and counting heads—often gives a symmetric, peaked bar chart.
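The same coin-flipping setup as the figure, n = 10 and p = 0.5, computed from the formula (sketch; `binomial_pmf` is a hypothetical helper built on `math.comb`):

```python
import math

def binomial_pmf(k, n, p):
    # P(X = k) = C(n, k) * p^k * (1 - p)^(n - k)
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

probs = [binomial_pmf(k, 10, 0.5) for k in range(11)]
print(round(probs[5], 4))    # peak at k = 5: 252/1024 ~ 0.2461
print(round(sum(probs), 4))  # all eleven probabilities sum to 1
```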
Random variables and distributions are the basis of prediction and decision. AI does not just say "this is a cat." It outputs a probability distribution over a random variable — e.g. "probability of cat 0.98, dog 0.02." From that distribution we see how confident the model is.
Managing uncertainty: Real data is noisy and uncertain. By modeling measurement error with the normal distribution or binary outcomes (e.g. spam or not) with the binomial, AI uses probability to reach the most reasonable conclusion.
Everyday statistics: rain probability (discrete), average lifespan or height distribution (continuous). Many quantities around us are described by random variables and distributions — discrete (bars) vs continuous (curves) — so we can read the world clearly.
Inside deep learning: Weights are often initialized with the normal distribution; the last layer uses softmax to turn outputs into a probability distribution (sum 1). Probability distributions are involved at every stage; understanding them shows how AI generates and classifies data.
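The softmax step mentioned above can be sketched in plain Python (a minimal version; the logits [4.0, 0.1] for "cat" and "dog" are made-up illustration values):

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability, exponentiate, normalize.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

# Hypothetical last-layer outputs for the classes (cat, dog):
probs = softmax([4.0, 0.1])
print([round(p, 2) for p in probs])  # roughly [0.98, 0.02]
```

Whatever numbers go in, the outputs are non-negative and sum to 1, i.e. a valid probability distribution over the classes.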
For a discrete random variable: ① list possible values and their probabilities →
② check that probabilities sum to 1 →
③ expectation = sum of (value)×(probability).
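The three steps above, applied to the distribution used in the worked examples below (a sketch; exact fractions keep step ② an exact equality):

```python
from fractions import Fraction

# Step 1: list possible values and their probabilities.
dist = {1: Fraction(1, 6), 2: Fraction(2, 6), 3: Fraction(3, 6)}

# Step 2: check that the probabilities sum to 1.
assert sum(dist.values()) == 1

# Step 3: expectation = sum of value * probability.
expectation = sum(k * p for k, p in dist.items())
print(expectation, 6 * expectation)  # 7/3 and 14 (the "6x expectation")
```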
Sum of probabilities — P(X=1) + P(X=2) + P(X=3) = 1. With denominator 6, a/6 + b/6 + c/6 = 1 gives a+b+c = 6. Knowing two of a, b, c gives the third.
Expectation — E[X] = x₁p₁ + x₂p₂ + x₃p₃. When the denominator is 6, 6·E[X] is an integer, so problems may ask for "6×expectation".
Examples — Fill the blank so probabilities sum to 1, or find 6×expectation.
Ex 1. Three probabilities a/6, b/6, c/6 sum to 1, so a+b+c=6. If a=1 and b=2, then c=3.
Ex 2. Values 1, 2, 3 with probabilities 1/6, 2/6, 3/6: 6×expectation = 1×1+2×2+3×3 = 14.
Problem types and how to solve
  • Sum of probabilities — a/6, b/6, c/6 sum to 1; find the blank. Solve: a+b+c = 6; two given → find the third.
  • 6×expectation — 6·E[X] = ∑(value × numerator). Solve: multiply each value by its numerator and add.
  • 36×variance — 36·Var(X). Solve: 6·∑ n_i x_i² − (∑ n_i x_i)², where n_i = numerator and x_i = value.
  • Mode — the value with the highest probability. Solve: the X value whose bar is tallest.
  • Cumulative numerator — P(X ≤ k) in the form ?/6; find the numerator. Solve: sum the numerators of probabilities for values ≤ k.
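The 36×variance shortcut can be checked against the definition (a sketch using the same values 1, 2, 3 with numerators 1, 2, 3 as in the examples):

```python
from fractions import Fraction

xs = [1, 2, 3]  # values x_i
ns = [1, 2, 3]  # numerators n_i, probabilities n_i/6

# Variance straight from the definition.
mean = sum(Fraction(n, 6) * x for n, x in zip(ns, xs))
var = sum(Fraction(n, 6) * (x - mean) ** 2 for n, x in zip(ns, xs))

# Shortcut from the table: 36*Var = 6*sum(n_i x_i^2) - (sum(n_i x_i))^2
shortcut = 6 * sum(n * x * x for n, x in zip(ns, xs)) - sum(n * x for n, x in zip(ns, xs)) ** 2
print(36 * var, shortcut)  # both 20
```

The shortcut is just 36·(E[X²] − E[X]²) with the 1/6 denominators cleared, which is why it stays in whole numbers.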

Example (sum of probabilities)
Three probabilities 1/6, 2/6, c/6 sum to 1. Find c.
Solution
1+2+c=6 → c=3. → Answer 3

Example (6×expectation)
Values 1, 2, 3 with probabilities 1/6, 2/6, 3/6. Find 6×expectation.
Solution
6·E[X] = 1×1 + 2×2 + 3×3 = 1+4+9 = 14. → Answer 14