Chapter 10

Random Variables & Distributions

A random variable assigns numbers to the outcomes of an experiment; a probability distribution summarizes how likely each value is. Both are used throughout deep learning for prediction and for expressing uncertainty.


[Figure 1: Normal, Poisson, and Binomial distributions. Poisson: skewed (event count); Binomial: symmetric, peak at center (success count).]
[Figure 2: Discrete vs continuous.]

What are random variables and probability distributions?

A random variable assigns numbers to outcomes of an experiment; a probability distribution summarizes how likely each value is. The figure above shows three distributions often used in AI: normal, Poisson, and binomial.
① Discrete random variable — takes only finitely or countably many values. It can be shown in a table, as a function, or as a bar graph. The probability P(X = k) for each value k is the probability mass function (PMF); the essential condition is Σₖ P(X = k) = 1.
Discrete examples — zoo visitors per day, number of heads when flipping two coins, number of rolls until a bowling strike: countable outcomes. The Poisson and binomial bars in the figure are discrete random variables.
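The two-coin example above can be written as a tiny PMF check using only Python's standard library (a minimal sketch; the name `pmf` is just illustrative):

```python
from fractions import Fraction

# PMF of X = number of heads when flipping two fair coins (a discrete RV)
pmf = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}

# The essential PMF condition: all probabilities sum to 1
assert sum(pmf.values()) == 1

for k, p in pmf.items():
    print(f"P(X={k}) = {p}")
```

Using exact fractions instead of floats makes the sum-to-1 check exact rather than approximate.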
② Continuous random variable — takes infinitely many values in an interval. We don't assign probability to a single value; we use a probability density function (PDF) for probabilities over intervals. It's expressed by a function and a curve, not a table.
Continuous examples — annual rainfall, light-bulb lifetime, time until the next bus: continuous quantities. The normal distribution (bell curve) in the figure is a classic continuous example.
A probability distribution is the rule for which values occur and how often. The figure shows normal (continuous), Poisson (discrete), and binomial (discrete) — knowing these covers most uses in AI.
The probability mass function (PMF) is the probability P(X = k) for each value k of a discrete random variable. In a bar chart, the height of each bar is that probability, and the sum of all bar heights is 1. The figure below shows three common distributions.
Connecting to the figures — Figure 1 (above): the normal (left) is continuous (curve); Poisson and binomial (center, right) are discrete (bars). Figure 2 compares discrete (bars) and continuous (curve) side by side. In AI: normal for noise and regression, Poisson for event counts, binomial for success counts and binary classification.
Distribution condition (discrete) — The PMF is the probability P(X = k) of each value k. Essential: Σₖ P(X = k) = 1. (e.g. for a die, P(1) + … + P(6) = 1.)
In plain words: For discrete distributions, all the probabilities of the possible outcomes must add up to 1. Just like a die—the chances of 1 through 6 add up to 1.
Distribution condition (continuous) — The PDF f(x) gives probability over intervals: P(a ≤ X ≤ b) = ∫ₐᵇ f(x) dx, and the total area ∫ f(x) dx over (−∞, ∞) is 1.
In plain words: For continuous distributions, probability is the area under the curve. The probability that X falls in [a,b] is the area under the curve from a to b, and the total area under the whole curve is 1.
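The area-under-the-curve idea can be checked numerically, here with the standard normal PDF and a simple midpoint Riemann sum (an illustration of the definition, not how libraries actually integrate):

```python
import math

# Standard normal PDF (mu = 0, sigma = 1)
def pdf(x):
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

# P(a <= X <= b) as the area under the curve, via a midpoint Riemann sum
def prob(a, b, steps=100_000):
    dx = (b - a) / steps
    return sum(pdf(a + (i + 0.5) * dx) for i in range(steps)) * dx

print(round(prob(-1, 1), 4))    # ≈ 0.6827 (the "68%" of the 68-95-99.7 rule)
print(round(prob(-10, 10), 4))  # ≈ 1.0 — total area under the curve
```

Note that `prob(a, a)` is 0: a single point has no area, which is why continuous variables get probabilities only over intervals.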
Expectation (mean) — Discrete: E[X] = Σₖ xₖ P(X = k); continuous: E[X] = ∫ x f(x) dx. The “average weighted by probability.”
In plain words: Expectation is the average value when each outcome is weighted by its probability. For a die, it's (1×1/6)+(2×1/6)+…+(6×1/6)=3.5—the "probability-weighted" average.
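The die example works out exactly with Python's `fractions` module (a minimal sketch; `pmf` is an illustrative name):

```python
from fractions import Fraction

# Fair die: each value 1..6 has probability 1/6
pmf = {k: Fraction(1, 6) for k in range(1, 7)}

# Expectation: sum of value × probability
expectation = sum(k * p for k, p in pmf.items())
print(expectation)         # 7/2
print(float(expectation))  # 3.5
```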
Variance — Var(X) = E[(X − E[X])²]. Standard deviation is σ = √Var(X). Ch11 covers this in detail.
In plain words: Variance measures how spread out the values are from the mean. You take (each value minus the mean), square it, then average by probability; the square root of variance is the standard deviation.
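Continuing the die example, variance and standard deviation follow the same recipe (a sketch; the variable names are illustrative):

```python
from fractions import Fraction

# Fair die PMF and its mean, as above
pmf = {k: Fraction(1, 6) for k in range(1, 7)}
mean = sum(k * p for k, p in pmf.items())               # 7/2

# Variance: average of squared deviations, weighted by probability
var = sum((k - mean) ** 2 * p for k, p in pmf.items())  # E[(X − E[X])²]
print(var)                # 35/12
print(float(var) ** 0.5)  # standard deviation ≈ 1.708
```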
Normal distribution (continuous) — Density f(x) = (1 / (σ√(2π))) e^(−(x−μ)² / (2σ²)). μ is the mean, σ the standard deviation.
In plain words: A symmetric bell-shaped curve centered at the mean μ. The spread is controlled by σ (standard deviation)—larger σ means a wider, flatter curve. Often used for heights, measurement error, and noise.
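The effect of σ on the curve's peak can be checked directly from the density formula (a sketch using only the standard library; `normal_pdf` is an illustrative name):

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Peak height at x = mu is 1/(sigma * sqrt(2*pi)):
# doubling sigma halves the peak, giving a wider, flatter curve
print(round(normal_pdf(0, sigma=1), 4))  # ≈ 0.3989
print(round(normal_pdf(0, sigma=2), 4))  # ≈ 0.1995
```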
Poisson distribution (discrete) — P(X = k) = λᵏ e^(−λ) / k! for k = 0, 1, 2, …. λ is the average number of events in a fixed interval.
In plain words: Used when counting how many times an event happens in a fixed time or space. λ is the average count; the formula gives the probability of exactly k events. The bar chart is usually skewed to one side.
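The Poisson PMF is a one-liner; the sketch below tabulates P(X = k) for an average of λ = 3 events (variable names are illustrative):

```python
import math

# Poisson PMF: P(X = k) = lambda^k e^(-lambda) / k!
def poisson_pmf(k, lam):
    return lam ** k * math.exp(-lam) / math.factorial(k)

# Probabilities peak near lam = 3, then tail off (the skewed bar chart)
for k in range(6):
    print(k, round(poisson_pmf(k, 3), 4))
```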
Binomial distribution (discrete) — P(X = k) = C(n, k) pᵏ (1 − p)^(n−k). n = number of trials, p = success probability per trial.
In plain words: You run the same trial n times and count how many successes (k). p is the chance of success on one trial. Like flipping a coin n times and counting heads—often gives a symmetric, peaked bar chart.
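Likewise for the binomial PMF, here with n = 10 fair coin flips (a sketch; `binom_pmf` is an illustrative name):

```python
import math

# Binomial PMF: P(X = k) = C(n, k) p^k (1-p)^(n-k)
def binom_pmf(k, n, p):
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

# 10 fair coin flips: symmetric bar chart, peaked at k = 5
for k in range(11):
    print(k, round(binom_pmf(k, 10, 0.5), 4))
```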
When we predict with "possible values and their probabilities," that's a random variable and distribution. The three distributions in the figure are used in AI to express uncertainty.
AI and the figure — Normal for regression, noise, and latent spaces; Poisson for view counts, clicks, and event counts; binomial for binary classification and success probability. Softmax, sampling, and cross-entropy loss all tie to these distributions.
Daily life — zoo visitors (discrete); rainfall, bulb lifetime, bus wait time (continuous). Distinguishing countable vs continuous matches the bars (discrete) and curve (continuous) in the figure.
In AI — The normal in the figure models errors and Gaussian noise; Poisson models count data and word frequency; binomial models class probability and success/failure. Ch11–Ch12 cover mean, variance, and the normal distribution in more detail.
For a discrete random variable: ① list possible values and their probabilities → ② check that probabilities sum to 1 → ③ expectation = sum of (value)×(probability).
Sum of probabilities — P(X=1) + P(X=2) + P(X=3) = 1. With denominator 6, a/6 + b/6 + c/6 = 1 gives a + b + c = 6. Knowing two of a, b, c gives the third.
Expectation — E[X] = x₁p₁ + x₂p₂ + x₃p₃. When the denominator is 6, 6·E[X] is an integer, so problems may ask for “6 × expectation”.
Examples — Fill the blank so probabilities sum to 1, or find 6×expectation.
Ex 1. Three probabilities a/6, b/6, c/6 sum to 1, so a+b+c=6. If a=1 and b=2, then c=3.
Ex 2. Values 1, 2, 3 with probabilities 1/6, 2/6, 3/6: 6×expectation = 1×1+2×2+3×3 = 14.
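The three-step checklist and both worked examples can be verified with exact fractions (a minimal sketch; variable names are illustrative):

```python
from fractions import Fraction

# ① List possible values and their probabilities (Ex 1: a=1, b=2 forces c=3)
a, b = 1, 2
c = 6 - a - b  # from a + b + c = 6
pmf = {1: Fraction(a, 6), 2: Fraction(b, 6), 3: Fraction(c, 6)}

# ② Check that probabilities sum to 1
assert sum(pmf.values()) == 1

# ③ Expectation = sum of (value) × (probability), then ×6 (Ex 2)
expectation = sum(k * p for k, p in pmf.items())
print(c)                 # 3
print(6 * expectation)   # 14
```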