Chapter 10

Random Variables and Probability Distributions: Capturing Uncertainty in Numbers

A random variable assigns numbers to outcomes of an experiment; a probability distribution summarizes how likely each value is. Both are used throughout deep learning for prediction and uncertainty estimation.


Figure 1: Normal (continuous curve centered at μ); Poisson with λ=1.5 (discrete bars, skewed right: "how many times" an event occurred); binomial with n=10, p=0.5 (discrete bars, symmetric with a peak at the center: "number of successes" in n trials).

Figure 2: Discrete (bars) vs continuous (curve).

What are random variables and probability distributions?

A random variable maps the outcome of a trial (experiment) to a number. It is usually written X. For example, the moment we agree that heads = 1 and tails = 0, the real-world act of flipping a coin becomes the mathematical variable X. A probability distribution is the rule that shows at a glance (like a map) with what probability each of those numbers appears.
① Discrete random variable — takes only finitely or countably many values. It can be shown in a table, as a function, or as a bar graph. The probability P(X=k) for each value k is the probability mass function (PMF); the essential condition is ∑_k P(X=k) = 1.
Representative discrete distributions: The binomial distribution deals with the number of heads when flipping a coin multiple times. The Poisson distribution deals with event counts such as how many customers arrive in a given time period.
② Continuous random variable — takes infinitely many values in an interval. The probability of any single exact value (e.g. exactly 170.00 cm) is 0, because the area under a curve at a single point is zero. We use a probability density function (PDF) for probabilities over intervals (e.g. 170–180 cm). It is expressed by a function and a curve, not a table.
Representative continuous distribution: The bell-shaped normal distribution is most representative, as many natural data (measurement error, score distributions, etc.) follow it.
A probability distribution is the rule for which values occur and how often. The figure shows normal (continuous), Poisson (discrete), and binomial (discrete) — knowing these covers most uses in AI.
The probability mass function (PMF) gives the probability P(X=k) for each value k of a discrete random variable. In a bar chart, the height of each bar is that probability, and the sum of all bar heights is 1. The figure above shows three common distributions.
Connecting to the figures — Figure 1 (above): the normal (left) is continuous (curve); Poisson and binomial (center, right) are discrete (bars). Figure 2 compares discrete (bars) and continuous (curve) side by side. In AI: normal for noise and regression, Poisson for event counts, binomial for success counts and binary classification.
Distribution condition (discrete) — The PMF is the probability P(X=k) of each value k. Essential: ∑_k P(X=k) = 1. (e.g. For a die, P(1)+⋯+P(6) = 1.)
In plain words: For discrete distributions, all the probabilities of the possible outcomes must add up to 1. Just like a die—the chances of 1 through 6 add up to 1.
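The die check above can be sketched in a few lines of Python (a minimal sketch; the fair-die PMF here is just an illustrative example):

```python
from fractions import Fraction

# PMF of a fair die: each face 1..6 has probability 1/6.
pmf = {k: Fraction(1, 6) for k in range(1, 7)}

# Discrete distribution condition: all probabilities sum to 1.
total = sum(pmf.values())
print(total)  # 1
```

Using `Fraction` keeps the arithmetic exact, so the sum is literally 1 rather than a float close to 1.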
Distribution condition (continuous) — The PDF f(x) gives probability over intervals: P(a ≤ X ≤ b) = ∫_a^b f(x) dx, and the total area is ∫_{−∞}^{∞} f(x) dx = 1.
In plain words: For continuous distributions, probability is the area under the curve. The probability that X falls in [a,b] is the area under the curve from a to b, and the total area under the whole curve is 1.
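"Probability is area under the curve" can be verified numerically. The sketch below (helper names `normal_pdf` and `integrate` are my own, not a library API) approximates the integral of a standard normal density with a midpoint Riemann sum:

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    # f(x) = e^(-(x-mu)^2 / (2 sigma^2)) / (sigma * sqrt(2 pi))
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def integrate(f, a, b, n=10_000):
    # Midpoint Riemann sum: area under f from a to b.
    h = (b - a) / n
    return sum(f(a + (i + 0.5) * h) for i in range(n)) * h

# P(-1 <= X <= 1) for a standard normal is about 0.6827.
print(round(integrate(normal_pdf, -1, 1), 4))
# Total area over a wide interval is (essentially) 1.
print(round(integrate(normal_pdf, -10, 10), 4))
```

The second print illustrates the total-area-equals-1 condition; the first shows how an interval probability is computed from the same curve.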
Expectation (mean) — Discrete: E[X] = ∑_k x_k·P(X=k); continuous: E[X] = ∫ x·f(x) dx. The "average weighted by probability."
In plain words: Expectation is the average value when each outcome is weighted by its probability. For a die, it's (1×1/6)+(2×1/6)+…+(6×1/6)=3.5—the "probability-weighted" average.
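The die expectation works out exactly in code (a minimal sketch of the probability-weighted average):

```python
from fractions import Fraction

# E[X] = sum of value * probability for a fair die.
p = Fraction(1, 6)
expectation = sum(k * p for k in range(1, 7))
print(expectation)  # 7/2, i.e. 3.5
```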
Variance — Var(X) = E[(X − E[X])²]. Standard deviation is σ = √Var(X). Ch11 covers this in detail.
In plain words: Variance measures how spread out the values are from the mean. You take (each value minus the mean), square it, then average by probability; the square root of variance is the standard deviation.
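The same die example makes the variance definition concrete (a sketch; exact fractions are used so the result is clean):

```python
from fractions import Fraction

pmf = {k: Fraction(1, 6) for k in range(1, 7)}

# Mean, then variance = probability-weighted average of squared deviations.
mean = sum(k * p for k, p in pmf.items())                 # 7/2
var = sum((k - mean) ** 2 * p for k, p in pmf.items())    # 35/12
std = float(var) ** 0.5                                   # standard deviation
print(mean, var, round(std, 4))
```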
Normal distribution (continuous) — Density f(x) = (1 / (σ√(2π))) · e^(−(x−μ)² / (2σ²)). μ is the mean, σ the standard deviation.
In plain words: A symmetric bell-shaped curve centered at the mean μ. The spread is controlled by σ (standard deviation)—larger σ means a wider, flatter curve. Often used for heights, measurement error, and noise.
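The effect of σ on the bell curve can be seen directly from the density formula (sketch; `normal_pdf` is a hypothetical helper, and the peak height at x = μ is 1/(σ√(2π))):

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Peak height at x = mu: larger sigma -> wider, flatter curve.
for sigma in (0.5, 1.0, 2.0):
    print(sigma, round(normal_pdf(0.0, 0.0, sigma), 4))
```

Doubling σ halves the peak, exactly as the 1/σ factor in the formula predicts.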
Poisson distribution (discrete) — P(X=k) = λ^k · e^(−λ) / k! (k = 0, 1, 2, …). λ is the average number of events in a fixed interval.
In plain words: Used when counting how many times an event happens in a fixed time or space. λ is the average count; the formula gives the probability of exactly k events. The bar chart is usually skewed to one side.
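The Poisson formula from above, with the λ = 1.5 used in the figure (a sketch; `poisson_pmf` is my own helper name):

```python
import math

def poisson_pmf(k, lam):
    # P(X = k) = lam^k * e^(-lam) / k!
    return lam ** k * math.exp(-lam) / math.factorial(k)

# Probabilities for k = 0..6 with lam = 1.5; the bars peak at k = 1
# and then trail off to the right (the skew seen in the figure).
probs = [poisson_pmf(k, 1.5) for k in range(7)]
print([round(p, 3) for p in probs])
```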
Binomial distribution (discrete) — P(X=k) = C(n,k) · p^k (1−p)^(n−k), where C(n,k) = n! / (k!(n−k)!) counts the ways to place k successes among n trials. n = number of trials, p = success probability per trial.
In plain words: You run the same trial n times and count how many successes (k). p is the chance of success on one trial. Like flipping a coin n times and counting heads—often gives a symmetric, peaked bar chart.
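The same coin-flipping setup as the figure, n = 10 and p = 0.5, computed from the formula (sketch; `binomial_pmf` is a hypothetical helper built on `math.comb`):

```python
import math

def binomial_pmf(k, n, p):
    # P(X = k) = C(n, k) * p^k * (1 - p)^(n - k)
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

probs = [binomial_pmf(k, 10, 0.5) for k in range(11)]
print(round(probs[5], 4))    # peak at k = 5: 252/1024 ~ 0.2461
print(round(sum(probs), 4))  # all eleven probabilities sum to 1
```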
Random variables and distributions are the basis of prediction and decision. AI does not just say "this is a cat." It outputs a probability distribution over a random variable — e.g. "probability of cat 0.98, dog 0.02." From that distribution we see how confident the model is.
Managing uncertainty: Real data is noisy and uncertain. By modeling measurement error with the normal distribution or binary outcomes (e.g. spam or not) with the binomial, AI uses probability to reach the most reasonable conclusion.
Everyday statistics: rain probability (discrete), average lifespan or height distribution (continuous). Many quantities around us are described by random variables and distributions — discrete (bars) vs continuous (curves) — so we can read the world clearly.
Inside deep learning: Weights are often initialized with the normal distribution; the last layer uses softmax to turn outputs into a probability distribution (sum 1). Probability distributions are involved at every stage; understanding them shows how AI generates and classifies data.
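The softmax step mentioned above can be sketched in plain Python (a minimal version; the logits [4.0, 0.1] for "cat" and "dog" are made-up illustration values):

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability, exponentiate, normalize.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

# Hypothetical last-layer outputs for the classes (cat, dog):
probs = softmax([4.0, 0.1])
print([round(p, 2) for p in probs])  # roughly [0.98, 0.02]
```

Whatever numbers go in, the outputs are non-negative and sum to 1, i.e. a valid probability distribution over the classes.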
For a discrete random variable: ① list possible values and their probabilities →
② check that probabilities sum to 1 →
③ expectation = sum of (value)×(probability).
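The three steps above, applied to the distribution used in the worked examples below (a sketch; exact fractions keep step ② an exact equality):

```python
from fractions import Fraction

# Step 1: list possible values and their probabilities.
dist = {1: Fraction(1, 6), 2: Fraction(2, 6), 3: Fraction(3, 6)}

# Step 2: check that the probabilities sum to 1.
assert sum(dist.values()) == 1

# Step 3: expectation = sum of value * probability.
expectation = sum(k * p for k, p in dist.items())
print(expectation, 6 * expectation)  # 7/3 and 14 (the "6x expectation")
```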
Sum of probabilities — P(X=1) + P(X=2) + P(X=3) = 1. With denominator 6, a/6 + b/6 + c/6 = 1 gives a+b+c = 6. Knowing two of a, b, c gives the third.
Expectation — E[X] = x₁p₁ + x₂p₂ + x₃p₃. When the denominator is 6, 6·E[X] is an integer, so problems may ask for "6×expectation".
Examples — Fill the blank so probabilities sum to 1, or find 6×expectation.
Ex 1. Three probabilities a/6, b/6, c/6 sum to 1, so a+b+c=6. If a=1 and b=2, then c=3.
Ex 2. Values 1, 2, 3 with probabilities 1/6, 2/6, 3/6: 6×expectation = 1×1+2×2+3×3 = 14.
Problem types and how to solve
  • Sum of probabilities — a/6, b/6, c/6 sum to 1; find the blank. Solve: a+b+c = 6; two given → find the third.
  • 6×expectation — 6·E[X] = ∑(value × numerator). Solve: multiply each value by its numerator and add.
  • 36×variance — 36·Var(X). Solve: 6·∑ n_i x_i² − (∑ n_i x_i)², where n_i = numerator and x_i = value.
  • Mode — the value with the highest probability. Solve: the X value whose bar is tallest.
  • Cumulative numerator — P(X ≤ k) in the form ?/6; find the numerator. Solve: sum the numerators of probabilities for values ≤ k.
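The 36×variance shortcut can be checked against the definition (a sketch using the same values 1, 2, 3 with numerators 1, 2, 3 as in the examples):

```python
from fractions import Fraction

xs = [1, 2, 3]  # values x_i
ns = [1, 2, 3]  # numerators n_i, probabilities n_i/6

# Variance straight from the definition.
mean = sum(Fraction(n, 6) * x for n, x in zip(ns, xs))
var = sum(Fraction(n, 6) * (x - mean) ** 2 for n, x in zip(ns, xs))

# Shortcut from the table: 36*Var = 6*sum(n_i x_i^2) - (sum(n_i x_i))^2
shortcut = 6 * sum(n * x * x for n, x in zip(ns, xs)) - sum(n * x for n, x in zip(ns, xs)) ** 2
print(36 * var, shortcut)  # both 20
```

The shortcut is just 36·(E[X²] − E[X]²) with the 1/6 denominators cleared, which is why it stays in whole numbers.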

Example (sum of probabilities)
Three probabilities 1/6, 2/6, c/6 sum to 1. Find c.
Solution
1+2+c=6 → c=3. → Answer 3

Example (6×expectation)
Values 1, 2, 3 with probabilities 1/6, 2/6, 3/6. Find 6×expectation.
Solution
6·E[X] = 1×1 + 2×2 + 3×3 = 1+4+9 = 14. → Answer 14