Chapter 11

Mean & Variance

The mean (expected value) is the center of a distribution; variance measures spread. Used in AI for prediction, loss, and regularization.


Mean and variance

What are mean and variance

The mean (expected value) is the center of mass of a distribution. Variance measures how much values spread around the mean. Standard deviation is the square root of variance, so it shows “typical distance from the mean” in the same units as the data.
Mean — e.g. the die average (1+2+…+6)/6 = 3.5, a class's exam average, or the "expected value" of a demand forecast. The red line in the figure is the mean μ.
Variance — the probability-weighted average of (value − mean)². Large variance ⇒ more spread. The standard deviation σ = √variance brings spread back to the original units (points, kg, etc.): e.g. "mean 70, σ = 10" means many scores lie roughly in 60–80.
The mean tells where the center is; variance and standard deviation tell uncertainty or spread. In AI they are used for confidence intervals, loss, and regularization.
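The definitions above can be computed directly from a probability table. A minimal sketch for the fair die mentioned earlier (values and probabilities are from the text; the variable names are just illustrative):

```python
# Mean, variance, and standard deviation of a fair six-sided die,
# computed from first principles as probability-weighted sums.
values = [1, 2, 3, 4, 5, 6]
probs = [1 / 6] * 6

mean = sum(v * p for v, p in zip(values, probs))                     # E[X]
variance = sum((v - mean) ** 2 * p for v, p in zip(values, probs))   # E[(X - E[X])^2]
std = variance ** 0.5                                                # same units as the values

print(mean)      # 3.5
print(variance)  # 35/12 ≈ 2.9167
print(std)       # ≈ 1.708
```

Note that the standard deviation (≈1.7) is in "pips", the same unit as the die faces, while the variance is in squared units.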
Concepts often used in AI — The table below summarizes mode, mean, min/max, and median: what they mean and how they are used in AI.
Concept | Meaning | In AI
Mode | The value with the highest probability; the outcome that appears most often in repeated trials. | Choosing the "most likely class" in classification; the argmax of the softmax output is the mode.
Mean (expected value) | The center of mass of the distribution; the sum of value × probability. It represents the "expected" value. | Regression predictions, loss (e.g. MSE), expected reward in reinforcement learning, and so on.
Min / Max | The smallest and largest values the variable can take; together they define the range [min, max]. | Loss minimization (gradient descent), value clipping, and setting normalization ranges.
Median | The middle value when the data are ordered by size. Unlike the mean, it is less affected by extreme values (outliers). | Summarizing data with many outliers, or whenever a robust statistic is needed.
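The four summaries in the table can all be computed with Python's standard library. A small sketch on made-up sample data with one outlier, to show why the median is the robust choice:

```python
import statistics

# Illustrative sample with one outlier (100).
data = [2, 3, 3, 4, 5, 100]

print(statistics.mode(data))    # 3    (most frequent value)
print(statistics.mean(data))    # 19.5 (pulled far up by the outlier)
print(statistics.median(data))  # 3.5  (barely affected by the outlier)
print(min(data), max(data))     # 2 100 (the range of the data)
```

The mean (19.5) is larger than five of the six values, while the median (3.5) still describes a "typical" value, matching the table's note about outliers.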
When a model gives a single-number prediction, that number is usually the mean (expected value). For example, "tomorrow's sales will be about $1M" means $1M is the expected value. A large standard deviation means the prediction is uncertain or volatile.
Uncertainty — When variance or standard deviation is large, values are spread widely around the mean, so you can tell “how reliable” the estimate is. Confidence intervals (e.g. mean ± 2σ) are essential in medicine, finance, and autonomous systems.
Loss functions — In regression, MSE (mean squared error) is the mean of squared errors, i.e. the same structure as variance. So training can be seen as reducing the variance of the prediction error.
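The claim that MSE "has the same structure as variance" can be made exact: for any set of errors, MSE = variance(error) + mean(error)². A minimal sketch with made-up errors:

```python
# MSE decomposes as variance(error) + mean(error)^2, so minimizing MSE
# shrinks both the average error (bias) and the spread of the errors.
errors = [0.5, -1.0, 2.0, -0.5, 1.0]  # illustrative prediction errors

n = len(errors)
mean_err = sum(errors) / n
mse = sum(e ** 2 for e in errors) / n
var_err = sum((e - mean_err) ** 2 for e in errors) / n  # population variance

print(mse)                                  # 1.3
print(var_err + mean_err ** 2)              # 1.3, same value
```

This identity is just E[X²] = Var(X) + (E[X])² applied to the error distribution, which is the decomposition used in the discrete formulas later in the chapter.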
Regularization and dropout — They control or reduce the variance of weights or activations. If variance is too high, predictions become unstable; regularization helps avoid overfitting and improves generalization.
Across AI — Bayesian and uncertainty-aware models predict both mean and variance (or σ). In generative models (VAE, diffusion), the mean and variance of the latent space are central.
Daily life — Exam scores are reported as “mean 70, standard deviation 10” so you see center and spread. Same for height/weight distributions, demand forecasts (expected value ± error range), and quality control (spec ± σ).
Regression — The prediction is usually the conditional expected value: “average output given this input.” We minimize MSE (mean of squared errors), i.e. we minimize a kind of average.
Classification — The model outputs probabilities per class; we take the mode (the class with the highest probability) as the predicted class. The argmax of the softmax output does exactly that.
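The "argmax of the softmax is the mode" point can be shown in a few lines. A sketch with illustrative logits (the numbers are assumptions, not from the text):

```python
import math

# Softmax turns raw logits into class probabilities; the predicted class
# is the mode of that distribution, i.e. the argmax.
logits = [2.0, 0.5, 1.0]

exps = [math.exp(z) for z in logits]
total = sum(exps)
probs = [e / total for e in exps]           # sums to 1

pred = max(range(len(probs)), key=lambda i: probs[i])
print(pred)  # 0, the class with the largest logit
```

Because exp is monotonic, the argmax of the softmax probabilities is always the argmax of the raw logits, so in practice the softmax can be skipped at prediction time.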
Reinforcement learning — Policies are evaluated using the expected reward. We learn to maximize “average future reward” for an action, which is an expectation.
Uncertainty estimation — Bayesian neural networks, ensembles, and dropout at test time estimate predictive variance. Variance (or σ) tells “how confident” the model is.
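A common way to get the predictive variance mentioned above is an ensemble: run several models (or several dropout samples) on the same input and summarize the spread of their outputs. A hypothetical sketch; the prediction values are invented for illustration:

```python
# Hypothetical ensemble: several models predict for the same input.
# The mean is the point prediction; the std is an uncertainty estimate.
predictions = [2.9, 3.1, 3.0, 3.3, 2.7]  # one output per ensemble member

n = len(predictions)
mean_pred = sum(predictions) / n
var_pred = sum((p - mean_pred) ** 2 for p in predictions) / n
std_pred = var_pred ** 0.5

print(mean_pred)                                            # 3.0 (report as the prediction)
print(mean_pred - 2 * std_pred, mean_pred + 2 * std_pred)   # rough "mean ± 2σ" interval
```

When the ensemble members disagree (large std), the interval widens, which is exactly the "how confident is the model" signal described above.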
Math flow — Ch10 defined expectation and variance; Ch11 practices computing them. The Ch12 normal distribution is fully determined by its mean μ and standard deviation σ.
Discrete case: mean = Σ value × probability, variance = E[X²] − (E[X])². With denominator 6, both 6×mean and 36×variance are integers.
Mean — add up value × probability. With denominator 6, 6×mean is an integer.
Variance — E[X²] minus (mean)². 36×variance is an integer and easy to compute.
Below: compute 6×mean and 36×variance (both integers), the mode, and the cumulative numerator.
Example. Values 1, 2, 3 with probabilities 1/6, 2/6, 3/6 — 6×mean = 1×1 + 2×2 + 3×3 = 14.
Example. Same distribution: 36×variance = 6·Σᵢ nᵢxᵢ² − (Σᵢ nᵢxᵢ)² = 6×36 − 14² = 20.
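The integer shortcut above is easy to check in code. A sketch using exact fractions to recover the true mean and variance at the end:

```python
from fractions import Fraction

# Worked example from the text: values 1, 2, 3 with numerators n_i = 1, 2, 3,
# i.e. probability of x_i is counts[i] / 6.
values = [1, 2, 3]
counts = [1, 2, 3]

six_mean = sum(n * x for n, x in zip(counts, values))                    # 6 x mean
thirty_six_var = 6 * sum(n * x * x for n, x in zip(counts, values)) - six_mean ** 2

print(six_mean)                       # 14, an integer
print(thirty_six_var)                 # 20, an integer
print(Fraction(six_mean, 6))          # 7/3, the mean itself
print(Fraction(thirty_six_var, 36))   # 5/9, the variance itself
```

Keeping everything as integers until the final division avoids floating-point rounding, which is exactly the point of the 6×mean / 36×variance trick.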