Appendix B: Probability Review

This appendix reviews probability concepts used throughout the book. For a comprehensive treatment, consult a probability textbook such as DeGroot & Schervish (2012) or Casella & Berger (2002).


B.1 Probability Fundamentals

Sample Space and Events

The sample space $\Omega$ is the set of all possible outcomes. An event is a subset of the sample space.

Example: Flipping a coin twice:

  • Sample space: $\Omega = \{HH, HT, TH, TT\}$

  • Event "at least one head": $A = \{HH, HT, TH\}$

Probability Axioms (Kolmogorov)

For any event $A$:

  1. $P(A) \geq 0$ (non-negativity)

  2. $P(\Omega) = 1$ (normalization)

  3. If $A_1, A_2, \ldots$ are mutually exclusive: $P\left(\cup_i A_i\right) = \sum_i P(A_i)$ (countable additivity)

Implications

  • $P(\emptyset) = 0$

  • $P(A^c) = 1 - P(A)$

  • If $A \subset B$, then $P(A) \leq P(B)$

  • $P(A \cup B) = P(A) + P(B) - P(A \cap B)$ (checked numerically in the sketch below)
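
As a quick illustration (a minimal sketch written for this appendix, not taken from the text), the two-coin-flip sample space above can be enumerated directly to verify the complement rule and inclusion–exclusion; the events `A` and `B` below are chosen for illustration.

```python
from itertools import product

# Sample space for two fair coin flips; each outcome has probability 1/4.
omega = list(product("HT", repeat=2))
P = lambda event: sum(1 for w in omega if w in event) / len(omega)

A = {w for w in omega if "H" in w}       # at least one head
B = {w for w in omega if w[0] == "H"}    # first flip is a head

# Complement rule: P(A^c) = 1 - P(A)
assert P(set(omega) - A) == 1 - P(A)

# Inclusion-exclusion: P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
assert P(A | B) == P(A) + P(B) - P(A & B)

print(P(A), P(B), P(A | B))   # 0.75 0.5 0.75
```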


B.2 Conditional Probability

Definition

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \quad \text{provided } P(B) > 0$$

Intuition: the probability of $A$ when attention is restricted to outcomes where $B$ occurred.

Multiplication Rule

$$P(A \cap B) = P(A \mid B) \cdot P(B) = P(B \mid A) \cdot P(A)$$

Law of Total Probability

If $B_1, B_2, \ldots, B_k$ partition $\Omega$:

$$P(A) = \sum_{j=1}^k P(A \mid B_j) P(B_j)$$

Bayes' Theorem

$$P(B \mid A) = \frac{P(A \mid B) P(B)}{P(A)} = \frac{P(A \mid B) P(B)}{\sum_j P(A \mid B_j) P(B_j)}$$

Intuition: updates the prior belief $P(B)$ to the posterior $P(B \mid A)$ in light of the evidence $A$.
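
A standard numeric illustration (the numbers are chosen for this sketch, not taken from the text): a test with 99% sensitivity and a 5% false-positive rate for a condition with 1% prevalence. The law of total probability gives the denominator, and Bayes' theorem gives the posterior probability of the condition given a positive test.

```python
# Hypothetical numbers for illustration only.
p_B = 0.01             # prior: P(condition)
p_A_given_B = 0.99     # sensitivity: P(positive | condition)
p_A_given_notB = 0.05  # false-positive rate: P(positive | no condition)

# Law of total probability: P(positive)
p_A = p_A_given_B * p_B + p_A_given_notB * (1 - p_B)

# Bayes' theorem: P(condition | positive)
p_B_given_A = p_A_given_B * p_B / p_A
print(round(p_B_given_A, 3))  # ~0.167: most positive tests are false positives
```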


B.3 Independence

Definition

Events $A$ and $B$ are independent if:

$$P(A \cap B) = P(A) \cdot P(B)$$

Equivalently (when $P(B) > 0$): $P(A \mid B) = P(A)$, i.e. knowing $B$ doesn't change the probability of $A$.

Conditional Independence

$A$ and $B$ are conditionally independent given $C$ if:

$$P(A \cap B \mid C) = P(A \mid C) \cdot P(B \mid C)$$

Notation: $A \perp B \mid C$

Caution: Independence does not imply conditional independence, and vice versa.
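
The caution above can be checked by brute force. In this sketch (an example constructed here, not taken from the text), $A$ and $B$ are events about two independent fair coin flips and $C$ is the event that the flips match: $A$ and $B$ are independent unconditionally, but not given $C$.

```python
from itertools import product

omega = list(product([0, 1], repeat=2))          # two independent fair bits
P = lambda pred: sum(1 for w in omega if pred(w)) / len(omega)

A = lambda w: w[0] == 1                          # first flip is 1
B = lambda w: w[1] == 1                          # second flip is 1
C = lambda w: w[0] == w[1]                       # flips match

# Unconditional independence: P(A ∩ B) = P(A) P(B)
print(P(lambda w: A(w) and B(w)), P(A) * P(B))   # 0.25 0.25

# Given C, independence fails: P(A ∩ B | C) != P(A | C) P(B | C)
P_cond = lambda pred: P(lambda w: pred(w) and C(w)) / P(C)
print(P_cond(lambda w: A(w) and B(w)))           # 0.5
print(P_cond(A) * P_cond(B))                     # 0.25
```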


B.4 Random Variables

Discrete Random Variables

A discrete random variable $X$ takes countably many values. It is characterized by:

  • Probability mass function (PMF): $p_X(x) = P(X = x)$

  • Properties: $p_X(x) \geq 0$ and $\sum_x p_X(x) = 1$

Continuous Random Variables

A continuous random variable $X$ has:

  • Probability density function (PDF): $f_X(x)$

  • Properties: $f_X(x) \geq 0$ and $\int_{-\infty}^{\infty} f_X(x)\, dx = 1$

  • $P(a \leq X \leq b) = \int_a^b f_X(x)\, dx$

Cumulative Distribution Function (CDF)

For any random variable:

$$F_X(x) = P(X \leq x)$$

Properties:

  • $F_X(-\infty) = 0$, $F_X(\infty) = 1$

  • Non-decreasing

  • Right-continuous
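
A short sketch of the PMF/CDF relationship (using `scipy.stats`, which is assumed to be available) for a Binomial(10, 0.3) variable: the PMF sums to 1 and the CDF is the running sum of the PMF.

```python
import numpy as np
from scipy import stats

X = stats.binom(n=10, p=0.3)       # a discrete random variable
x = np.arange(0, 11)

pmf = X.pmf(x)                     # p_X(x) = P(X = x)
cdf = X.cdf(x)                     # F_X(x) = P(X <= x)

print(pmf.sum())                                 # 1.0 (normalization)
print(np.allclose(cdf, np.cumsum(pmf)))          # True: CDF is the cumulative PMF
print(X.cdf(4) - X.cdf(1), pmf[2:5].sum())       # P(2 <= X <= 4), computed two ways
```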


B.5 Expectation

Definition

Discrete: $E[X] = \sum_x x \cdot p_X(x)$

Continuous: $E[X] = \int_{-\infty}^{\infty} x \cdot f_X(x)\, dx$

Properties

  1. Linearity: $E[aX + bY] = aE[X] + bE[Y]$

  2. Function of a random variable: $E[g(X)] = \sum_x g(x) p_X(x)$ or $\int g(x) f_X(x)\, dx$

  3. Independence: If $X \perp Y$, then $E[XY] = E[X] \cdot E[Y]$

Conditional Expectation

$$E[Y \mid X = x] = \sum_y y \cdot P(Y = y \mid X = x)$$

Law of Iterated Expectations (LIE):

$$E[Y] = E[E[Y \mid X]]$$

Intuition: Average over conditional means equals unconditional mean.
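
A Monte Carlo sketch of the law of iterated expectations (the simulation and its model are chosen for this appendix): draw $X \sim \text{Uniform}(0,1)$ and $Y \mid X \sim N(2X, 1)$, so $E[Y \mid X] = 2X$ and hence $E[Y] = E[2X] = 1$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

x = rng.uniform(0, 1, size=n)           # X ~ Uniform(0, 1)
y = rng.normal(loc=2 * x, scale=1.0)    # Y | X = x ~ N(2x, 1), so E[Y | X] = 2X

print(y.mean())            # ~1.0: direct estimate of E[Y]
print((2 * x).mean())      # ~1.0: estimate of E[E[Y | X]]
```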


B.6 Variance and Covariance

Variance

$$\text{Var}(X) = E[(X - E[X])^2] = E[X^2] - (E[X])^2$$

Properties:

  • $\text{Var}(X) \geq 0$

  • $\text{Var}(aX + b) = a^2 \text{Var}(X)$

  • If $X \perp Y$: $\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)$

Standard deviation: $\text{SD}(X) = \sqrt{\text{Var}(X)}$

Covariance

$$\text{Cov}(X, Y) = E[(X - E[X])(Y - E[Y])] = E[XY] - E[X]E[Y]$$

Properties:

  • $\text{Cov}(X, X) = \text{Var}(X)$

  • $\text{Cov}(X, Y) = \text{Cov}(Y, X)$

  • $\text{Cov}(aX, bY) = ab \cdot \text{Cov}(X, Y)$

  • If $X \perp Y$: $\text{Cov}(X, Y) = 0$ (but not conversely!)

Correlation

$$\text{Corr}(X, Y) = \frac{\text{Cov}(X, Y)}{\text{SD}(X) \cdot \text{SD}(Y)}$$

  • Always between $-1$ and $1$

  • $|\text{Corr}(X, Y)| = 1$ iff $Y = aX + b$ for some constants $a \neq 0$ and $b$

Variance of a Sum

$$\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) + 2\,\text{Cov}(X, Y)$$
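
A numerical check of the variance-of-a-sum identity on simulated correlated data (a sketch; the data-generating process here is arbitrary). With matching normalizations, the identity holds exactly for sample quantities as well.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500_000

x = rng.normal(size=n)
y = 0.6 * x + rng.normal(size=n)     # correlated with x by construction

var_sum = np.var(x + y)
identity = np.var(x) + np.var(y) + 2 * np.cov(x, y, ddof=0)[0, 1]

print(var_sum, identity)             # equal up to floating-point error
print(np.corrcoef(x, y)[0, 1])       # sample correlation, always in [-1, 1]
```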


B.7 Common Distributions

Discrete Distributions

| Distribution | PMF | Mean | Variance |
| --- | --- | --- | --- |
| Bernoulli($p$) | $p^x(1-p)^{1-x}$, $x \in \{0,1\}$ | $p$ | $p(1-p)$ |
| Binomial($n, p$) | $\binom{n}{x}p^x(1-p)^{n-x}$ | $np$ | $np(1-p)$ |
| Poisson($\lambda$) | $\frac{\lambda^x e^{-\lambda}}{x!}$ | $\lambda$ | $\lambda$ |
| Geometric($p$) | $(1-p)^{x-1}p$ | $1/p$ | $(1-p)/p^2$ |

Continuous Distributions

| Distribution | PDF | Mean | Variance |
| --- | --- | --- | --- |
| Uniform($a, b$) | $\frac{1}{b-a}$, $x \in [a,b]$ | $\frac{a+b}{2}$ | $\frac{(b-a)^2}{12}$ |
| Exponential($\lambda$) | $\lambda e^{-\lambda x}$, $x \geq 0$ | $1/\lambda$ | $1/\lambda^2$ |
| Normal($\mu, \sigma^2$) | $\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}$ | $\mu$ | $\sigma^2$ |
| Chi-squared($k$) | [complex] | $k$ | $2k$ |
| Student's $t(k)$ | [complex] | $0$ (if $k>1$) | $\frac{k}{k-2}$ (if $k>2$) |

The Normal Distribution

The normal (Gaussian) distribution is central to statistics:

$$X \sim N(\mu, \sigma^2) \implies f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

Standard normal: $Z \sim N(0, 1)$

Standardization: If $X \sim N(\mu, \sigma^2)$, then $Z = \frac{X - \mu}{\sigma} \sim N(0, 1)$

Linear combinations: If $X \sim N(\mu_X, \sigma_X^2)$ and $Y \sim N(\mu_Y, \sigma_Y^2)$ are independent, then $aX + bY \sim N(a\mu_X + b\mu_Y,\; a^2\sigma_X^2 + b^2\sigma_Y^2)$.
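
A simulation sketch (parameters chosen arbitrarily for this appendix) checking standardization and the linear-combination rule for independent normals:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1_000_000

x = rng.normal(5.0, 2.0, size=n)     # X ~ N(5, 4)
y = rng.normal(-1.0, 3.0, size=n)    # Y ~ N(-1, 9), independent of X

z = (x - 5.0) / 2.0                  # standardization
print(z.mean(), z.std())             # ~0, ~1

w = 2 * x + 3 * y                    # aX + bY with a = 2, b = 3
print(w.mean(), w.var())             # ~7 = 2*5 + 3*(-1),  ~97 = 4*4 + 9*9
```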


B.8 Sampling and the Central Limit Theorem

Random Sample

A random sample $X_1, X_2, \ldots, X_n$ consists of independent and identically distributed (i.i.d.) draws from some distribution.

Sample Mean

$$\bar{X} = \frac{1}{n}\sum_{i=1}^n X_i$$

Properties (if $E[X_i] = \mu$ and $\text{Var}(X_i) = \sigma^2$):

  • $E[\bar{X}] = \mu$ (unbiased)

  • $\text{Var}(\bar{X}) = \sigma^2/n$

  • $\text{SE}(\bar{X}) = \sigma/\sqrt{n}$

Law of Large Numbers (LLN)

As $n \to \infty$:

$$\bar{X}_n \xrightarrow{p} \mu$$

Interpretation: the sample mean converges in probability to the population mean.

Central Limit Theorem (CLT)

For i.i.d. $X_i$ with mean $\mu$ and finite variance $\sigma^2$:

$$\frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \xrightarrow{d} N(0, 1) \quad \text{as } n \to \infty$$

Equivalently: $\sqrt{n}(\bar{X}_n - \mu) \xrightarrow{d} N(0, \sigma^2)$

Interpretation: Sample means are approximately normal for large nn, regardless of the population distribution.
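
A simulation sketch of the CLT with a markedly non-normal population, Exponential(1), so $\mu = \sigma = 1$: standardized sample means of size $n = 100$ are close to standard normal.

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps = 100, 100_000

samples = rng.exponential(scale=1.0, size=(reps, n))    # Exponential(1): mu = sigma = 1
z = (samples.mean(axis=1) - 1.0) / (1.0 / np.sqrt(n))   # standardized sample means

print(z.mean(), z.std())     # ~0, ~1
print(np.mean(z <= 1.96))    # ~0.975, close to Phi(1.96)
```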


B.9 Joint Distributions

Joint PMF/PDF

Discrete: $p_{X,Y}(x, y) = P(X = x, Y = y)$

Continuous: $f_{X,Y}(x, y)$, where $P((X, Y) \in A) = \iint_A f_{X,Y}(x,y)\, dx\, dy$

Marginal Distributions

$$f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\, dy$$

Conditional Distributions

$$f_{Y|X}(y \mid x) = \frac{f_{X,Y}(x, y)}{f_X(x)}, \quad \text{provided } f_X(x) > 0$$

Independence

$X$ and $Y$ are independent iff $f_{X,Y}(x, y) = f_X(x) \cdot f_Y(y)$ for all $x, y$.
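
For a discrete joint distribution stored as a table, marginals are row and column sums, and independence can be checked by comparing the joint to the outer product of the marginals. A sketch with a made-up 2×3 table:

```python
import numpy as np

# Hypothetical joint PMF p_{X,Y}(x, y) on a 2 x 3 grid (rows = x, columns = y).
joint = np.array([[0.10, 0.20, 0.10],
                  [0.15, 0.30, 0.15]])
assert np.isclose(joint.sum(), 1.0)

p_x = joint.sum(axis=1)          # marginal of X (sum over y)
p_y = joint.sum(axis=0)          # marginal of Y (sum over x)

# Independence iff the joint equals the outer product of the marginals.
print(np.allclose(joint, np.outer(p_x, p_y)))   # True for this table

# Conditional PMF of Y given X = 0: joint row divided by the marginal.
print(joint[0] / p_x[0])
```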


B.10 Transformations of Random Variables

Univariate

If $Y = g(X)$ and $g$ is strictly monotonic with differentiable inverse $g^{-1}$:

$$f_Y(y) = f_X(g^{-1}(y)) \left| \frac{d}{dy} g^{-1}(y) \right|$$
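
A numeric sketch of this formula: with $X \sim N(0,1)$ and $Y = e^X$ (a strictly increasing map), the formula gives $f_Y(y) = f_X(\ln y)/y$, which matches the standard lognormal density; the comparison against `scipy.stats.lognorm` is only a check.

```python
import numpy as np
from scipy import stats

y = np.linspace(0.1, 5.0, 50)

# Change of variables: Y = exp(X), X ~ N(0,1), g^{-1}(y) = log(y), d/dy g^{-1}(y) = 1/y
f_y = stats.norm.pdf(np.log(y)) / y

# Reference: the lognormal density with shape parameter s = 1
print(np.allclose(f_y, stats.lognorm.pdf(y, s=1)))   # True
```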

Linear Transformation

If $Y = aX + b$:

  • $E[Y] = aE[X] + b$

  • $\text{Var}(Y) = a^2 \text{Var}(X)$


B.11 Moment Generating Functions

Definition

$$M_X(t) = E[e^{tX}]$$

Properties

  1. If $M_X(t)$ exists in a neighborhood of $0$, it uniquely determines the distribution

  2. $E[X^k] = M_X^{(k)}(0)$ (the $k$-th derivative at $0$)

  3. If $X \perp Y$: $M_{X+Y}(t) = M_X(t) \cdot M_Y(t)$

Common MGFs

| Distribution | MGF |
| --- | --- |
| Bernoulli($p$) | $(1-p) + pe^t$ |
| Normal($\mu, \sigma^2$) | $\exp(\mu t + \sigma^2 t^2/2)$ |
| Poisson($\lambda$) | $\exp(\lambda(e^t - 1))$ |
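
A symbolic sketch (assuming `sympy` is available) recovering moments of the Poisson($\lambda$) distribution by differentiating its MGF at $t = 0$: the first derivative gives $E[X] = \lambda$, the second gives $E[X^2] = \lambda + \lambda^2$, so $\text{Var}(X) = \lambda$.

```python
import sympy as sp

t, lam = sp.symbols("t lambda", positive=True)
M = sp.exp(lam * (sp.exp(t) - 1))        # MGF of Poisson(lambda)

EX = sp.diff(M, t, 1).subs(t, 0)         # E[X]   = lambda
EX2 = sp.diff(M, t, 2).subs(t, 0)        # E[X^2] = lambda**2 + lambda

print(EX, sp.expand(EX2), sp.simplify(EX2 - EX**2))   # lambda, lambda**2 + lambda, lambda
```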


B.12 Inequalities

Markov's Inequality

For $X \geq 0$ and $a > 0$: $P(X \geq a) \leq \frac{E[X]}{a}$

Chebyshev's Inequality

For any $k > 0$, where $\mu = E[X]$ and $\sigma = \text{SD}(X)$ is finite: $P(|X - \mu| \geq k\sigma) \leq \frac{1}{k^2}$

Cauchy-Schwarz Inequality

$$|E[XY]|^2 \leq E[X^2] \cdot E[Y^2]$$

Implies: $|\text{Corr}(X, Y)| \leq 1$

Jensen's Inequality

For convex $g$: $g(E[X]) \leq E[g(X)]$

For concave $g$, the inequality reverses.
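
An empirical sketch of these inequalities on an Exponential(1) sample ($\mu = \sigma = 1$); the bounds are loose here, but they hold.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.exponential(scale=1.0, size=1_000_000)    # X >= 0, E[X] = 1, SD(X) = 1

a, k = 3.0, 2.0
print(np.mean(x >= a), x.mean() / a)              # Markov: P(X >= 3) <= E[X]/3
print(np.mean(np.abs(x - 1.0) >= k), 1 / k**2)    # Chebyshev: P(|X - mu| >= 2*sigma) <= 1/4

# Jensen: for convex g(x) = x**2, g(E[X]) <= E[g(X)]
print(x.mean() ** 2, np.mean(x ** 2))             # ~1 <= ~2
```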


B.13 Convergence Concepts

Types of Convergence

  1. Almost sure convergence: $X_n \xrightarrow{a.s.} X$ if $P(\lim_{n \to \infty} X_n = X) = 1$

  2. Convergence in probability: $X_n \xrightarrow{p} X$ if for all $\epsilon > 0$: $\lim_{n \to \infty} P(|X_n - X| > \epsilon) = 0$

  3. Convergence in distribution: $X_n \xrightarrow{d} X$ if $\lim_{n \to \infty} F_{X_n}(x) = F_X(x)$ at all continuity points of $F_X$

Relationships

$$\text{almost surely} \implies \text{in probability} \implies \text{in distribution}$$

(Implications don't reverse in general)

Slutsky's Theorem

If $X_n \xrightarrow{d} X$ and $Y_n \xrightarrow{p} c$ (a constant):

  • $X_n + Y_n \xrightarrow{d} X + c$

  • $X_n Y_n \xrightarrow{d} cX$

  • $X_n / Y_n \xrightarrow{d} X/c$ (if $c \neq 0$)

Continuous Mapping Theorem

If $X_n \xrightarrow{d} X$ and $g$ is continuous: $g(X_n) \xrightarrow{d} g(X)$

Delta Method

If $\sqrt{n}(X_n - \theta) \xrightarrow{d} N(0, \sigma^2)$ and $g$ is differentiable at $\theta$ with $g'(\theta) \neq 0$: $\sqrt{n}(g(X_n) - g(\theta)) \xrightarrow{d} N(0, [g'(\theta)]^2 \sigma^2)$
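
A simulation sketch of the delta method (the setup is chosen for this appendix) with $g(x) = x^2$ applied to the mean of Exponential(1) samples, so $\theta = 1$ and $\sigma^2 = 1$: the limiting variance should be $[g'(1)]^2 \sigma^2 = 4$.

```python
import numpy as np

rng = np.random.default_rng(5)
n, reps = 500, 20_000

xbar = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)   # theta = 1, sigma^2 = 1
stat = np.sqrt(n) * (xbar ** 2 - 1.0)                            # sqrt(n) * (g(Xbar) - g(theta))

print(stat.var())     # ~4 = [g'(1)]^2 * sigma^2, with g(x) = x**2
print(stat.mean())    # ~0 (small finite-sample bias)
```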


Further Reading

  • DeGroot, M. H., & Schervish, M. J. (2012). Probability and Statistics (4th ed.). Pearson.

  • Casella, G., & Berger, R. L. (2002). Statistical Inference (2nd ed.). Cengage.

  • Wasserman, L. (2004). All of Statistics. Springer.
