This appendix reviews probability concepts used throughout the book. For a comprehensive treatment, consult a probability textbook such as DeGroot & Schervish (2012) or Casella & Berger (2002).
B.1 Probability Fundamentals
Sample Space and Events
The sample space $\Omega$ is the set of all possible outcomes. An event is a subset of the sample space.
Example: Flipping a coin twice:
- Sample space: $\Omega = \{HH, HT, TH, TT\}$
- Event "at least one head": $A = \{HH, HT, TH\}$
Probability Axioms (Kolmogorov)
For any event $A$:
1. $P(A) \geq 0$ (non-negativity)
2. $P(\Omega) = 1$ (normalization)
3. If $A_1, A_2, \ldots$ are mutually exclusive: $P(\cup_i A_i) = \sum_i P(A_i)$ (countable additivity)

Useful consequences:
- $P(\emptyset) = 0$
- $P(A^c) = 1 - P(A)$
- If $A \subset B$, then $P(A) \leq P(B)$
- $P(A \cup B) = P(A) + P(B) - P(A \cap B)$ (inclusion-exclusion)
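For a finite sample space, these rules can be verified by direct enumeration. A small sketch using the two-coin-flip example (the events here are illustrative choices):

```python
from itertools import product

# Sample space for two coin flips; each of the 4 outcomes is equally likely.
omega = list(product("HT", repeat=2))
p = lambda event: len(event) / len(omega)

A = {o for o in omega if "H" in o}      # at least one head
B = {o for o in omega if o[0] == "H"}   # first flip is a head

# Inclusion-exclusion: P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
lhs = p(A | B)
rhs = p(A) + p(B) - p(A & B)
print(lhs, rhs)  # 0.75 0.75
```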
B.2 Conditional Probability
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \quad \text{provided } P(B) > 0$$

Intuition: the probability of $A$, restricting attention to outcomes where $B$ occurred.
Multiplication Rule
$$P(A \cap B) = P(A \mid B) \cdot P(B) = P(B \mid A) \cdot P(A)$$
Law of Total Probability
If $B_1, B_2, \ldots, B_k$ partition $\Omega$:
$$P(A) = \sum_{j=1}^k P(A \mid B_j) P(B_j)$$
Bayes' Theorem
$$P(B \mid A) = \frac{P(A \mid B) P(B)}{P(A)} = \frac{P(A \mid B) P(B)}{\sum_j P(A \mid B_j) P(B_j)}$$

Intuition: updates the prior belief $P(B)$ to the posterior $P(B \mid A)$ based on evidence $A$.
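A quick numerical illustration of Bayes' theorem in the classic screening-test setting (all numbers below are hypothetical):

```python
# Hypothetical screening test: 1% prevalence, 95% sensitivity, 5% false-positive rate.
p_d = 0.01        # P(D): prior probability of disease
p_pos_d = 0.95    # P(+ | D): sensitivity
p_pos_nd = 0.05   # P(+ | not D): false-positive rate

# Law of total probability gives P(+); Bayes' theorem gives the posterior P(D | +).
p_pos = p_pos_d * p_d + p_pos_nd * (1 - p_d)
p_d_pos = p_pos_d * p_d / p_pos
print(round(p_d_pos, 3))  # 0.161 -- a positive test still leaves P(D | +) well below 1/2
```

The low posterior despite a seemingly accurate test is the standard base-rate effect: the 1% prior dominates.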
B.3 Independence
Events $A$ and $B$ are independent if:
$$P(A \cap B) = P(A) \cdot P(B)$$

Equivalently (when $P(B) > 0$): $P(A \mid B) = P(A)$ (knowing $B$ doesn't change the probability of $A$).
Conditional Independence
$A$ and $B$ are conditionally independent given $C$ if:
$$P(A \cap B \mid C) = P(A \mid C) \cdot P(B \mid C)$$

Notation: $A \perp B \mid C$

Caution: independence does not imply conditional independence, and vice versa.
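Independence can be checked exactly by enumeration on a finite sample space. A sketch with two fair dice (the two events are illustrative choices):

```python
from itertools import product
from fractions import Fraction

# Two fair dice, enumerated exactly with Fractions to avoid floating-point error.
omega = list(product(range(1, 7), repeat=2))
p = lambda ev: Fraction(len(ev), len(omega))

A = {o for o in omega if o[0] % 2 == 0}   # first die is even: P(A) = 1/2
B = {o for o in omega if sum(o) == 7}     # sum equals 7:      P(B) = 1/6

# P(A ∩ B) = 3/36 = 1/12 = P(A) * P(B), so A and B are independent.
print(p(A & B) == p(A) * p(B))  # True
```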
B.4 Random Variables
Discrete Random Variables
A discrete random variable $X$ takes countably many values. It is characterized by its:

Probability mass function (PMF): $p_X(x) = P(X = x)$

Properties: $p_X(x) \geq 0$ and $\sum_x p_X(x) = 1$
Continuous Random Variables
A continuous random variable $X$ has a:

Probability density function (PDF): $f_X(x)$

Properties: $f_X(x) \geq 0$ and $\int_{-\infty}^{\infty} f_X(x)\, dx = 1$

$$P(a \leq X \leq b) = \int_a^b f_X(x)\, dx$$
Cumulative Distribution Function (CDF)
For any random variable:
$$F_X(x) = P(X \leq x)$$

Properties:
- $F_X(-\infty) = 0$, $F_X(\infty) = 1$
- $F_X$ is non-decreasing and right-continuous
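The PDF-CDF relationship can be verified numerically by integrating a density. A sketch using the Exponential(2) distribution (the rate and evaluation point are arbitrary choices):

```python
import math

# Exponential(2) density f(x) = 2 e^{-2x} on [0, ∞); closed-form CDF is 1 - e^{-2x}.
lam = 2.0
f = lambda x: lam * math.exp(-lam * x)

def cdf_numeric(x, steps=100_000):
    """Trapezoidal-rule approximation of the integral of f over [0, x]."""
    h = x / steps
    total = 0.5 * (f(0.0) + f(x))
    for i in range(1, steps):
        total += f(i * h)
    return total * h

# The numerical integral of the PDF matches the closed-form CDF.
print(round(cdf_numeric(1.5), 4), round(1 - math.exp(-lam * 1.5), 4))  # 0.9502 0.9502
```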
B.5 Expectation
Discrete: $E[X] = \sum_x x \cdot p_X(x)$

Continuous: $E[X] = \int_{-\infty}^{\infty} x \cdot f_X(x)\, dx$

Linearity: $E[aX + bY] = aE[X] + bE[Y]$

Function of a RV: $E[g(X)] = \sum_x g(x) p_X(x)$ or $\int g(x) f_X(x)\, dx$

Independence: if $X \perp Y$, then $E[XY] = E[X] \cdot E[Y]$
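Both linearity and the product rule for independent variables can be checked by Monte Carlo; a minimal sketch with independent Uniform(0, 1) draws:

```python
import random

random.seed(0)
n = 200_000

# X, Y ~ Uniform(0, 1), drawn independently, so E[X] = E[Y] = 1/2.
xs = [random.random() for _ in range(n)]
ys = [random.random() for _ in range(n)]
mean = lambda v: sum(v) / len(v)

# Linearity: E[2X + 3Y] = 2 E[X] + 3 E[Y] = 2.5
lin = mean([2 * x + 3 * y for x, y in zip(xs, ys)])

# Independence: E[XY] = E[X] E[Y] = 0.25
prod = mean([x * y for x, y in zip(xs, ys)])
print(lin, prod)  # both close to 2.5 and 0.25
```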
Conditional Expectation
$$E[Y \mid X = x] = \sum_y y \cdot P(Y = y \mid X = x)$$

Law of Iterated Expectations (LIE):
$$E[Y] = E[E[Y \mid X]]$$

Intuition: averaging the conditional means over the distribution of $X$ recovers the unconditional mean.
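The law of iterated expectations can be illustrated with a two-stage simulation (the mixture parameters below are arbitrary choices):

```python
import random

random.seed(1)
n = 200_000

# Two-stage draw: X ~ Bernoulli(0.3); given X, Y ~ Normal(2, 1) if X = 1, else Normal(0, 1).
# LIE predicts E[Y] = E[E[Y|X]] = 0.3 * 2 + 0.7 * 0 = 0.6.
ys = []
for _ in range(n):
    x = 1 if random.random() < 0.3 else 0
    ys.append(random.gauss(2.0 if x else 0.0, 1.0))

ey = sum(ys) / n
print(ey)  # close to 0.6
```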
B.6 Variance and Covariance
Var ( X ) = E [ ( X − E [ X ] ) 2 ] = E [ X 2 ] − ( E [ X ] ) 2 \text{Var}(X) = E[(X - E[X])^2] = E[X^2] - (E[X])^2 Var ( X ) = E [( X − E [ X ] ) 2 ] = E [ X 2 ] − ( E [ X ] ) 2
Properties :
Var ( X ) ≥ 0 \text{Var}(X) \geq 0 Var ( X ) ≥ 0
Var ( a X + b ) = a 2 Var ( X ) \text{Var}(aX + b) = a^2 \text{Var}(X) Var ( a X + b ) = a 2 Var ( X )
If X ⊥ Y X \perp Y X ⊥ Y : Var ( X + Y ) = Var ( X ) + Var ( Y ) \text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) Var ( X + Y ) = Var ( X ) + Var ( Y )
Standard deviation : SD ( X ) = Var ( X ) \text{SD}(X) = \sqrt{\text{Var}(X)} SD ( X ) = Var ( X )
Cov ( X , Y ) = E [ ( X − E [ X ] ) ( Y − E [ Y ] ) ] = E [ X Y ] − E [ X ] E [ Y ] \text{Cov}(X, Y) = E[(X - E[X])(Y - E[Y])] = E[XY] - E[X]E[Y] Cov ( X , Y ) = E [( X − E [ X ]) ( Y − E [ Y ])] = E [ X Y ] − E [ X ] E [ Y ]
Properties :
Cov ( X , X ) = Var ( X ) \text{Cov}(X, X) = \text{Var}(X) Cov ( X , X ) = Var ( X )
Cov ( X , Y ) = Cov ( Y , X ) \text{Cov}(X, Y) = \text{Cov}(Y, X) Cov ( X , Y ) = Cov ( Y , X )
Cov ( a X , b Y ) = a b ⋅ Cov ( X , Y ) \text{Cov}(aX, bY) = ab \cdot \text{Cov}(X, Y) Cov ( a X , bY ) = ab ⋅ Cov ( X , Y )
If X ⊥ Y X \perp Y X ⊥ Y : Cov ( X , Y ) = 0 \text{Cov}(X, Y) = 0 Cov ( X , Y ) = 0 (but not conversely!)
Correlation
$$\text{Corr}(X, Y) = \frac{\text{Cov}(X, Y)}{\text{SD}(X) \cdot \text{SD}(Y)}$$

- Always between $-1$ and $1$
- $|\text{Corr}(X, Y)| = 1$ iff $Y = aX + b$ for some constants $a, b$ with $a \neq 0$
Variance of a Sum
$$\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) + 2\,\text{Cov}(X, Y)$$
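The variance-of-a-sum identity holds exactly for sample moments (same divisor $n$ throughout), which makes it easy to verify on simulated correlated data:

```python
import random

random.seed(2)
n = 100_000

# Correlated pair: Y = X + noise, so Cov(X, Y) > 0 and the cross term matters.
xs = [random.gauss(0, 1) for _ in range(n)]
ys = [x + random.gauss(0, 1) for x in xs]

mean = lambda v: sum(v) / len(v)
def var(v):
    m = mean(v)
    return sum((vi - m) ** 2 for vi in v) / len(v)
def cov(u, v):
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)

lhs = var([x + y for x, y in zip(xs, ys)])
rhs = var(xs) + var(ys) + 2 * cov(xs, ys)
print(lhs, rhs)  # identical up to floating-point rounding
```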
B.7 Common Distributions
Discrete Distributions
| Distribution | PMF | Mean | Variance |
|---|---|---|---|
| Bernoulli($p$) | $p^x(1-p)^{1-x}$, $x \in \{0,1\}$ | $p$ | $p(1-p)$ |
| Binomial($n, p$) | $\binom{n}{x}p^x(1-p)^{n-x}$ | $np$ | $np(1-p)$ |
| Poisson($\lambda$) | $\frac{\lambda^x e^{-\lambda}}{x!}$ | $\lambda$ | $\lambda$ |
| Geometric($p$) | $(1-p)^{x-1}p$ | $1/p$ | $(1-p)/p^2$ |
Continuous Distributions
| Distribution | PDF | Mean | Variance |
|---|---|---|---|
| Uniform($a, b$) | $\frac{1}{b-a}$, $x \in [a,b]$ | $\frac{a+b}{2}$ | $\frac{(b-a)^2}{12}$ |
| Exponential($\lambda$) | $\lambda e^{-\lambda x}$, $x \geq 0$ | $1/\lambda$ | $1/\lambda^2$ |
| Normal($\mu, \sigma^2$) | $\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}$ | $\mu$ | $\sigma^2$ |
| Student's $t_k$ | $\frac{\Gamma(\frac{k+1}{2})}{\sqrt{k\pi}\,\Gamma(\frac{k}{2})}\left(1+\frac{x^2}{k}\right)^{-\frac{k+1}{2}}$ | $0$ (if $k > 1$) | $\frac{k}{k-2}$ (if $k > 2$) |
The Normal Distribution
The normal (Gaussian) distribution is central to statistics:
$$X \sim N(\mu, \sigma^2) \implies f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

Standard normal: $Z \sim N(0, 1)$

Standardization: if $X \sim N(\mu, \sigma^2)$, then $Z = \frac{X - \mu}{\sigma} \sim N(0, 1)$

Linear combinations: if $X \sim N(\mu_X, \sigma_X^2)$ and $Y \sim N(\mu_Y, \sigma_Y^2)$ are independent, then
$$aX + bY \sim N(a\mu_X + b\mu_Y,\; a^2\sigma_X^2 + b^2\sigma_Y^2)$$
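The linear-combination rule can be checked by simulation; a sketch with arbitrarily chosen parameters, where theory predicts $2X - Y \sim N(-1, 25)$:

```python
import random
import statistics

random.seed(3)
n = 200_000
a, b = 2.0, -1.0

# X ~ N(1, 4), Y ~ N(3, 9), independent.
# Theory: aX + bY ~ N(2*1 - 3, 4*4 + 1*9) = N(-1, 25).
zs = [a * random.gauss(1, 2) + b * random.gauss(3, 3) for _ in range(n)]

m, v = statistics.mean(zs), statistics.pvariance(zs)
print(m, v)  # close to -1 and 25
```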
B.8 Sampling and the Central Limit Theorem
A random sample $X_1, X_2, \ldots, X_n$ consists of independent and identically distributed (i.i.d.) draws from some distribution.
The sample mean is
$$\bar{X} = \frac{1}{n}\sum_{i=1}^n X_i$$

Properties (if $E[X_i] = \mu$, $\text{Var}(X_i) = \sigma^2$):
- $E[\bar{X}] = \mu$ (unbiased)
- $\text{Var}(\bar{X}) = \sigma^2/n$
- $\text{SE}(\bar{X}) = \sigma/\sqrt{n}$
Law of Large Numbers (LLN)
As $n \to \infty$:
$$\bar{X}_n \xrightarrow{p} \mu$$

Interpretation: the sample mean converges (in probability) to the population mean.
Central Limit Theorem (CLT)
For i.i.d. $X_i$ with mean $\mu$ and variance $\sigma^2$:
$$\frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \xrightarrow{d} N(0, 1) \quad \text{as } n \to \infty$$

Equivalently: $\sqrt{n}(\bar{X}_n - \mu) \xrightarrow{d} N(0, \sigma^2)$

Interpretation: sample means are approximately normal for large $n$, regardless of the population distribution.
B.9 Joint Distributions
Discrete: $p_{X,Y}(x, y) = P(X = x, Y = y)$

Continuous: $f_{X,Y}(x, y)$, where $P((X, Y) \in A) = \iint_A f_{X,Y}(x,y)\, dx\, dy$
Marginal Distributions
$$f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\, dy$$
Conditional Distributions
$$f_{Y|X}(y \mid x) = \frac{f_{X,Y}(x, y)}{f_X(x)}$$

$X$ and $Y$ are independent iff $f_{X,Y}(x, y) = f_X(x) \cdot f_Y(y)$ for all $x, y$.
B.10 Transformations of Random Variables

If $Y = g(X)$ and $g$ is monotonic with inverse $g^{-1}$:
$$f_Y(y) = f_X(g^{-1}(y)) \left| \frac{d}{dy} g^{-1}(y) \right|$$
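A concrete check of the change-of-variables formula: if $X \sim \text{Exponential}(1)$ and $Y = e^X$, then $g^{-1}(y) = \log y$ and the formula gives $f_Y(y) = e^{-\log y} \cdot \frac{1}{y} = y^{-2}$ for $y \geq 1$ (a Pareto density). A Monte Carlo probability should match the integral of that density:

```python
import math
import random

random.seed(6)
n = 200_000

# X ~ Exponential(1); the monotone transform Y = exp(X) has density y^{-2} on [1, ∞).
ys = [math.exp(random.expovariate(1.0)) for _ in range(n)]

# Closed form: P(1 <= Y <= 3) = integral of y^{-2} from 1 to 3 = 1 - 1/3 = 2/3.
mc = sum(1 <= y <= 3 for y in ys) / n
print(mc)  # close to 2/3
```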
If $Y = aX + b$ (a linear transformation):
- $E[Y] = aE[X] + b$
- $\text{Var}(Y) = a^2 \text{Var}(X)$
B.11 Moment Generating Functions
$$M_X(t) = E[e^{tX}]$$

Properties:
- If $M_X$ exists in a neighborhood of 0, it uniquely determines the distribution
- $E[X^k] = M_X^{(k)}(0)$ (the $k$-th derivative at 0)
- If $X \perp Y$: $M_{X+Y}(t) = M_X(t) \cdot M_Y(t)$
| Distribution | MGF $M_X(t)$ |
|---|---|
| Bernoulli($p$) | $(1-p) + pe^t$ |
| Normal($\mu, \sigma^2$) | $\exp(\mu t + \sigma^2 t^2/2)$ |
| Poisson($\lambda$) | $\exp(\lambda(e^t - 1))$ |
B.12 Inequalities
Markov's Inequality
For $X \geq 0$ and $a > 0$:
$$P(X \geq a) \leq \frac{E[X]}{a}$$
Chebyshev's Inequality
For any $k > 0$:
$$P(|X - \mu| \geq k\sigma) \leq \frac{1}{k^2}$$
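Chebyshev's bound is distribution-free and often loose; a simulation sketch with Exponential(1) draws ($\mu = \sigma = 1$), where the exact two-sigma tail probability is $e^{-3} \approx 0.05$, far below the bound of $1/4$:

```python
import random

random.seed(7)
n = 100_000

# Exponential(1): mu = 1, sigma = 1. Chebyshev with k = 2: P(|X - 1| >= 2) <= 1/4.
xs = [random.expovariate(1.0) for _ in range(n)]
tail = sum(abs(x - 1.0) >= 2.0 for x in xs) / n
print(tail, tail <= 0.25)  # empirical tail ≈ e^{-3} ≈ 0.05, bound holds
```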
Cauchy-Schwarz Inequality
$$|E[XY]|^2 \leq E[X^2] \cdot E[Y^2]$$

This implies $|\text{Corr}(X, Y)| \leq 1$.
Jensen's Inequality
For convex $g$: $g(E[X]) \leq E[g(X)]$

For concave $g$, the inequality reverses.
B.13 Convergence Concepts
Types of Convergence
Almost sure convergence: $X_n \xrightarrow{a.s.} X$ if $P(\lim_{n \to \infty} X_n = X) = 1$

Convergence in probability: $X_n \xrightarrow{p} X$ if for all $\epsilon > 0$: $\lim_{n \to \infty} P(|X_n - X| > \epsilon) = 0$

Convergence in distribution: $X_n \xrightarrow{d} X$ if $\lim_{n \to \infty} F_{X_n}(x) = F_X(x)$ at all continuity points of $F_X$

Relationships:
$$\text{a.s.} \implies \text{in probability} \implies \text{in distribution}$$
(The implications do not reverse in general.)
Slutsky's Theorem
If $X_n \xrightarrow{d} X$ and $Y_n \xrightarrow{p} c$ (a constant):
- $X_n + Y_n \xrightarrow{d} X + c$
- $X_n Y_n \xrightarrow{d} cX$
- $X_n / Y_n \xrightarrow{d} X/c$ (if $c \neq 0$)
Continuous Mapping Theorem
If $X_n \xrightarrow{d} X$ and $g$ is continuous: $g(X_n) \xrightarrow{d} g(X)$
Delta Method

If $\sqrt{n}(X_n - \theta) \xrightarrow{d} N(0, \sigma^2)$ and $g'(\theta) \neq 0$:
$$\sqrt{n}(g(X_n) - g(\theta)) \xrightarrow{d} N(0, [g'(\theta)]^2 \sigma^2)$$
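A simulation sketch of the delta method with arbitrarily chosen ingredients: $X_i \sim \text{Exponential}(1)$, so $\bar{X}_n$ satisfies $\sqrt{n}(\bar{X}_n - 1) \xrightarrow{d} N(0, 1)$; taking $g(x) = \log x$ with $g'(1) = 1$ predicts $\sqrt{n}\,\log(\bar{X}_n) \xrightarrow{d} N(0, 1)$, so its variance should be close to 1:

```python
import math
import random
import statistics

random.seed(8)
n, reps = 400, 4000

# X_i ~ Exponential(1): theta = 1, sigma^2 = 1; g(x) = log(x), g'(1) = 1.
vals = []
for _ in range(reps):
    xbar = statistics.mean(random.expovariate(1.0) for _ in range(n))
    vals.append(math.sqrt(n) * math.log(xbar))

v = statistics.pvariance(vals)
print(v)  # close to the predicted [g'(1)]^2 * sigma^2 = 1
```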
Further Reading
DeGroot, M. H., & Schervish, M. J. (2012). Probability and Statistics (4th ed.). Pearson.
Casella, G., & Berger, R. L. (2002). Statistical Inference (2nd ed.). Cengage.
Wasserman, L. (2004). All of Statistics. Springer.