Chapter 3: Statistical Foundations
Opening Question
How do we learn from data while accounting for the fundamental uncertainty that what we observe is just one realization of what could have happened?
Chapter Overview
Empirical work rests on statistical foundations. We observe data—a sample of what could have occurred—and want to learn about the underlying process that generated it. The challenge is that our sample is random: a different draw from the same process would give different numbers. Statistical inference provides the tools to reason from sample to population while acknowledging this uncertainty.
This chapter reviews the statistical foundations needed for the methods in this book. It's not a substitute for a full statistics or econometrics course, but rather a focused treatment of concepts that will recur: probability distributions, estimation, hypothesis testing, and regression. Readers with strong statistical training may skim this chapter; those newer to quantitative methods should work through it carefully.
What you will learn:
Probability concepts essential for statistical inference
How sampling distributions connect sample statistics to population parameters
Frequentist and Bayesian perspectives on inference
Properties of estimators and common estimation methods
The regression framework as a foundation for causal inference methods
Prerequisites: Basic algebra; comfort with summation notation
Historical Context: The Statistical Revolution
Modern statistical inference emerged in the early 20th century through the work of Karl Pearson, Ronald Fisher, Jerzy Neyman, and Egon Pearson. Their debates shaped the methods we use today.
Fisher (1890-1962) developed maximum likelihood estimation, analysis of variance, and the randomization approach to experiments. His significance testing framework—computing p-values to assess whether data are consistent with a null hypothesis—became the dominant paradigm.
Neyman and Pearson formalized hypothesis testing as a decision problem, introducing the concepts of Type I and Type II errors, power, and the Neyman-Pearson lemma for optimal tests.
The frequentist-Bayesian divide reflects philosophical disagreements about probability itself. Frequentists interpret probability as long-run frequency; Bayesians interpret it as degree of belief. Both approaches inform modern practice.
In economics, the Cowles Commission (1940s-50s) developed simultaneous equations methods and laid the foundation for structural econometrics. The "credibility revolution" (1990s-present) shifted emphasis toward research design and reduced-form identification, but statistical foundations remain essential.
3.1 Probability Essentials
Random Variables
A random variable X is a function that assigns numbers to outcomes of a random process. We distinguish:
Discrete random variables: Take countable values (integers, categories).
Continuous random variables: Take uncountable values (any real number in an interval).
Probability Distributions
The distribution of X tells us how probability is spread across possible values.
For discrete X: Probability mass function P(X=x)=f(x)
For continuous X: Probability density function $f(x)$, where $P(a < X < b) = \int_a^b f(x)\,dx$
Cumulative distribution function (both cases): F(x)=P(X≤x)
Key Distributions
Normal (Gaussian): The bell curve, $X \sim N(\mu, \sigma^2)$
$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$
Why it matters: The Central Limit Theorem makes the normal distribution the limiting distribution for many sample statistics.
Chi-squared: $\chi^2_k$ with $k$ degrees of freedom
Sum of $k$ squared independent standard normals
Used in variance estimation and goodness-of-fit tests
t-distribution: $t_k$ with $k$ degrees of freedom
Heavier tails than normal
Arises in small-sample inference when variance is estimated
F-distribution: Ratio of two independent chi-squared variables, each divided by its degrees of freedom
Used in joint hypothesis tests, ANOVA
Expectations and Moments
Expected value: $E[X] = \sum_x x \cdot f(x)$ (discrete) or $\int x \cdot f(x)\,dx$ (continuous)
The probability-weighted average of possible values
Variance: $Var(X) = E[(X - E[X])^2] = E[X^2] - (E[X])^2$
Measures spread around the mean
Standard deviation: $SD(X) = \sqrt{Var(X)}$
Covariance: $Cov(X, Y) = E[(X - E[X])(Y - E[Y])]$
Measures linear association between two variables
Correlation: $\rho_{XY} = \frac{Cov(X, Y)}{SD(X) \cdot SD(Y)}$
Standardized covariance; bounded between −1 and 1
Conditional Probability and Bayes' Rule
Conditional probability: $P(A \mid B) = \frac{P(A \cap B)}{P(B)}$
Probability of A given that B occurred
Bayes' rule: $P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)}$
This connects:
P(A∣B): Posterior (what we want)
P(B∣A): Likelihood (what we can compute)
P(A): Prior (what we believed before seeing data)
Worked Example: Testing for Disease
A disease affects 1% of the population. A test is 95% accurate (95% sensitivity and specificity).
Question: If someone tests positive, what's the probability they have the disease?
Setup:
P(Disease)=0.01 (prevalence)
P(Positive∣Disease)=0.95 (sensitivity)
P(Negative∣NoDisease)=0.95 (specificity)
By Bayes' rule: $P(\text{Disease} \mid \text{Positive}) = \frac{P(\text{Positive} \mid \text{Disease}) \cdot P(\text{Disease})}{P(\text{Positive})}$
Where: $P(\text{Positive}) = P(\text{Positive} \mid \text{Disease}) \cdot P(\text{Disease}) + P(\text{Positive} \mid \text{No Disease}) \cdot P(\text{No Disease}) = 0.95 \times 0.01 + 0.05 \times 0.99 = 0.0095 + 0.0495 = 0.059$
Therefore: $P(\text{Disease} \mid \text{Positive}) = \frac{0.95 \times 0.01}{0.059} = \frac{0.0095}{0.059} \approx 0.16$
Interpretation: Despite the "95% accurate" test, only 16% of positive results are true positives! The low base rate means false positives outnumber true positives. This illustrates why base rates matter—a lesson relevant for any diagnostic or predictive exercise.
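The arithmetic can be checked in a few lines. A minimal sketch in R, plugging in the numbers from this example:

```r
# Base-rate example: P(Disease | Positive) via Bayes' rule
prevalence  <- 0.01   # P(Disease)
sensitivity <- 0.95   # P(Positive | Disease)
specificity <- 0.95   # P(Negative | No Disease)

# Total probability of a positive test
p_positive <- sensitivity * prevalence + (1 - specificity) * (1 - prevalence)

# Posterior probability of disease given a positive test
posterior <- sensitivity * prevalence / p_positive
posterior   # approximately 0.16
```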
3.2 Sampling Distributions
From Sample to Population
We observe a sample $(X_1, X_2, ..., X_n)$ from a population with distribution $F$. We want to learn about population parameters: the mean $\mu$, variance $\sigma^2$, or other quantities.
Sample statistics summarize the sample:
Sample mean: $\bar{X} = \frac{1}{n}\sum_{i=1}^n X_i$
Sample variance: $S^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X})^2$
These statistics are themselves random variables—different samples give different values.
The Sampling Distribution
Definition 3.1 (Sampling Distribution): The sampling distribution of a statistic is the distribution of values the statistic would take across all possible samples of size n from the population.
Example: If we repeatedly draw samples of 100 people and compute the sample mean income each time, the distribution of those means is the sampling distribution of Xˉ.
Standard Error
The standard error is the standard deviation of a sampling distribution:
$SE(\bar{X}) = \frac{\sigma}{\sqrt{n}}$
Standard error decreases with sample size: larger samples give more precise estimates.
The Central Limit Theorem
Theorem 3.1 (Central Limit Theorem): For iid random variables $X_1, ..., X_n$ with mean $\mu$ and finite variance $\sigma^2$, as $n \to \infty$: $\frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \xrightarrow{d} N(0, 1)$
Translation: Regardless of the population distribution, the sample mean is approximately normal for large samples.
Why it matters: This justifies using normal-based inference even when the underlying data aren't normal. It's the foundation of most hypothesis testing and confidence intervals.
Visualization: The CLT in Action

Figure 3.1 illustrates the CLT in action. The exponential distribution is highly right-skewed—most values are small, but occasional large values pull the mean above the median. With n=1, the sampling distribution mirrors this skewness exactly. By n=10, the distribution is already more symmetric. At n=50 and n=100, the sampling distributions are nearly indistinguishable from normal, despite the skewed population. The red curve shows the theoretical normal approximation, which fits increasingly well as n grows.
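A simulation in the spirit of Figure 3.1 is easy to reproduce. A minimal sketch in R (the number of replications and the exponential rate are illustrative choices, not taken from the figure):

```r
# CLT demonstration: sampling distribution of the mean of exponential data
set.seed(123)
reps <- 5000                       # simulated samples per sample size
for (n in c(1, 10, 50, 100)) {
  means <- replicate(reps, mean(rexp(n, rate = 1)))   # population mean = 1, sd = 1
  hist(means, breaks = 50, freq = FALSE,
       main = paste("Sampling distribution of the mean, n =", n))
  curve(dnorm(x, mean = 1, sd = 1 / sqrt(n)), add = TRUE, col = "red")  # CLT approximation
}
```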
3.3 Estimation
Point Estimation
A point estimator θ^ is a rule for computing an estimate of a parameter θ from data.
Desirable properties:
Unbiasedness: E[θ^]=θ
On average, the estimator hits the target
Bias = E[θ^]−θ
Bias captures systematic error—the extent to which an estimator is "off target" on average across repeated samples. An unbiased estimator will sometimes overestimate and sometimes underestimate, but these errors cancel out in expectation. A biased estimator, by contrast, tends to err in one direction. For example, using n rather than n−1 in the denominator of sample variance produces a downward bias—the estimator systematically underestimates the true variance.
The distinction between bias and variance matters. An estimator can be unbiased but highly variable (scattering wildly around the truth) or biased but precise (consistently off by the same amount). In some contexts, we accept small bias in exchange for lower variance—this is the bias-variance tradeoff that recurs in machine learning (Chapter 22).
Consistency: $\hat{\theta} \xrightarrow{p} \theta$ as $n \to \infty$
The estimator converges to the truth as sample size grows
Efficiency: Among unbiased estimators, has smallest variance
The Cramér-Rao lower bound gives the minimum achievable variance
Method of Moments (MOM)
Idea: Equate sample moments to population moments; solve for parameters.
Example: Estimating mean and variance
Population moments: $E[X] = \mu$, $E[X^2] = \sigma^2 + \mu^2$
Sample moments: $\bar{X}$, $\frac{1}{n}\sum X_i^2$
Solve: $\hat{\mu} = \bar{X}$, $\hat{\sigma}^2 = \frac{1}{n}\sum X_i^2 - \bar{X}^2$
Advantages: Simple, always applicable
Disadvantages: May not be efficient; may not use all information in the data
Maximum Likelihood Estimation (MLE)
Idea: Choose parameter values that make the observed data most probable.
Likelihood function: $L(\theta) = \prod_{i=1}^n f(X_i; \theta)$
Log-likelihood: $\ell(\theta) = \sum_{i=1}^n \log f(X_i; \theta)$
MLE: $\hat{\theta}_{MLE} = \arg\max_\theta \ell(\theta)$
Example: Normal distribution with known variance
For $X_1, ..., X_n \sim N(\mu, \sigma^2)$ with $\sigma^2$ known: $\ell(\mu) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (X_i - \mu)^2$
Taking the derivative and setting it to zero: $\hat{\mu}_{MLE} = \bar{X}$
Properties of MLE:
Consistent: $\hat{\theta}_{MLE} \xrightarrow{p} \theta_0$
Asymptotically normal: $\sqrt{n}(\hat{\theta}_{MLE} - \theta_0) \xrightarrow{d} N(0, I(\theta_0)^{-1})$
Asymptotically efficient: Achieves Cramér-Rao bound
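To make the normal-mean example concrete, the sketch below maximizes the log-likelihood numerically and confirms that the estimate coincides with the sample mean. The simulated data and optimizer settings are illustrative.

```r
# MLE for the mean of a normal distribution with known variance
set.seed(42)
x      <- rnorm(200, mean = 3, sd = 2)   # simulated data; sigma^2 = 4 treated as known
sigma2 <- 4

# Negative log-likelihood as a function of mu
negloglik <- function(mu) {
  0.5 * length(x) * log(2 * pi * sigma2) + sum((x - mu)^2) / (2 * sigma2)
}

fit <- optim(par = 0, fn = negloglik, method = "BFGS")
c(mle = fit$par, sample_mean = mean(x))   # the two agree
```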
Generalized Method of Moments (GMM)
Idea: Generalize MOM to use any moment conditions, possibly more conditions than parameters.
Setup: Moment conditions $E[g(X_i, \theta_0)] = 0$
GMM estimator: Minimize weighted distance from sample moments to zero:
$\hat{\theta}_{GMM} = \arg\min_\theta \left[\frac{1}{n}\sum_{i=1}^n g(X_i, \theta)\right]' W \left[\frac{1}{n}\sum_{i=1}^n g(X_i, \theta)\right]$
where W is a positive definite weighting matrix. The key insight is that if we have more moment conditions than parameters (over-identification), we cannot satisfy all conditions exactly—so we minimize a weighted sum of squared violations.
Why it matters: GMM is the foundation for IV estimation (Chapter 12) and many structural approaches. When you estimate IV using two-stage least squares, you're applying a specific case of GMM.
Box: GMM Generalizes IV—The Moment Condition Connection
Instrumental variables estimation is a special case of GMM. Understanding this connection clarifies what IV does and enables extensions.
The IV moment condition: In the model $Y = X\beta + \varepsilon$, if instruments $Z$ are valid (uncorrelated with $\varepsilon$), then: $E[Z'(Y - X\beta)] = 0 \Leftrightarrow E[Z'\varepsilon] = 0$
This is a moment condition in GMM form: $g(Y_i, X_i, Z_i, \beta) = Z_i'(Y_i - X_i'\beta)$.
Just-identified case (# instruments = # endogenous variables): The GMM solution is exactly the IV estimator: $\hat{\beta}_{IV} = (Z'X)^{-1}Z'Y$
Over-identified case (# instruments > # endogenous variables): Cannot satisfy all moment conditions exactly. GMM minimizes weighted violations: $\hat{\beta}_{GMM} = \arg\min_\beta (Y - X\beta)'Z \cdot W \cdot Z'(Y - X\beta)$
With optimal weighting $W = (Z'Z)^{-1}$, this is two-stage least squares (2SLS).
The J-test for over-identification: If we have more instruments than needed, GMM provides a test of whether all instruments are valid. Under the null that all moment conditions hold, the minimized objective (appropriately scaled) follows a chi-squared distribution with degrees of freedom = (# instruments) − (# parameters). This is Hansen's J-test.
Implications: Understanding IV as GMM clarifies that (1) exclusion restrictions are moment conditions, (2) over-identification enables testing, and (3) weak instruments are really about weak moments. Advanced methods like efficient GMM, continuously updated GMM, and LIML are alternative ways of handling the same underlying moment conditions.
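The just-identified formula in the box can be verified directly with matrix algebra. A minimal sketch in R on simulated data (all variable names and parameter values are illustrative):

```r
# IV as GMM: just-identified case, beta_IV = (Z'X)^{-1} Z'Y
set.seed(7)
n <- 5000
z <- rnorm(n)                      # instrument
u <- rnorm(n)                      # unobserved confounder
x <- 0.8 * z + u + rnorm(n)        # endogenous regressor (correlated with the error via u)
y <- 1 + 2 * x + u + rnorm(n)      # true coefficient on x is 2

X <- cbind(1, x)                   # regressors, including a constant
Z <- cbind(1, z)                   # instrument matrix (the constant instruments itself)

beta_ols <- solve(t(X) %*% X, t(X) %*% y)   # biased: absorbs the confounding through u
beta_iv  <- solve(t(Z) %*% X, t(Z) %*% y)   # enforces the moment condition E[Z'(Y - X*beta)] = 0
cbind(beta_ols, beta_iv)           # the IV column should be close to (1, 2)
```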
3.4 Hypothesis Testing
Hypothesis testing provides a framework for deciding whether observed data are consistent with a specific claim about the world. The logic is indirect: we assume the claim is true, ask how likely our data would be under that assumption, and reject the claim if the data are sufficiently unlikely.
This approach rests on several assumptions. First, we must specify a probability model linking parameters to data. Second, we need a test statistic whose distribution under the null hypothesis is known (exactly or approximately). Third, we choose a threshold for "sufficiently unlikely" before seeing the data. The frequentist framework interprets probability as long-run frequency: if we repeated the same procedure many times, how often would we make errors?
The Framework
Null hypothesis H0: A statement about the parameter (often "no effect": θ=0). The null is typically a sharp claim—an exact value or set of values we test against.
Alternative hypothesis H1: What we conclude if we reject H0 (often $\theta \neq 0$). The alternative can be one-sided ($\theta > 0$) or two-sided ($\theta \neq 0$).
Test statistic: A function of the data with known distribution under H0. Common examples include the t-statistic (for means and regression coefficients), the F-statistic (for joint hypotheses), and the chi-squared statistic (for categorical data).
Decision rule: Reject H0 if test statistic exceeds critical value. The critical value is determined by the significance level α—the maximum probability of false rejection we're willing to accept.
Errors
Type I error: Reject H0 when it's true (false positive)
Probability = significance level α
Convention: α=0.05
Type II error: Fail to reject H0 when it's false (false negative)
Probability = β
Power = 1−β = probability of detecting a true effect
P-Values
Definition 3.2 (P-Value): The p-value is the probability, under H0, of observing a test statistic at least as extreme as the one computed.
Interpretation: Small p-value = data unlikely under H0 = evidence against H0
Common threshold: p<0.05 often interpreted as "statistically significant"
Caution: P-values are widely misinterpreted. A p-value is NOT:
The probability that H0 is true
The probability of a Type I error given this result
A measure of effect size or importance
The P-Value Controversy
The misuse of p-values has become a central concern in empirical research. In 2016, the American Statistical Association took the unprecedented step of releasing an official statement on statistical significance and p-values—the first time in its 177-year history that it had issued guidance on a specific matter of statistical practice (Wasserstein and Lazar 2016). The statement was prompted by growing concern about the "reproducibility crisis" and the recognition that "while p-values can be useful, they are commonly misused and misinterpreted."
The ASA's Six Principles
The statement articulated six principles:
1. P-values can indicate how incompatible the data are with a specified statistical model. This is what p-values actually measure—the degree to which observed data conflict with what would be expected under the null hypothesis.
2. P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone. This is the most common misinterpretation. A p-value of 0.03 does not mean there's a 3% chance the null is true; it means that if the null were true, data this extreme would occur 3% of the time.
3. Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold. The practice of treating p<0.05 as a bright line between "real" and "not real" effects has no scientific basis. Fisher never intended 0.05 as a rigid cutoff.
4. Proper inference requires full reporting and transparency. P-hacking—selectively reporting tests that achieve significance while burying those that don't—invalidates the statistical logic underlying p-values.
5. A p-value, or statistical significance, does not measure the size of an effect or the importance of a result. A tiny, trivial effect can be "highly significant" with enough data; a large, important effect can be "not significant" with too little data.
6. By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis. Other approaches—effect sizes, confidence intervals, Bayesian methods, likelihood ratios—often convey more useful information.
Why This Matters Now
The ASA statement emerged from a specific historical moment. The replication crisis—the discovery that many published findings, especially in psychology and biomedicine, fail to replicate—shook confidence in the scientific literature. Investigations revealed that researcher degrees of freedom (choices about data analysis, sample definition, and outcome measurement) combined with publication bias (journals preferring significant results) produced literatures full of false positives.
P-values, while not the root cause, became a focal point. The binary publication rule—significant findings get published, null findings don't—created perverse incentives. Researchers learned to hunt for p<0.05 through specification searches, data dredging, and selective reporting. The result was a literature where "significant" findings were often statistical flukes rather than real effects.
The Post p<0.05 Era
The ASA convened a follow-up symposium in 2017 titled "Scientific Method for the 21st Century: A World Beyond p<0.05." The discussions explored alternatives: Bayesian methods, likelihood ratios, effect sizes with confidence intervals, and even proposals to abandon significance testing entirely.
Several journals have responded. Some have banned or discouraged p-values. Others require reporting of effect sizes and confidence intervals alongside p-values. Pre-registration—specifying analyses before seeing data—has become increasingly common as a protection against p-hacking.
Practical Implications
For the researcher, the implications are:
Report effect sizes, not just significance. Always interpret the magnitude of estimated effects, not just whether they cross an arbitrary threshold.
Report confidence intervals. A 95% CI of [0.02,4.50] tells a very different story than [1.80,2.20], even if both exclude zero.
Consider the prior probability. A "significant" finding from an underpowered study testing an implausible hypothesis is probably a false positive. Context matters.
Don't treat p=0.049 and p=0.051 as categorically different. They are statistically indistinguishable and should be interpreted similarly.
Be transparent about all analyses conducted. Report the full set of tests, not just those that "worked."
Consider Bayesian alternatives when appropriate. Bayesian methods directly quantify the probability that hypotheses are true, conditional on data and priors—often what researchers actually want to know.
The ASA statement did not call for abandoning p-values. Rather, it called for using them properly: as one piece of evidence among many, never as a sole arbiter of truth, and always accompanied by effect sizes, confidence intervals, and substantive reasoning. The p-value is a tool, not a verdict.
Confidence Intervals
Definition 3.3 (Confidence Interval): A (1−α) confidence interval is a random interval [L(X),U(X)] such that P(θ∈[L,U])=1−α before the data are observed.
Interpretation: If we repeated the experiment many times, (1−α) of the intervals would contain the true parameter.
Standard CI for mean (large sample): $\bar{X} \pm z_{\alpha/2} \cdot \frac{s}{\sqrt{n}}$
where $z_{\alpha/2} \approx 1.96$ for 95% confidence.
Worked Example: Wage Gap
Question: Is there a gender wage gap?
Data: We observe n=500 workers. Sample mean wages are $52,000 for men and $47,000 for women. Sample standard deviations are $15,000 for men and $12,000 for women, with 250 workers in each group.
Hypotheses:
H0: $\mu_{men} - \mu_{women} = 0$ (no gap)
H1: $\mu_{men} - \mu_{women} \neq 0$ (there is a gap)
Test statistic: We use the two-sample t-test, which compares means across groups while accounting for sampling variability in both groups:
$t = \frac{(\bar{Y}_{men} - \bar{Y}_{women}) - 0}{\sqrt{s^2_{men}/n_{men} + s^2_{women}/n_{women}}}$
Substituting:
$t = \frac{5000}{\sqrt{15000^2/250 + 12000^2/250}} = \frac{5000}{\sqrt{900000 + 576000}} = \frac{5000}{1215} \approx 4.12$
P-value: For t=4.12 with large degrees of freedom, p<0.0001. Under the null hypothesis of no gap, observing a t-statistic this extreme would occur less than 0.01% of the time.
Conclusion: We reject the null hypothesis. The $5,000 gap is statistically significant—unlikely to arise from sampling variation alone.
95% CI: $5000 \pm 1.96 \times 1215 = [2618, 7382]$
Interpretation: We're 95% confident the true wage gap is between $2,618 and $7,382. Note that this tells us about the existence and magnitude of the gap, but not its cause. Observing a gap doesn't tell us whether it reflects discrimination, occupational sorting, differences in hours worked, or other factors. Causal interpretation requires the methods in Part III.
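The same numbers can be reproduced from the summary statistics alone. A minimal sketch in R (with the underlying microdata, t.test() would do this in one call):

```r
# Two-sample t-test for the wage gap, computed from summary statistics
mean_m <- 52000; sd_m <- 15000; n_m <- 250
mean_w <- 47000; sd_w <- 12000; n_w <- 250

se_diff <- sqrt(sd_m^2 / n_m + sd_w^2 / n_w)     # about 1215
t_stat  <- (mean_m - mean_w) / se_diff           # about 4.12
p_value <- 2 * pnorm(-abs(t_stat))               # large-sample (normal) two-sided p-value
ci_95   <- (mean_m - mean_w) + c(-1.96, 1.96) * se_diff

c(t = t_stat, p = p_value, lower = ci_95[1], upper = ci_95[2])
```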
3.5 Bayesian vs. Frequentist Inference
Two Philosophies
Frequentist:
Parameters are fixed but unknown
Probability refers to long-run frequency
Inference via sampling distributions, p-values, confidence intervals
No prior information (or: prior is uniform/uninformative)
Bayesian:
Parameters have probability distributions
Probability represents degree of belief
Inference via updating prior to posterior
Prior information is explicit and influential
Bayesian Inference
Bayes' theorem for inference: $P(\theta \mid \text{Data}) = \frac{P(\text{Data} \mid \theta) \cdot P(\theta)}{P(\text{Data})}$
Components:
P(θ∣Data): Posterior distribution (updated belief)
P(Data∣θ): Likelihood (how probable is this data given θ?)
P(θ): Prior distribution (belief before data)
P(Data): Marginal likelihood (normalizing constant)
Example: Estimating a proportion
Prior: $\theta \sim \text{Beta}(1, 1)$ (uniform on [0, 1])
Data: 7 successes in 10 trials
Likelihood: $\text{Binomial}(10, \theta)$
Posterior: $\theta \mid \text{Data} \sim \text{Beta}(1 + 7, 1 + 3) = \text{Beta}(8, 4)$
Posterior mean: $8/12 \approx 0.67$
Posterior 95% credible interval: [0.39, 0.88]
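These posterior quantities follow directly from the Beta(8, 4) distribution. A minimal sketch in R:

```r
# Beta-binomial updating: Beta(1, 1) prior, 7 successes in 10 trials
a_post <- 1 + 7     # prior alpha + successes
b_post <- 1 + 3     # prior beta + failures

post_mean <- a_post / (a_post + b_post)              # about 0.67
cred_95   <- qbeta(c(0.025, 0.975), a_post, b_post)  # equal-tailed 95% credible interval
c(mean = post_mean, lower = cred_95[1], upper = cred_95[2])
```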
When to Use Which?
Frequentist advantages:
Objective (no prior choice required)
Established conventions and software
Easier to teach and communicate
Well-defined error rates (Type I, Type II)
Bayesian advantages:
Incorporates prior information explicitly
Direct probability statements about parameters
More natural for sequential updating
Better for small samples with informative priors
Straightforward handling of nuisance parameters
Can You Use Both?
Yes—and many researchers do. The frequentist-Bayesian distinction is sometimes treated as an ideological divide, but in practice the approaches are complementary.
Pragmatic mixing: A researcher might use frequentist methods for primary hypothesis tests (where conventions are well established) but Bayesian methods for prediction, hierarchical modeling, or sensitivity analysis. The key is understanding what each approach delivers: frequentist methods control long-run error rates; Bayesian methods deliver probability statements conditional on the model and prior.
Asymptotic equivalence: With large samples and diffuse priors, Bayesian posterior intervals often numerically coincide with frequentist confidence intervals. The philosophical interpretation differs—a 95% credible interval says "there's a 95% probability the parameter lies here given the data and prior," while a 95% confidence interval says "this procedure captures the true parameter 95% of the time"—but the numbers may be identical.
Where they diverge: The approaches differ most with small samples, strong priors, or complex models. In these settings, the choice matters and should be made deliberately.
In practice: Most applied economics is frequentist, but Bayesian methods are growing, especially for hierarchical models, meta-analysis, and forecasting. Macroeconomics increasingly relies on Bayesian estimation of DSGE models (Chapter 17). Machine learning often adopts a Bayesian perspective for regularization and uncertainty quantification.
Box: The Bayesian Origin of Regularization
Ridge regression and LASSO—the workhorses of machine learning—have a natural Bayesian interpretation. This connection reveals what regularization really does.
The equivalence:
Ridge: $\min_\beta \sum_i (Y_i - X_i'\beta)^2 + \lambda \|\beta\|_2^2$
Normal prior: $\beta_j \sim N(0, \sigma^2/\lambda)$
LASSO: $\min_\beta \sum_i (Y_i - X_i'\beta)^2 + \lambda \|\beta\|_1$
Laplace prior: $\beta_j \sim \text{Laplace}(0, 1/\lambda)$
Elastic Net: combines both
Mixture of Normal and Laplace
What this means:
Ridge = Bayesian regression with the prior belief that coefficients are small (centered at zero)
LASSO = Bayesian regression with the prior belief that most coefficients are exactly zero (sparsity)
The penalty parameter λ = How strongly you believe the prior relative to the data
Intuition: Regularization isn't just a "trick" to prevent overfitting—it's incorporating prior information that extreme coefficients are unlikely. The penalty encodes skepticism about large effects.
Why this matters for causal inference:
Regularization shrinks coefficients toward zero, introducing bias
For causal parameters, this bias can be a problem (see DML in Chapter 21)
But for nuisance parameters (propensity scores, outcome models), regularization often improves performance
Understanding the Bayesian interpretation clarifies what prior beliefs regularization imposes
Reference: Murphy (2012), Machine Learning: A Probabilistic Perspective, Chapter 7.
3.6 The Regression Framework
Linear Regression Model
$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + ... + \beta_k X_{ki} + \varepsilon_i$
or in matrix form: Y=Xβ+ε
Components:
Y: Outcome (dependent variable)
X: Regressors (independent variables, covariates)
β: Coefficients (parameters to estimate)
ε: Error term (unobserved)
OLS Estimation
Ordinary Least Squares minimizes sum of squared residuals:
$\hat{\beta}_{OLS} = \arg\min_\beta \sum_{i=1}^n (Y_i - X_i'\beta)^2$
Solution (derived in the Technical Appendix):
$\hat{\beta} = (X'X)^{-1}X'Y$
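The closed-form solution is easy to verify against R's built-in lm(). A minimal sketch on simulated data (variable names are illustrative):

```r
# OLS "by hand" versus lm()
set.seed(1)
n  <- 1000
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n)

X <- cbind(1, x1, x2)                      # design matrix with a constant
beta_hat <- solve(t(X) %*% X, t(X) %*% y)  # (X'X)^{-1} X'Y

cbind(by_hand = beta_hat, lm = coef(lm(y ~ x1 + x2)))  # identical up to numerical precision
```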
Classical Assumptions
For OLS to be the Best Linear Unbiased Estimator (BLUE):
1. Linearity: $Y = X\beta + \varepsilon$
2. Random sampling: $(Y_i, X_i)$ iid across observations
3. No perfect multicollinearity: $X'X$ is invertible
4. Zero conditional mean: $E[\varepsilon \mid X] = 0$
5. Homoskedasticity: $Var(\varepsilon \mid X) = \sigma^2$
6. Normality (for exact inference): $\varepsilon \mid X \sim N(0, \sigma^2)$
What Does "BLUE" Mean?
The Gauss-Markov theorem states that under assumptions 1-5, OLS is the Best Linear Unbiased Estimator. "Best" means minimum variance; "Linear" means a linear function of Y; "Unbiased" means E[β^]=β.
In the credibility revolution era (Chapter 1), researchers often focus less on BLUE and more on consistency and identification. Why? First, BLUE is a finite-sample result—with large samples, consistency matters more than minimum variance among linear estimators. Second, "best among linear estimators" ignores nonlinear alternatives that might be more efficient. Third, the BLUE property requires homoskedasticity, which rarely holds in practice. Modern applied work therefore relies on robust standard errors (discussed below) rather than trusting the classical variance formula.
Still, OLS remains the workhorse estimator. Even when BLUE doesn't strictly hold, OLS is often consistent, computationally simple, and well understood. The emphasis shifts from optimality theorems to practical concerns: Is the identifying assumption (E[ε∣X]=0) plausible? Are standard errors computed correctly?
Box: From Finite-Sample to Large-Sample Thinking
Graduate econometrics courses traditionally emphasize the Gauss-Markov theorem and BLUE. But modern applied work has shifted toward large-sample (asymptotic) properties:
| Finite-sample (classical) emphasis | Large-sample (asymptotic) emphasis |
|---|---|
| Unbiasedness | Consistency |
| BLUE (minimum variance among linear) | Asymptotic efficiency |
| Exact normality assumption | CLT-based inference |
| Homoskedasticity required | Robust SEs handle heteroskedasticity |
Why the shift?
Sample sizes grew: With thousands or millions of observations, consistency matters more than finite-sample unbiasedness
Nonlinear methods became common: ML estimators, GMM, and IV are biased in finite samples but consistent—BLUE doesn't apply to them
Robust inference became standard: Heteroskedasticity-robust and cluster-robust SEs don't require homoskedasticity
Focus shifted to identification: Whether the estimand equals the causal effect matters more than whether it's BLUE
Practical implication: When evaluating an estimator, ask: "Is it consistent for the parameter I care about?" and "Are my standard errors valid?" rather than "Is it BLUE?"
Interpretation
Definition 3.4 (Regression Coefficient): βj is the expected change in Y associated with a one-unit change in Xj, holding other regressors constant.
Caution: This is a conditional association, not necessarily a causal effect. E[ε∣X]=0 requires that there are no omitted variables correlated with both X and Y—a strong assumption that usually fails in observational data (Chapter 11).
Box: Modeling Proportion Outcomes
OLS assumes the outcome can take any value on the real line. But many outcomes are bounded—most commonly proportions or rates that must lie between 0 and 1. Examples include vote shares, test score percentiles, budget allocations, and the share of income spent on a category. Applying OLS to proportions can produce nonsensical predictions (negative values or values exceeding 1) and biased estimates.
Four approaches to proportion outcomes:
| Approach | Model | Advantages | Limitations |
|---|---|---|---|
| Linear Probability Model (OLS) | $Y = X\beta + \varepsilon$ | Simple, familiar coefficients | Predictions outside [0,1]; heteroskedastic by construction |
| Logit transformation | $\log(Y/(1-Y)) = X\beta + \varepsilon$ | Constrains predictions | Undefined at $Y = 0$ or $Y = 1$; coefficients hard to interpret |
| Fractional logit (Papke & Wooldridge 1996) | Quasi-MLE with logistic mean | Handles 0s and 1s; robust | Coefficients on logit scale |
| Beta regression | $Y \sim \text{Beta}(\mu, \phi)$ with $g(\mu) = X\beta$ | Models full distribution; natural for proportions | Cannot handle exact 0s or 1s without extension |
Beta regression treats Y as drawn from a Beta distribution, parameterized by mean μ and precision ϕ. The mean is linked to covariates through a logit (or other) link function. This approach:
Naturally respects the [0,1] bounds
Allows modeling of heteroskedasticity (precision can vary with covariates)
Produces predictions that are always valid proportions
When to use what:
If proportions cluster near 0.5 and precision doesn't matter: OLS may suffice as approximation
If you have exact 0s or 1s in the data: Use fractional logit or zero/one-inflated beta
If you want to model the full distribution and have no boundary values: Beta regression
For causal inference: Report average marginal effects (Chapter 11) regardless of model, since coefficients from nonlinear models aren't directly interpretable as effects on Y
In R, beta regression is implemented in the betareg package; Bayesian estimation is available via brms. See Cribari-Neto & Zeileis (2010) for a comprehensive treatment.
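A minimal sketch of the two main alternatives in R, assuming a data frame df with a proportion outcome y in [0, 1] and a covariate x (all names are illustrative; the quasibinomial glm() is one standard way to implement the Papke-Wooldridge fractional logit):

```r
# Fractional logit (Papke-Wooldridge): quasi-MLE with a logistic mean; handles exact 0s and 1s
frac_logit <- glm(y ~ x, family = quasibinomial(link = "logit"), data = df)

# Beta regression: requires 0 < y < 1 (transform boundary values or use an inflated model)
library(betareg)
beta_fit <- betareg(y ~ x, data = df)

# For interpretation, report average marginal effects rather than raw coefficients,
# e.g., via the marginaleffects or margins packages.
```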
Standard Errors and Inference
Under classical assumptions:
$\hat{\beta} \sim N(\beta, \sigma^2 (X'X)^{-1})$
Standard error: The standard error of $\hat{\beta}_j$ is the square root of the j-th diagonal element of the variance matrix:
$SE(\hat{\beta}_j) = \sqrt{\hat{\sigma}^2 \cdot [(X'X)^{-1}]_{jj}}$
t-statistic: $t_j = \hat{\beta}_j / SE(\hat{\beta}_j)$
Robust standard errors (Huber-White): When homoskedasticity fails, the classical variance formula is wrong. Robust standard errors remain valid under heteroskedasticity:
$\hat{V}_{robust} = (X'X)^{-1}\left(\sum_{i=1}^n \hat{\varepsilon}_i^2 X_i X_i'\right)(X'X)^{-1}$
This "sandwich" formula (so called because the middle term is sandwiched between $(X'X)^{-1}$ terms) is now standard practice. In Stata, add , robust to regression commands; in R, use the sandwich package.
Clustered standard errors: When observations are correlated within groups (e.g., students within schools, workers within firms), standard errors must account for this clustering. Ignoring clustering leads to standard errors that are too small—often dramatically so—and inflated t-statistics.
Rule of Thumb: Cluster at the Level of Treatment Assignment
If treatment is assigned at the school level, cluster at the school level. If a policy affects all workers in a state, cluster at the state level. The intuition: units that receive the same treatment share common shocks, so their residuals are correlated. Clustering corrects for this.
Box: Common Clustering Mistakes and Edge Cases
Mistake 1: Clustering too narrowly
If treatment varies at the state-year level but you cluster only at the individual level, you understate uncertainty. Always cluster at least at the level of treatment variation.
Mistake 2: Ignoring correlation structure
With panel data, residuals may be correlated both within units over time and within time periods across units. Options:
Two-way clustering: cluster = ~ unit + time (Stata: cluster(unit) cluster(time))
Conservative: cluster at the higher level of aggregation
Mistake 3: Clustering on outcome-determined groups
Don't cluster on variables affected by treatment (e.g., post-treatment occupation). Cluster on pre-determined or exogenous groupings.
Edge case: Multiple possible clustering levels
With students in classrooms in schools in districts, cluster at the level where treatment varies:
Teacher training intervention → cluster at classroom or school
State policy change → cluster at state
When in doubt, cluster higher—this is conservative
Edge case: Regression discontinuity
In RD designs, clustering is less clear. If assignment is based on a continuous running variable with no grouped structure, robust SEs may suffice. But if the running variable has mass points or groups, cluster accordingly.
See Abadie, Athey, Imbens & Wooldridge (2023) for a comprehensive treatment of when and how to cluster.
The mechanics: The clustered variance estimator sums residuals within clusters before squaring:
$\hat{V}_{cluster} = (X'X)^{-1}\left(\sum_{g=1}^G \hat{u}_g \hat{u}_g'\right)(X'X)^{-1}$
where $\hat{u}_g = \sum_{i \in g} X_i \hat{\varepsilon}_i$ is the sum of score contributions within cluster $g$.
How many clusters?: Cluster-robust SEs perform poorly with few clusters (G<30−50). With small G, use:
Wild cluster bootstrap (see Chapter 13)
Cluster-robust t-tests with adjusted degrees of freedom
Randomization inference
Implementation:
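A minimal sketch in R, assuming a data frame df with outcome y, regressor x, and a grouping variable state at the level of treatment assignment (all names are illustrative; vcovCL() from the sandwich package is one common route, fixest another):

```r
library(sandwich)   # vcovCL() for cluster-robust variance
library(lmtest)

m <- lm(y ~ x, data = df)

# Cluster-robust standard errors, clustered at the level of treatment assignment
coeftest(m, vcov = vcovCL(m, cluster = ~ state))

# Equivalent one-step alternative with the fixest package:
# library(fixest)
# feols(y ~ x, data = df, cluster = ~ state)
```

In Stata, the equivalent is regress y x, vce(cluster state).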
The Bootstrap
The bootstrap is a simulation-based method for computing standard errors, confidence intervals, and other measures of statistical uncertainty. Rather than deriving analytical formulas (which can be difficult or unavailable for complex estimators), the bootstrap approximates the sampling distribution directly.
The idea: If we could repeatedly sample from the population, we'd see how much our estimate varies. We can't resample from the population, but we can resample from our sample. Treating the sample as if it were the population, we draw many "bootstrap samples" (random samples with replacement from the original data) and compute our statistic for each. The variability across bootstrap samples approximates the sampling variability.
The algorithm:
1. From the original sample of size n, draw B bootstrap samples (each of size n, with replacement)
2. Compute the statistic $\hat{\theta}^{(b)}$ for each bootstrap sample $b = 1, ..., B$
3. The bootstrap standard error is the standard deviation of $\{\hat{\theta}^{(1)}, ..., \hat{\theta}^{(B)}\}$
4. Confidence intervals can be constructed from the percentiles (e.g., 2.5th and 97.5th for 95% CI)
When to use the bootstrap:
When analytical standard errors are unavailable or complex (e.g., median, quantile regressions, matching estimators)
When asymptotic approximations may be poor (small samples)
When you want confidence intervals for nonstandard quantities (ratios, differences in coefficients)
Cautions:
Bootstrap requires iid sampling or appropriate adjustments (e.g., block bootstrap for clustered data)
Some statistics have non-standard behavior that the bootstrap can't fix (e.g., unit root tests)
Computational cost: B=1000 or more replications is typical
The bootstrap has become routine in applied economics. Software makes it easy: in R, the boot package provides general bootstrap functionality; in Stata, the bootstrap prefix works with most estimation commands.
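A minimal sketch of the algorithm in R, computing a bootstrap standard error and percentile interval for the sample median (the data and choice of statistic are illustrative):

```r
# Nonparametric bootstrap for the sample median
set.seed(99)
x <- rexp(200, rate = 1)    # an illustrative skewed sample
B <- 2000                   # number of bootstrap replications

boot_medians <- replicate(B, median(sample(x, size = length(x), replace = TRUE)))

se_boot <- sd(boot_medians)                          # bootstrap standard error
ci_boot <- quantile(boot_medians, c(0.025, 0.975))   # percentile 95% CI
c(estimate = median(x), se = se_boot, ci_boot)
```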
Worked Example: Returns to Education
Model: $\log(\text{Wage}_i) = \beta_0 + \beta_1 \text{Education}_i + \beta_2 \text{Experience}_i + \beta_3 \text{Experience}_i^2 + \varepsilon_i$
Data: Current Population Survey, working adults, n = 10,000
Results (hypothetical):
| Variable | Coefficient | Std. Error | t-statistic |
|---|---|---|---|
| Education | 0.085 | 0.004 | 21.3 |
| Experience | 0.035 | 0.003 | 11.7 |
| Experience² | -0.0005 | 0.0001 | -5.0 |
| Constant | 1.25 | 0.05 | 25.0 |
Interpretation:
Each additional year of education is associated with 8.5% higher wages
Experience has diminishing returns (positive linear, negative quadratic)
Caution: This is a descriptive regression. The 8.5% is the association between education and wages after controlling for experience. It is NOT the causal effect of education—people with more education differ in unobserved ways (ability, family background) that also affect wages. Identifying the causal effect requires methods from Chapters 10-14.
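For reference, a minimal sketch of how such a regression might be run in R, assuming a data frame cps with columns wage, educ, and exper (names are illustrative; results will not match the hypothetical table above):

```r
# Mincer-style wage regression with heteroskedasticity-robust standard errors
library(sandwich)
library(lmtest)

mincer <- lm(log(wage) ~ educ + exper + I(exper^2), data = cps)
coeftest(mincer, vcov = vcovHC(mincer, type = "HC1"))
```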
Practical Guidance
Choosing Methods
| Situation | Recommended approach |
|---|---|
| Large sample, standard inference | Frequentist with robust SEs |
| Small sample, strong priors | Bayesian |
| Clustered data | Clustered SEs or hierarchical models |
| Heteroskedasticity suspected | Robust SEs (always safe) |
| Complex hypothesis (joint test) | F-test or likelihood ratio |
Common Pitfalls
Pitfall 1: P-value worship
Treating p<0.05 as meaningful and p>0.05 as meaningless.
How to avoid: Report effect sizes and confidence intervals, not just significance. A large, imprecisely estimated effect may be more important than a small, precisely estimated one.
Pitfall 2: Confusing statistical and practical significance
A tiny effect can be "statistically significant" with large n.
How to avoid: Always interpret magnitude. Ask: is this effect economically meaningful?
Pitfall 3: Ignoring multiple testing
Testing many hypotheses inflates the false positive rate.
How to avoid: Adjust for multiple comparisons (Bonferroni, FDR). Report all tests conducted.
Pitfall 4: Forgetting model assumptions
Using standard errors that assume homoskedasticity when it fails.
How to avoid: Use robust or clustered SEs by default. Test assumptions when possible.
Summary
Key takeaways:
Probability provides the foundation: Random variables, distributions, and expectations are the language of statistical inference.
Sampling distributions connect sample to population: The CLT justifies normal-based inference for sample means.
Estimation has desirable properties: Unbiasedness, consistency, and efficiency guide our choice of estimators.
Hypothesis testing trades off error types: Significance level controls Type I error; power addresses Type II.
P-values are tools, not verdicts: The ASA statement reminds us to report effect sizes, use confidence intervals, and never treat p<0.05 as a bright line.
Frequentist and Bayesian approaches offer different perspectives: Neither is "correct"; both are useful.
Regression is fundamental: OLS provides a workhorse, but causal interpretation requires additional assumptions.
Returning to the opening question: We learn from data by understanding that our observations are random realizations of an underlying process. Statistical inference—through estimation, testing, and interval construction—allows us to reason about the process while acknowledging the uncertainty inherent in seeing only one sample.
Further Reading
Essential
Angrist and Pischke (2009), Mostly Harmless Econometrics, Ch. 3 - Regression review for economists
Freedman (2009), Statistical Models: Theory and Practice - Clear treatment of foundations
For Deeper Understanding
Casella and Berger (2002), Statistical Inference - Comprehensive mathematical statistics
Gelman et al. (2013), Bayesian Data Analysis - Modern Bayesian treatment
Hansen (2022), Econometrics - Graduate econometrics textbook (free online)
Historical/Philosophical
Stigler (1986), The History of Statistics - Fascinating history
Hacking (2001), An Introduction to Probability and Inductive Logic - Philosophical foundations
Mayo (2018), Statistical Inference as Severe Testing - Philosophy of frequentist inference
On Statistical Practice and the P-Value Debate
Wasserstein and Lazar (2016), "The ASA's Statement on p-Values: Context, Process, and Purpose," The American Statistician 70(2): 129-133 - The landmark ASA statement
Wasserstein, Schirm, and Lazar (2019), "Moving to a World Beyond 'p < 0.05'," The American Statistician 73(sup1): 1-19 - Follow-up with 43 invited commentaries
Benjamin et al. (2018), "Redefine Statistical Significance," Nature Human Behaviour 2: 6-10 - Proposal to lower threshold to 0.005
McShane et al. (2019), "Abandon Statistical Significance," The American Statistician 73(sup1): 235-245 - The case against bright-line thresholds
Applications
Imbens and Rubin (2015), Causal Inference for Statistics, Social, and Biomedical Sciences - Connects statistics to causation
Gelman and Hill (2007), Data Analysis Using Regression and Multilevel/Hierarchical Models - Applied Bayesian regression
Specialized Regression Models
Papke and Wooldridge (1996), "Econometric Methods for Fractional Response Variables with an Application to 401(k) Plan Participation Rates," Journal of Applied Econometrics 11(6): 619-632 - Fractional logit for proportions
Cribari-Neto and Zeileis (2010), "Beta Regression in R," Journal of Statistical Software 34(2): 1-24 - Comprehensive treatment of beta regression with R implementation
Exercises
Conceptual
Explain the difference between a parameter and a statistic. Why is the sample mean a statistic but the population mean a parameter?
A researcher reports a "marginally significant" result with p=0.08. Another reports a "highly significant" result with p=0.001. What can and can't we conclude from these p-values?
Explain the Central Limit Theorem in plain language. Why does it matter for statistical practice?
Applied
Generate 1,000 samples of size n=30 from an exponential distribution. For each sample, compute the sample mean and construct a 95% confidence interval for the population mean. What proportion of intervals contain the true mean? Does this match theory?
Using CPS or similar data:
Estimate a Mincer wage regression (log wage on education, experience, experience²)
Report standard errors: (a) assuming homoskedasticity, (b) robust
Do the conclusions change?
Discussion
Bayesian methods require specifying a prior distribution. Critics argue this makes results subjective. Defenders argue that frequentist methods also embed assumptions (just less explicitly). Who has the stronger argument?
A medical researcher tests whether a new drug reduces blood pressure. The study finds a mean reduction of 2 mmHg with p=0.03 and 95% CI of [0.2,3.8]. The researcher concludes: "There is a 97% probability that the drug works." A newspaper reports: "Scientists prove new drug lowers blood pressure." Evaluate both statements in light of the ASA's six principles. What would be a more accurate summary of these findings?
Consider these three hypothetical studies:
Study A: Tests a new physics theory with strong prior theoretical support. Finds p=0.04 with n=50.
Study B: Tests whether birth month affects personality. Finds p=0.01 with n=10,000.
Study C: Replicates a well-established finding. Finds p=0.08 with n=200.
Rank these from most to least convincing evidence, and explain your reasoning. Why might a smaller p-value not always indicate stronger evidence?
Technical Appendix
A. Derivation of OLS Estimator
From the first-order conditions of minimizing $\sum_i (Y_i - X_i'\beta)^2$: $\frac{\partial}{\partial \beta}\sum_i (Y_i - X_i'\beta)^2 = -2\sum_i X_i(Y_i - X_i'\beta) = 0 \Rightarrow X'Y = X'X\beta \Rightarrow \hat{\beta} = (X'X)^{-1}X'Y$
B. Properties of OLS Under Classical Assumptions
Unbiasedness: $E[\hat{\beta} \mid X] = E[(X'X)^{-1}X'Y \mid X] = (X'X)^{-1}X'E[Y \mid X] = (X'X)^{-1}X'X\beta = \beta$
Variance: $Var(\hat{\beta} \mid X) = (X'X)^{-1}X'Var(Y \mid X)X(X'X)^{-1} = \sigma^2(X'X)^{-1}$
Gauss-Markov Theorem: Under assumptions 1-5, OLS is BLUE (Best Linear Unbiased Estimator).
C. Asymptotic Distribution
Under weaker conditions (finite fourth moments, $E[X_iX_i']$ invertible): $\sqrt{n}(\hat{\beta} - \beta) \xrightarrow{d} N\left(0, E[X_iX_i']^{-1} E[\varepsilon_i^2 X_iX_i'] E[X_iX_i']^{-1}\right)$
This is the "sandwich" form that justifies robust standard errors.