Chapter 3: Statistical Foundations

Opening Question

How do we learn from data while accounting for the fundamental uncertainty that what we observe is just one realization of what could have happened?


Chapter Overview

Empirical work rests on statistical foundations. We observe data—a sample of what could have occurred—and want to learn about the underlying process that generated it. The challenge is that our sample is random: a different draw from the same process would give different numbers. Statistical inference provides the tools to reason from sample to population while acknowledging this uncertainty.

This chapter reviews the statistical foundations needed for the methods in this book. It's not a substitute for a full statistics or econometrics course, but rather a focused treatment of concepts that will recur: probability distributions, estimation, hypothesis testing, and regression. Readers with strong statistical training may skim this chapter; those newer to quantitative methods should work through it carefully.

What you will learn:

  • Probability concepts essential for statistical inference

  • How sampling distributions connect sample statistics to population parameters

  • Frequentist and Bayesian perspectives on inference

  • Properties of estimators and common estimation methods

  • The regression framework as a foundation for causal inference methods

Prerequisites: Basic algebra; comfort with summation notation


Historical Context: The Statistical Revolution

Modern statistical inference emerged in the early 20th century through the work of Karl Pearson, Ronald Fisher, Jerzy Neyman, and Egon Pearson. Their debates shaped the methods we use today.

Fisher (1890-1962) developed maximum likelihood estimation, analysis of variance, and the randomization approach to experiments. His significance testing framework—computing p-values to assess whether data are consistent with a null hypothesis—became the dominant paradigm.

Neyman and Pearson formalized hypothesis testing as a decision problem, introducing the concepts of Type I and Type II errors, power, and the Neyman-Pearson lemma for optimal tests.

The frequentist-Bayesian divide reflects philosophical disagreements about probability itself. Frequentists interpret probability as long-run frequency; Bayesians interpret it as degree of belief. Both approaches inform modern practice.

In economics, the Cowles Commission (1940s-50s) developed simultaneous equations methods and laid the foundation for structural econometrics. The "credibility revolution" (1990s-present) shifted emphasis toward research design and reduced-form identification, but statistical foundations remain essential.


3.1 Probability Essentials

Random Variables

A random variable $X$ is a function that assigns numbers to outcomes of a random process. We distinguish:

Discrete random variables: Take countable values (integers, categories).

Continuous random variables: Take uncountable values (any real number in an interval).

Probability Distributions

The distribution of $X$ tells us how probability is spread across possible values.

For discrete $X$: Probability mass function $P(X = x) = f(x)$

For continuous $X$: Probability density function $f(x)$, where $P(a < X < b) = \int_a^b f(x)dx$

Cumulative distribution function (both cases): $F(x) = P(X \leq x)$

Key Distributions

Normal (Gaussian): The bell curve, $X \sim N(\mu, \sigma^2)$

$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

Why it matters: The Central Limit Theorem makes the normal distribution the limiting distribution for many sample statistics.

Chi-squared: $\chi^2_k$ with $k$ degrees of freedom

  • Sum of $k$ squared standard normals

  • Used in variance estimation and goodness-of-fit tests

t-distribution: $t_k$ with $k$ degrees of freedom

  • Heavier tails than normal

  • Arises in small-sample inference when variance is estimated

F-distribution: Ratio of two independent chi-squared variables, each divided by its degrees of freedom

  • Used in joint hypothesis tests, ANOVA

Expectations and Moments

Expected value: $E[X] = \sum_x x \cdot f(x)$ (discrete) or $\int x \cdot f(x)dx$ (continuous)

  • The probability-weighted average of possible values

Variance: $Var(X) = E[(X - E[X])^2] = E[X^2] - (E[X])^2$

  • Measures spread around the mean

Standard deviation: $SD(X) = \sqrt{Var(X)}$

Covariance: $Cov(X, Y) = E[(X - E[X])(Y - E[Y])]$

  • Measures linear association between two variables

Correlation: $\rho_{XY} = \frac{Cov(X,Y)}{SD(X) \cdot SD(Y)}$

  • Standardized covariance; bounded between -1 and 1

Conditional Probability and Bayes' Rule

Conditional probability: $P(A|B) = \frac{P(A \cap B)}{P(B)}$

  • Probability of $A$ given that $B$ occurred

Bayes' rule: $P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}$

This connects:

  • $P(A|B)$: Posterior (what we want)

  • $P(B|A)$: Likelihood (what we can compute)

  • $P(A)$: Prior (what we believed before seeing data)

Worked Example: Testing for Disease

A disease affects 1% of the population. A test is 95% accurate (95% sensitivity and specificity).

Question: If someone tests positive, what's the probability they have the disease?

Setup:

  • $P(Disease) = 0.01$ (prevalence)

  • $P(Positive | Disease) = 0.95$ (sensitivity)

  • $P(Negative | No Disease) = 0.95$ (specificity)

By Bayes' rule: $P(Disease | Positive) = \frac{P(Positive | Disease) \cdot P(Disease)}{P(Positive)}$

Where: $P(Positive) = P(Positive | Disease) \cdot P(Disease) + P(Positive | No Disease) \cdot P(No Disease) = 0.95 \times 0.01 + 0.05 \times 0.99 = 0.0095 + 0.0495 = 0.059$

Therefore: $P(Disease | Positive) = \frac{0.95 \times 0.01}{0.059} = \frac{0.0095}{0.059} \approx 0.16$

Interpretation: Despite the "95% accurate" test, only 16% of positive results are true positives! The low base rate means false positives outnumber true positives. This illustrates why base rates matter—a lesson relevant for any diagnostic or predictive exercise.
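The arithmetic is easy to check. A minimal R sketch of the Bayes' rule calculation above (the variable names are ours, not from any package):

```r
# Sketch: Bayes' rule for the disease-testing example
prev <- 0.01   # P(Disease)
sens <- 0.95   # P(Positive | Disease)
spec <- 0.95   # P(Negative | No Disease)

p_positive <- sens * prev + (1 - spec) * (1 - prev)   # law of total probability
p_disease_given_positive <- sens * prev / p_positive

p_disease_given_positive   # roughly 0.16
```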


3.2 Sampling Distributions

From Sample to Population

We observe a sample $(X_1, X_2, ..., X_n)$ from a population with distribution $F$. We want to learn about population parameters: the mean $\mu$, variance $\sigma^2$, or other quantities.

Sample statistics summarize the sample:

  • Sample mean: $\bar{X} = \frac{1}{n}\sum_{i=1}^n X_i$

  • Sample variance: $S^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X})^2$

These statistics are themselves random variables—different samples give different values.

The Sampling Distribution

Definition 3.1 (Sampling Distribution): The sampling distribution of a statistic is the distribution of values the statistic would take across all possible samples of size $n$ from the population.

Example: If we repeatedly draw samples of 100 people and compute the sample mean income each time, the distribution of those means is the sampling distribution of $\bar{X}$.

Standard Error

The standard error is the standard deviation of a sampling distribution:

$$SE(\bar{X}) = \frac{\sigma}{\sqrt{n}}$$

Standard error decreases with sample size: larger samples give more precise estimates.

The Central Limit Theorem

Theorem 3.1 (Central Limit Theorem): For iid random variables $X_1, ..., X_n$ with mean $\mu$ and finite variance $\sigma^2$, as $n \to \infty$:

$$\frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \xrightarrow{d} N(0, 1)$$

Translation: Regardless of the population distribution, the sample mean is approximately normal for large samples.

Why it matters: This justifies using normal-based inference even when the underlying data aren't normal. It's the foundation of most hypothesis testing and confidence intervals.

Visualization: The CLT in Action

Figure 3.1: Sampling distributions of the mean for samples of size n=1, 10, 50, 100 from an exponential population. As sample size increases, the distribution of sample means becomes increasingly normal (bell-shaped), even though the underlying population is highly skewed.

Figure 3.1 illustrates the CLT in action. The exponential distribution is highly right-skewed—most values are small, but occasional large values pull the mean above the median. With $n=1$, the sampling distribution mirrors this skewness exactly. By $n=10$, the distribution is already more symmetric. At $n=50$ and $n=100$, the sampling distributions are nearly indistinguishable from normal, despite the skewed population. The red curve shows the theoretical normal approximation, which fits increasingly well as $n$ grows.
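A simulation along these lines takes only a few lines of R. This is a sketch under the assumption of an Exponential(1) population (so the CLT approximation is $N(1, 1/n)$); the sample sizes and replication count are illustrative:

```r
# Sketch: sampling distribution of the mean from an exponential population
set.seed(123)
reps <- 5000
sim_means <- function(n) replicate(reps, mean(rexp(n, rate = 1)))

par(mfrow = c(2, 2))
for (n in c(1, 10, 50, 100)) {
  hist(sim_means(n), breaks = 40, freq = FALSE,
       main = paste("n =", n), xlab = "Sample mean")
  curve(dnorm(x, mean = 1, sd = 1 / sqrt(n)), add = TRUE, col = "red")  # CLT approximation
}
```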


3.3 Estimation

Point Estimation

A point estimator $\hat{\theta}$ is a rule for computing an estimate of a parameter $\theta$ from data.

Desirable properties:

Unbiasedness: $E[\hat{\theta}] = \theta$

  • On average, the estimator hits the target

  • Bias = $E[\hat{\theta}] - \theta$

Bias captures systematic error—the extent to which an estimator is "off target" on average across repeated samples. An unbiased estimator will sometimes overestimate and sometimes underestimate, but these errors cancel out in expectation. A biased estimator, by contrast, tends to err in one direction. For example, using $n$ rather than $n-1$ in the denominator of the sample variance produces a downward bias—the estimator systematically underestimates the true variance.

The distinction between bias and variance matters. An estimator can be unbiased but highly variable (scattering wildly around the truth) or biased but precise (consistently off by the same amount). In some contexts, we accept small bias in exchange for lower variance—this is the bias-variance tradeoff that recurs in machine learning (Chapter 22).

Consistency: $\hat{\theta} \xrightarrow{p} \theta$ as $n \to \infty$

  • The estimator converges to the truth as sample size grows

Efficiency: Among unbiased estimators, has smallest variance

  • The Cramér-Rao lower bound gives the minimum achievable variance

Method of Moments (MOM)

Idea: Equate sample moments to population moments; solve for parameters.

Example: Estimating mean and variance

  • Population moments: $E[X] = \mu$, $E[X^2] = \sigma^2 + \mu^2$

  • Sample moments: $\bar{X}$, $\frac{1}{n}\sum X_i^2$

  • Solve: $\hat{\mu} = \bar{X}$, $\hat{\sigma}^2 = \frac{1}{n}\sum X_i^2 - \bar{X}^2$

Advantages: Simple; always applicable.

Disadvantages: May not be efficient; may not use all information in the data.
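As a concrete illustration, a minimal R sketch of these method-of-moments estimates on simulated data (the formulas are exactly the ones above):

```r
# Sketch: method-of-moments estimates of the mean and variance
set.seed(1)
x <- rnorm(200, mean = 5, sd = 2)    # simulated sample

mu_hat     <- mean(x)                # first sample moment
sigma2_hat <- mean(x^2) - mu_hat^2   # second sample moment minus squared first moment

c(mu_hat = mu_hat, sigma2_hat = sigma2_hat)
```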

Maximum Likelihood Estimation (MLE)

Idea: Choose parameter values that make the observed data most probable.

Likelihood function: $L(\theta) = \prod_{i=1}^n f(X_i; \theta)$

Log-likelihood: $\ell(\theta) = \sum_{i=1}^n \log f(X_i; \theta)$

MLE: $\hat{\theta}_{MLE} = \arg\max_\theta \ell(\theta)$

Example: Normal distribution with known variance

For $X_1, ..., X_n \sim N(\mu, \sigma^2)$ with $\sigma^2$ known:

$$\ell(\mu) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (X_i - \mu)^2$$

Taking the derivative and setting it to zero: $\hat{\mu}_{MLE} = \bar{X}$
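The same maximization can be done numerically. A minimal R sketch using optim as a general-purpose optimizer (simulated data; $\sigma^2 = 4$ is treated as known):

```r
# Sketch: maximum likelihood for a normal mean with known variance
set.seed(42)
x      <- rnorm(100, mean = 3, sd = 2)   # simulated data
sigma2 <- 4                              # known variance

neg_loglik <- function(mu) {
  0.5 * length(x) * log(2 * pi * sigma2) + sum((x - mu)^2) / (2 * sigma2)
}

fit <- optim(par = 0, fn = neg_loglik, method = "BFGS")
c(mle = fit$par, sample_mean = mean(x))  # the two estimates coincide
```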

Properties of MLE:

  • Consistent: $\hat{\theta}_{MLE} \xrightarrow{p} \theta_0$

  • Asymptotically normal: $\sqrt{n}(\hat{\theta}_{MLE} - \theta_0) \xrightarrow{d} N(0, I(\theta_0)^{-1})$

  • Asymptotically efficient: Achieves Cramér-Rao bound

Generalized Method of Moments (GMM)

Idea: Generalize MOM to use any moment conditions, possibly more conditions than parameters.

Setup: Moment conditions $E[g(X_i, \theta_0)] = 0$

GMM estimator: Minimize weighted distance from sample moments to zero:

$$\hat{\theta}_{\text{GMM}} = \arg\min_{\theta} \left[\frac{1}{n}\sum_{i=1}^{n} g(X_i, \theta)\right]' W \left[\frac{1}{n}\sum_{i=1}^{n} g(X_i, \theta)\right]$$

where $W$ is a positive definite weighting matrix. The key insight is that if we have more moment conditions than parameters (over-identification), we cannot satisfy all conditions exactly—so we minimize a weighted sum of squared violations.

Why it matters: GMM is the foundation for IV estimation (Chapter 12) and many structural approaches. When you estimate IV using two-stage least squares, you're applying a specific case of GMM.

Box: GMM Generalizes IV—The Moment Condition Connection

Instrumental variables estimation is a special case of GMM. Understanding this connection clarifies what IV does and enables extensions.

The IV moment condition: In the model $Y = X\beta + \varepsilon$, if instruments $Z$ are valid (uncorrelated with $\varepsilon$), then: $E[Z'(Y - X\beta)] = 0 \quad \Leftrightarrow \quad E[Z'\varepsilon] = 0$

This is a moment condition in GMM form: $g(Y_i, X_i, Z_i, \beta) = Z_i'(Y_i - X_i'\beta)$.

Just-identified case (# instruments = # endogenous variables): The GMM solution is exactly the IV estimator: $\hat{\beta}_{IV} = (Z'X)^{-1}Z'Y$

Over-identified case (# instruments > # endogenous variables): Cannot satisfy all moment conditions exactly. GMM minimizes weighted violations: $\hat{\beta}_{GMM} = \arg\min_\beta (Y - X\beta)'Z \cdot W \cdot Z'(Y - X\beta)$

With the weighting matrix $W = (Z'Z)^{-1}$ (optimal under homoskedasticity), this is two-stage least squares (2SLS).

The J-test for over-identification: If we have more instruments than needed, GMM provides a test of whether all instruments are valid. Under the null that all moment conditions hold, the minimized objective (appropriately scaled) follows a chi-squared distribution with degrees of freedom = (# instruments) − (# parameters). This is Hansen's J-test.

Implications: Understanding IV as GMM clarifies that (1) exclusion restrictions are moment conditions, (2) over-identification enables testing, and (3) weak instruments are really about weak moments. Advanced methods like efficient GMM, continuously updated GMM, and LIML are alternative ways of handling the same underlying moment conditions.
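To make the just-identified case concrete, here is a minimal R sketch that computes the IV estimator directly from the moment condition on simulated data (the data-generating process and coefficient values are illustrative):

```r
# Sketch: just-identified IV computed from the moment condition, (Z'X)^{-1} Z'y
set.seed(7)
n <- 1000
z <- rnorm(n)                 # instrument
u <- rnorm(n)                 # unobserved confounder
x <- 0.8 * z + u + rnorm(n)   # endogenous regressor (correlated with u)
y <- 2 * x + u + rnorm(n)     # true coefficient on x is 2

Z <- cbind(1, z)              # instrument matrix (with constant)
X <- cbind(1, x)              # regressor matrix (with constant)

beta_iv  <- solve(t(Z) %*% X) %*% t(Z) %*% y   # IV estimator
beta_ols <- solve(t(X) %*% X) %*% t(X) %*% y   # OLS, biased here because x is endogenous

cbind(iv = beta_iv, ols = beta_ols)
```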


3.4 Hypothesis Testing

Hypothesis testing provides a framework for deciding whether observed data are consistent with a specific claim about the world. The logic is indirect: we assume the claim is true, ask how likely our data would be under that assumption, and reject the claim if the data are sufficiently unlikely.

This approach rests on several assumptions. First, we must specify a probability model linking parameters to data. Second, we need a test statistic whose distribution under the null hypothesis is known (exactly or approximately). Third, we choose a threshold for "sufficiently unlikely" before seeing the data. The frequentist framework interprets probability as long-run frequency: if we repeated the same procedure many times, how often would we make errors?

The Framework

Null hypothesis $H_0$: A statement about the parameter (often "no effect": $\theta = 0$). The null is typically a sharp claim—an exact value or set of values we test against.

Alternative hypothesis $H_1$: What we conclude if we reject $H_0$ (often $\theta \neq 0$). The alternative can be one-sided ($\theta > 0$) or two-sided ($\theta \neq 0$).

Test statistic: A function of the data with known distribution under $H_0$. Common examples include the t-statistic (for means and regression coefficients), the F-statistic (for joint hypotheses), and the chi-squared statistic (for categorical data).

Decision rule: Reject $H_0$ if the test statistic exceeds the critical value. The critical value is determined by the significance level $\alpha$—the maximum probability of false rejection we're willing to accept.

Errors

Type I error: Reject $H_0$ when it's true (false positive)

  • Probability = significance level $\alpha$

  • Convention: $\alpha = 0.05$

Type II error: Fail to reject $H_0$ when it's false (false negative)

  • Probability = $\beta$

  • Power = $1 - \beta$ = probability of detecting a true effect

P-Values

Definition 3.2 (P-Value): The p-value is the probability, under $H_0$, of observing a test statistic at least as extreme as the one computed.

Interpretation: Small p-value = data unlikely under $H_0$ = evidence against $H_0$

Common threshold: $p < 0.05$ often interpreted as "statistically significant"

Caution: P-values are widely misinterpreted. A p-value is NOT:

  • The probability that $H_0$ is true

  • The probability of a Type I error given this result

  • A measure of effect size or importance

The P-Value Controversy

The misuse of p-values has become a central concern in empirical research. In 2016, the American Statistical Association took the unprecedented step of releasing an official statement on statistical significance and p-values—the first time in its 177-year history that it had issued guidance on a specific matter of statistical practice (Wasserstein and Lazar 2016). The statement was prompted by growing concern about the "reproducibility crisis" and the recognition that "while p-values can be useful, they are commonly misused and misinterpreted."

The ASA's Six Principles

The statement articulated six principles:

  1. P-values can indicate how incompatible the data are with a specified statistical model. This is what p-values actually measure—the degree to which observed data conflict with what would be expected under the null hypothesis.

  2. P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone. This is the most common misinterpretation. A p-value of 0.03 does not mean there's a 3% chance the null is true; it means that if the null were true, data this extreme would occur 3% of the time.

  3. Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold. The practice of treating $p < 0.05$ as a bright line between "real" and "not real" effects has no scientific basis. Fisher never intended 0.05 as a rigid cutoff.

  4. Proper inference requires full reporting and transparency. P-hacking—selectively reporting tests that achieve significance while burying those that don't—invalidates the statistical logic underlying p-values.

  5. A p-value, or statistical significance, does not measure the size of an effect or the importance of a result. A tiny, trivial effect can be "highly significant" with enough data; a large, important effect can be "not significant" with too little data.

  6. By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis. Other approaches—effect sizes, confidence intervals, Bayesian methods, likelihood ratios—often convey more useful information.

Why This Matters Now

The ASA statement emerged from a specific historical moment. The replication crisis—the discovery that many published findings, especially in psychology and biomedicine, fail to replicate—shook confidence in the scientific literature. Investigations revealed that researcher degrees of freedom (choices about data analysis, sample definition, and outcome measurement) combined with publication bias (journals preferring significant results) produced literatures full of false positives.

P-values, while not the root cause, became a focal point. The binary publication rule—significant findings get published, null findings don't—created perverse incentives. Researchers learned to hunt for $p < 0.05$ through specification searches, data dredging, and selective reporting. The result was a literature where "significant" findings were often statistical flukes rather than real effects.

The Post $p < 0.05$ Era

The ASA convened a follow-up symposium in 2017 titled "Scientific Method for the 21st Century: A World Beyond $p < 0.05$." The discussions explored alternatives: Bayesian methods, likelihood ratios, effect sizes with confidence intervals, and even proposals to abandon significance testing entirely.

Several journals have responded. Some have banned or discouraged p-values. Others require reporting of effect sizes and confidence intervals alongside p-values. Pre-registration—specifying analyses before seeing data—has become increasingly common as a protection against p-hacking.

Practical Implications

For the researcher, the implications are:

  1. Report effect sizes, not just significance. Always interpret the magnitude of estimated effects, not just whether they cross an arbitrary threshold.

  2. Report confidence intervals. A 95% CI of $[0.02, 4.50]$ tells a very different story than $[1.80, 2.20]$, even if both exclude zero.

  3. Consider the prior probability. A "significant" finding from an underpowered study testing an implausible hypothesis is probably a false positive. Context matters.

  4. Don't treat $p = 0.049$ and $p = 0.051$ as categorically different. They are statistically indistinguishable and should be interpreted similarly.

  5. Be transparent about all analyses conducted. Report the full set of tests, not just those that "worked."

  6. Consider Bayesian alternatives when appropriate. Bayesian methods directly quantify the probability that hypotheses are true, conditional on data and priors—often what researchers actually want to know.

The ASA statement did not call for abandoning p-values. Rather, it called for using them properly: as one piece of evidence among many, never as a sole arbiter of truth, and always accompanied by effect sizes, confidence intervals, and substantive reasoning. The p-value is a tool, not a verdict.

Confidence Intervals

Definition 3.3 (Confidence Interval): A $(1-\alpha)$ confidence interval is a random interval $[L(X), U(X)]$ such that $P(\theta \in [L, U]) = 1 - \alpha$ before the data are observed.

Interpretation: If we repeated the experiment many times, $(1-\alpha)$ of the intervals would contain the true parameter.

Standard CI for mean (large sample): $\bar{X} \pm z_{\alpha/2} \cdot \frac{s}{\sqrt{n}}$

where $z_{\alpha/2} \approx 1.96$ for 95% confidence.

Worked Example: Wage Gap

Question: Is there a gender wage gap?

Data: We observe $n = 500$ workers. Sample mean wages are $52,000 for men and $47,000 for women. Sample standard deviations are $15,000 for men and $12,000 for women, with 250 workers in each group.

Hypotheses:

  • $H_0$: $\mu_{\text{men}} - \mu_{\text{women}} = 0$ (no gap)

  • $H_1$: $\mu_{\text{men}} - \mu_{\text{women}} \neq 0$ (there is a gap)

Test statistic: We use the two-sample t-test, which compares means across groups while accounting for sampling variability in both groups:

$$t = \frac{(\bar{Y}_{\text{men}} - \bar{Y}_{\text{women}}) - 0}{\sqrt{s_{\text{men}}^2/n_{\text{men}} + s_{\text{women}}^2/n_{\text{women}}}}$$

Substituting:

$$t = \frac{5000}{\sqrt{15000^2/250 + 12000^2/250}} = \frac{5000}{\sqrt{900000 + 576000}} = \frac{5000}{1215} \approx 4.12$$

P-value: For $t = 4.12$ with large degrees of freedom, $p < 0.0001$. Under the null hypothesis of no gap, a t-statistic this extreme would occur less than 0.01% of the time.

Conclusion: We reject the null hypothesis. The $5,000 gap is statistically significant—unlikely to arise from sampling variation alone.

95% CI: $5000 \pm 1.96 \times 1215 = [2618, 7382]$

Interpretation: We're 95% confident the true wage gap is between $2,618 and $7,382. Note that this tells us about the existence and magnitude of the gap, but not its cause. Observing a gap doesn't tell us whether it reflects discrimination, occupational sorting, differences in hours worked, or other factors. Causal interpretation requires the methods in Part III.
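A minimal R sketch of the same calculation from the summary statistics (the numbers are the hypothetical ones above; the normal approximation is reasonable at this sample size):

```r
# Sketch: two-sample t-test from summary statistics
mean_m <- 52000; sd_m <- 15000; n_m <- 250
mean_w <- 47000; sd_w <- 12000; n_w <- 250

diff <- mean_m - mean_w
se   <- sqrt(sd_m^2 / n_m + sd_w^2 / n_w)

t_stat <- diff / se
p_val  <- 2 * pnorm(-abs(t_stat))        # two-sided p-value, normal approximation
ci_95  <- diff + c(-1, 1) * 1.96 * se

round(c(t = t_stat, p = p_val, lower = ci_95[1], upper = ci_95[2]), 4)
```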


3.5 Bayesian vs. Frequentist Inference

Two Philosophies

Frequentist:

  • Parameters are fixed but unknown

  • Probability refers to long-run frequency

  • Inference via sampling distributions, p-values, confidence intervals

  • No prior information (or: prior is uniform/uninformative)

Bayesian:

  • Parameters have probability distributions

  • Probability represents degree of belief

  • Inference via updating prior to posterior

  • Prior information is explicit and influential

Bayesian Inference

Bayes' theorem for inference: $P(\theta | Data) = \frac{P(Data | \theta) \cdot P(\theta)}{P(Data)}$

Components:

  • $P(\theta | Data)$: Posterior distribution (updated belief)

  • $P(Data | \theta)$: Likelihood (how probable is this data given $\theta$?)

  • $P(\theta)$: Prior distribution (belief before data)

  • $P(Data)$: Marginal likelihood (normalizing constant)

Example: Estimating a proportion

Prior: $\theta \sim Beta(1, 1)$ (uniform on [0, 1])

Data: 7 successes in 10 trials

Likelihood: $Binomial(10, \theta)$

Posterior: $\theta | Data \sim Beta(1 + 7, 1 + 3) = Beta(8, 4)$

Posterior mean: $8/12 = 0.67$

Posterior 95% credible interval (equal-tailed): approximately $[0.39, 0.89]$
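The posterior summaries can be computed directly, as in this minimal R sketch of the conjugate update above:

```r
# Sketch: Beta-Binomial conjugate updating
a0 <- 1; b0 <- 1                   # Beta(1, 1) prior (uniform)
successes <- 7; trials <- 10

a_post <- a0 + successes           # 8
b_post <- b0 + trials - successes  # 4

post_mean <- a_post / (a_post + b_post)               # 8/12
cred_int  <- qbeta(c(0.025, 0.975), a_post, b_post)   # equal-tailed 95% credible interval

c(mean = post_mean, lower = cred_int[1], upper = cred_int[2])
```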

When to Use Which?

Frequentist advantages:

  • Objective (no prior choice required)

  • Established conventions and software

  • Easier to teach and communicate

  • Well-defined error rates (Type I, Type II)

Bayesian advantages:

  • Incorporates prior information explicitly

  • Direct probability statements about parameters

  • More natural for sequential updating

  • Better for small samples with informative priors

  • Straightforward handling of nuisance parameters

Can You Use Both?

Yes—and many researchers do. The frequentist-Bayesian distinction is sometimes treated as an ideological divide, but in practice the approaches are complementary.

Pragmatic mixing: A researcher might use frequentist methods for primary hypothesis tests (where conventions are well established) but Bayesian methods for prediction, hierarchical modeling, or sensitivity analysis. The key is understanding what each approach delivers: frequentist methods control long-run error rates; Bayesian methods deliver probability statements conditional on the model and prior.

Asymptotic equivalence: With large samples and diffuse priors, Bayesian posterior intervals often numerically coincide with frequentist confidence intervals. The philosophical interpretation differs—a 95% credible interval says "there's a 95% probability the parameter lies here given the data and prior," while a 95% confidence interval says "this procedure captures the true parameter 95% of the time"—but the numbers may be identical.

Where they diverge: The approaches differ most with small samples, strong priors, or complex models. In these settings, the choice matters and should be made deliberately.

In practice: Most applied economics is frequentist, but Bayesian methods are growing, especially for hierarchical models, meta-analysis, and forecasting. Macroeconomics increasingly relies on Bayesian estimation of DSGE models (Chapter 17). Machine learning often adopts a Bayesian perspective for regularization and uncertainty quantification.

Box: The Bayesian Origin of Regularization

Ridge regression and LASSO—the workhorses of machine learning—have a natural Bayesian interpretation. This connection reveals what regularization really does.

The equivalence:

| Frequentist Regularization | Bayesian Prior |
| --- | --- |
| Ridge: $\min \sum(Y_i - X_i\beta)^2 + \lambda\|\beta\|_2^2$ | Normal prior: $\beta_j \sim N(0, \sigma^2/\lambda)$ |
| LASSO: $\min \sum(Y_i - X_i\beta)^2 + \lambda\|\beta\|_1$ | Laplace prior: $\beta_j \sim Laplace(0, 1/\lambda)$ |
| Elastic Net: combines both | Mixture of Normal and Laplace |

What this means:

  • Ridge = Bayesian regression with the prior belief that coefficients are small (centered at zero)

  • LASSO = Bayesian regression with the prior belief that most coefficients are exactly zero (sparsity)

  • The penalty parameter $\lambda$ = How strongly you believe the prior relative to the data

Intuition: Regularization isn't just a "trick" to prevent overfitting—it's incorporating prior information that extreme coefficients are unlikely. The penalty encodes skepticism about large effects.
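To see the equivalence numerically, a minimal R sketch with the error variance normalized to one (an illustration of the closed-form ridge estimator, not of how packages such as glmnet scale the penalty):

```r
# Sketch: ridge estimate = Bayesian posterior mean under a N(0, sigma^2/lambda) prior
set.seed(10)
n <- 200; p <- 5
X <- matrix(rnorm(n * p), n, p)
beta_true <- c(2, 0, 0, 1, 0)
y <- X %*% beta_true + rnorm(n)

lambda <- 5

# Ridge: minimize ||y - Xb||^2 + lambda * ||b||^2  =>  (X'X + lambda I)^{-1} X'y
# With sigma^2 = 1 and prior beta_j ~ N(0, 1/lambda), this is also the posterior mean.
beta_ridge <- solve(t(X) %*% X + lambda * diag(p)) %*% t(X) %*% y
beta_ols   <- solve(t(X) %*% X) %*% t(X) %*% y

cbind(ridge = beta_ridge, ols = beta_ols)   # ridge coefficients are shrunk toward zero
```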

Why this matters for causal inference:

  • Regularization shrinks coefficients toward zero, introducing bias

  • For causal parameters, this bias can be a problem (see DML in Chapter 21)

  • But for nuisance parameters (propensity scores, outcome models), regularization often improves performance

  • Understanding the Bayesian interpretation clarifies what prior beliefs regularization imposes

Reference: Murphy (2012), Machine Learning: A Probabilistic Perspective, Chapter 7.


3.6 The Regression Framework

Linear Regression Model

$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + ... + \beta_k X_{ki} + \varepsilon_i$$

or in matrix form: $Y = X\beta + \varepsilon$

Components:

  • $Y$: Outcome (dependent variable)

  • $X$: Regressors (independent variables, covariates)

  • $\beta$: Coefficients (parameters to estimate)

  • $\varepsilon$: Error term (unobserved)

OLS Estimation

Ordinary Least Squares minimizes sum of squared residuals:

$$\hat{\beta}_{\text{OLS}} = \arg\min_{\beta} \sum_{i=1}^{n} (Y_i - X_i'\beta)^2$$

Solution (derived in the Technical Appendix):

$$\hat{\beta} = (X'X)^{-1}X'Y$$
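A minimal R sketch verifying the closed-form solution against R's built-in lm() on simulated data:

```r
# Sketch: OLS via the matrix formula versus lm()
set.seed(99)
n  <- 500
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n)

X <- cbind(1, x1, x2)                          # design matrix with intercept
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y   # (X'X)^{-1} X'Y

cbind(matrix_formula = beta_hat, lm = coef(lm(y ~ x1 + x2)))
```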

Classical Assumptions

For OLS to be the Best Linear Unbiased Estimator (BLUE):

  1. Linearity: $Y = X\beta + \varepsilon$

  2. Random sampling: $(Y_i, X_i)$ iid across observations

  3. No perfect multicollinearity: $X'X$ is invertible

  4. Zero conditional mean: $E[\varepsilon | X] = 0$

  5. Homoskedasticity: $Var(\varepsilon | X) = \sigma^2$

  6. Normality (for exact inference): $\varepsilon | X \sim N(0, \sigma^2)$

What Does "BLUE" Mean?

The Gauss-Markov theorem states that under assumptions 1-5, OLS is the Best Linear Unbiased Estimator. "Best" means minimum variance; "Linear" means a linear function of $Y$; "Unbiased" means $E[\hat{\beta}] = \beta$.

In the credibility revolution era (Chapter 1), researchers often focus less on BLUE and more on consistency and identification. Why? First, BLUE is a finite-sample result—with large samples, consistency matters more than minimum variance among linear estimators. Second, "best among linear estimators" ignores nonlinear alternatives that might be more efficient. Third, the BLUE property requires homoskedasticity, which rarely holds in practice. Modern applied work therefore relies on robust standard errors (discussed below) rather than trusting the classical variance formula.

Still, OLS remains the workhorse estimator. Even when BLUE doesn't strictly hold, OLS is often consistent, computationally simple, and well understood. The emphasis shifts from optimality theorems to practical concerns: Is the identifying assumption ($E[\varepsilon | X] = 0$) plausible? Are standard errors computed correctly?

Box: From Finite-Sample to Large-Sample Thinking

Graduate econometrics courses traditionally emphasize the Gauss-Markov theorem and BLUE. But modern applied work has shifted toward large-sample (asymptotic) properties:

| Finite-Sample Focus | Large-Sample Focus |
| --- | --- |
| Unbiasedness | Consistency |
| BLUE (minimum variance among linear) | Asymptotic efficiency |
| Exact normality assumption | CLT-based inference |
| Homoskedasticity required | Robust SEs handle heteroskedasticity |

Why the shift?

  1. Sample sizes grew: With thousands or millions of observations, consistency matters more than finite-sample unbiasedness

  2. Nonlinear methods became common: ML estimators, GMM, and IV are biased in finite samples but consistent—BLUE doesn't apply to them

  3. Robust inference became standard: Heteroskedasticity-robust and cluster-robust SEs don't require homoskedasticity

  4. Focus shifted to identification: Whether the estimand equals the causal effect matters more than whether it's BLUE

Practical implication: When evaluating an estimator, ask: "Is it consistent for the parameter I care about?" and "Are my standard errors valid?" rather than "Is it BLUE?"

Interpretation

Definition 3.4 (Regression Coefficient): $\beta_j$ is the expected change in $Y$ associated with a one-unit change in $X_j$, holding other regressors constant.

Caution: This is a conditional association, not necessarily a causal effect. $E[\varepsilon | X] = 0$ requires that there are no omitted variables correlated with both $X$ and $Y$—a strong assumption that usually fails in observational data (Chapter 11).

Box: Modeling Proportion Outcomes

OLS assumes the outcome can take any value on the real line. But many outcomes are bounded—most commonly proportions or rates that must lie between 0 and 1. Examples include vote shares, test score percentiles, budget allocations, and the share of income spent on a category. Applying OLS to proportions can produce nonsensical predictions (negative values or values exceeding 1) and biased estimates.

Four approaches to proportion outcomes:

| Approach | Model | Pros | Cons |
| --- | --- | --- | --- |
| Linear Probability Model (OLS) | $Y = X\beta + \varepsilon$ | Simple, familiar coefficients | Predictions outside [0,1]; heteroskedastic by construction |
| Logit transformation | $\log(Y/(1-Y)) = X\beta + \varepsilon$ | Constrains predictions | Undefined at $Y=0$ or $Y=1$; coefficients hard to interpret |
| Fractional logit (Papke & Wooldridge 1996) | Quasi-MLE with logistic mean | Handles 0s and 1s; robust | Coefficients on logit scale |
| Beta regression | $Y \sim Beta(\mu, \phi)$ with $g(\mu) = X\beta$ | Models full distribution; natural for proportions | Cannot handle exact 0s or 1s without extension |

Beta regression treats $Y$ as drawn from a Beta distribution, parameterized by mean $\mu$ and precision $\phi$. The mean is linked to covariates through a logit (or other) link function. This approach:

  • Naturally respects the [0,1] bounds

  • Allows modeling of heteroskedasticity (precision can vary with covariates)

  • Produces predictions that are always valid proportions

When to use what:

  • If proportions cluster near 0.5 and precision doesn't matter: OLS may suffice as approximation

  • If you have exact 0s or 1s in the data: Use fractional logit or zero/one-inflated beta

  • If you want to model the full distribution and have no boundary values: Beta regression

  • For causal inference: Report average marginal effects (Chapter 11) regardless of model, since coefficients from nonlinear models aren't directly interpretable as effects on $Y$

In R, beta regression is implemented in the betareg package; Bayesian estimation via brms. See Cribari-Neto & Zeileis (2010) for a comprehensive treatment.
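A minimal sketch of beta regression in R, assuming the betareg package is installed and using a simulated proportion outcome that lies strictly inside (0, 1):

```r
# Sketch: beta regression for a proportion outcome
library(betareg)

set.seed(3)
n  <- 300
x  <- rnorm(n)
mu <- plogis(-0.5 + 0.8 * x)                               # mean linked to x via logit
y  <- rbeta(n, shape1 = mu * 20, shape2 = (1 - mu) * 20)   # precision (phi) of 20

fit <- betareg(y ~ x)
summary(fit)   # coefficients are on the logit (link) scale
```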

Standard Errors and Inference

Under classical assumptions:

$$\hat{\beta} \sim N(\beta, \sigma^2(X'X)^{-1})$$

Standard error: The standard error of $\hat{\beta}_j$ is the square root of the $j$-th diagonal element of the variance matrix:

$$SE(\hat{\beta}_j) = \sqrt{\hat{\sigma}^2 \cdot [(X'X)^{-1}]_{jj}}$$

t-statistic: $t_j = \hat{\beta}_j / SE(\hat{\beta}_j)$

Robust standard errors (Huber-White): When homoskedasticity fails, the classical variance formula is wrong. Robust standard errors remain valid under heteroskedasticity:

$$\hat{V}_{\text{robust}} = (X'X)^{-1}\left(\sum_{i=1}^{n} \hat{\varepsilon}_i^2 X_i X_i'\right)(X'X)^{-1}$$

This "sandwich" formula (so called because the middle term is sandwiched between (XX)1(X'X)^{-1} terms) is now standard practice. In Stata, add , robust to regression commands; in R, use the sandwich package.

Clustered standard errors: When observations are correlated within groups (e.g., students within schools, workers within firms), standard errors must account for this clustering. Ignoring clustering leads to standard errors that are too small—often dramatically so—and inflated t-statistics.

Rule of Thumb: Cluster at the Level of Treatment Assignment

If treatment is assigned at the school level, cluster at the school level. If a policy affects all workers in a state, cluster at the state level. The intuition: units that receive the same treatment share common shocks, so their residuals are correlated. Clustering corrects for this.

Box: Common Clustering Mistakes and Edge Cases

Mistake 1: Clustering too narrowly If treatment varies at the state-year level but you cluster only at the individual level, you understate uncertainty. Always cluster at least at the level of treatment variation.

Mistake 2: Ignoring correlation structure With panel data, residuals may be correlated both within units over time and within time periods across units. Options:

  • Two-way clustering: cluster on both dimensions (in R, fixest accepts cluster = ~unit + time; in Stata, reghdfe supports multiway clustering)

  • Conservative: cluster at the higher level of aggregation

Mistake 3: Clustering on outcome-determined groups Don't cluster on variables affected by treatment (e.g., post-treatment occupation). Cluster on pre-determined or exogenous groupings.

Edge case: Multiple possible clustering levels With students in classrooms in schools in districts, cluster at the level where treatment varies:

  • Teacher training intervention → cluster at classroom or school

  • State policy change → cluster at state

  • When in doubt, cluster higher—this is conservative

Edge case: Regression discontinuity In RD designs, clustering is less clear. If assignment is based on a continuous running variable with no grouped structure, robust SEs may suffice. But if the running variable has mass points or groups, cluster accordingly.

See Abadie, Athey, Imbens & Wooldridge (2023) for a comprehensive treatment of when and how to cluster.

The mechanics: The clustered variance estimator sums residuals within clusters before squaring:

$$\hat{V}_{\text{cluster}} = (X'X)^{-1}\left(\sum_{g=1}^{G} \hat{u}_g \hat{u}_g'\right)(X'X)^{-1}$$

where $\hat{u}_g = \sum_{i \in g} X_i \hat{\varepsilon}_i$ is the sum of score contributions within cluster $g$.

How many clusters?: Cluster-robust SEs perform poorly with few clusters (roughly $G < 30$ to $50$). With small $G$, use:

  • Wild cluster bootstrap (see Chapter 13)

  • Cluster-robust t-tests with adjusted degrees of freedom

  • Randomization inference

Implementation:
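A minimal R sketch using the sandwich package's vcovCL (hypothetical grouped data; the regressor varies at the school level, so that is the clustering level):

```r
# Sketch: cluster-robust standard errors
library(sandwich)
library(lmtest)

set.seed(8)
G <- 50; n_per <- 20
school <- rep(1:G, each = n_per)
shock  <- rnorm(G)[school]             # common shock shared within each school
x      <- rnorm(G)[school]             # regressor varies only at the school level
y      <- 1 + 0.3 * x + shock + rnorm(G * n_per)

fit <- lm(y ~ x)
coeftest(fit, vcov = vcovCL(fit, cluster = school))   # cluster at the level of treatment
```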

The Bootstrap

The bootstrap is a simulation-based method for computing standard errors, confidence intervals, and other measures of statistical uncertainty. Rather than deriving analytical formulas (which can be difficult or unavailable for complex estimators), the bootstrap approximates the sampling distribution directly.

The idea: If we could repeatedly sample from the population, we'd see how much our estimate varies. We can't resample from the population, but we can resample from our sample. Treating the sample as if it were the population, we draw many "bootstrap samples" (random samples with replacement from the original data) and compute our statistic for each. The variability across bootstrap samples approximates the sampling variability.

The algorithm:

  1. From the original sample of size $n$, draw $B$ bootstrap samples (each of size $n$, with replacement)

  2. Compute the statistic $\hat{\theta}^{(b)}$ for each bootstrap sample $b = 1, ..., B$

  3. The bootstrap standard error is the standard deviation of $\{\hat{\theta}^{(1)}, ..., \hat{\theta}^{(B)}\}$

  4. Confidence intervals can be constructed from the percentiles (e.g., 2.5th and 97.5th for 95% CI)

When to use the bootstrap:

  • When analytical standard errors are unavailable or complex (e.g., median, quantile regressions, matching estimators)

  • When asymptotic approximations may be poor (small samples)

  • When you want confidence intervals for nonstandard quantities (ratios, differences in coefficients)

Cautions:

  • Bootstrap requires iid sampling or appropriate adjustments (e.g., block bootstrap for clustered data)

  • Some statistics have non-standard behavior that the bootstrap can't fix (e.g., unit root tests)

  • Computational cost: $B = 1000$ or more replications is typical

The bootstrap has become routine in applied economics. Software makes it easy: in R, the boot package provides general bootstrap functionality; in Stata, the bootstrap prefix works with most estimation commands.
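A minimal R sketch of the algorithm for the sample median, first by hand and then with the boot package (simulated skewed data):

```r
# Sketch: bootstrap standard error and percentile CI for the median
set.seed(2024)
x <- rexp(200, rate = 1)    # simulated skewed sample
B <- 2000

boot_medians <- replicate(B, median(sample(x, replace = TRUE)))
sd(boot_medians)                            # bootstrap standard error
quantile(boot_medians, c(0.025, 0.975))     # percentile 95% CI

# The same with the boot package
library(boot)
med_fun  <- function(data, idx) median(data[idx])
boot_out <- boot(x, statistic = med_fun, R = B)
boot.ci(boot_out, type = "perc")
```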

Worked Example: Returns to Education

Model: $\log(Wage_i) = \beta_0 + \beta_1 Education_i + \beta_2 Experience_i + \beta_3 Experience_i^2 + \varepsilon_i$

Data: Current Population Survey, working adults, n = 10,000

Results (hypothetical):

| Variable | Coefficient | SE | t-stat |
| --- | --- | --- | --- |
| Education | 0.085 | 0.004 | 21.3 |
| Experience | 0.035 | 0.003 | 11.7 |
| Experience² | -0.0005 | 0.0001 | -5.0 |
| Constant | 1.25 | 0.05 | 25.0 |

Interpretation:

  • Each additional year of education is associated with 8.5% higher wages

  • Experience has diminishing returns (positive linear, negative quadratic)

Caution: This is a descriptive regression. The 8.5% is the association between education and wages after controlling for experience. It is NOT the causal effect of education—people with more education differ in unobserved ways (ability, family background) that also affect wages. Identifying the causal effect requires methods from Chapters 10-14.
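A sketch of how this regression would be run in R, on simulated data standing in for the CPS extract (coefficients chosen to mimic the hypothetical table above), with robust standard errors as discussed earlier:

```r
# Sketch: Mincer-style wage regression with robust standard errors
library(sandwich)
library(lmtest)

set.seed(111)
n     <- 10000
educ  <- sample(8:20, n, replace = TRUE)
exper <- sample(0:40, n, replace = TRUE)
log_wage <- 1.25 + 0.085 * educ + 0.035 * exper - 0.0005 * exper^2 + rnorm(n, sd = 0.5)

fit <- lm(log_wage ~ educ + exper + I(exper^2))
coeftest(fit, vcov = vcovHC(fit, type = "HC1"))
```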


Practical Guidance

Choosing Methods

| Situation | Recommended Approach |
| --- | --- |
| Large sample, standard inference | Frequentist with robust SEs |
| Small sample, strong priors | Bayesian |
| Clustered data | Clustered SEs or hierarchical models |
| Heteroskedasticity suspected | Robust SEs (always safe) |
| Complex hypothesis (joint test) | F-test or likelihood ratio |

Common Pitfalls

Pitfall 1: P-value worship. Treating $p < 0.05$ as meaningful and $p > 0.05$ as meaningless.

How to avoid: Report effect sizes and confidence intervals, not just significance. A large, imprecisely estimated effect may be more important than a small, precisely estimated one.

Pitfall 2: Confusing statistical and practical significance. A tiny effect can be "statistically significant" with large $n$.

How to avoid: Always interpret magnitude. Ask: is this effect economically meaningful?

Pitfall 3: Ignoring multiple testing. Testing many hypotheses inflates the false positive rate.

How to avoid: Adjust for multiple comparisons (Bonferroni, FDR). Report all tests conducted.

Pitfall 4: Forgetting model assumptions. Using standard errors that assume homoskedasticity when it fails.

How to avoid: Use robust or clustered SEs by default. Test assumptions when possible.


Summary

Key takeaways:

  1. Probability provides the foundation: Random variables, distributions, and expectations are the language of statistical inference.

  2. Sampling distributions connect sample to population: The CLT justifies normal-based inference for sample means.

  3. Estimation has desirable properties: Unbiasedness, consistency, and efficiency guide our choice of estimators.

  4. Hypothesis testing trades off error types: Significance level controls Type I error; power addresses Type II.

  5. P-values are tools, not verdicts: The ASA statement reminds us to report effect sizes, use confidence intervals, and never treat $p < 0.05$ as a bright line.

  6. Frequentist and Bayesian approaches offer different perspectives: Neither is "correct"; both are useful.

  7. Regression is fundamental: OLS provides a workhorse, but causal interpretation requires additional assumptions.

Returning to the opening question: We learn from data by understanding that our observations are random realizations of an underlying process. Statistical inference—through estimation, testing, and interval construction—allows us to reason about the process while acknowledging the uncertainty inherent in seeing only one sample.


Further Reading

Essential

  • Angrist and Pischke (2009), Mostly Harmless Econometrics, Ch. 3 - Regression review for economists

  • Freedman (2009), Statistical Models: Theory and Practice - Clear treatment of foundations

For Deeper Understanding

  • Casella and Berger (2002), Statistical Inference - Comprehensive mathematical statistics

  • Gelman et al. (2013), Bayesian Data Analysis - Modern Bayesian treatment

  • Hansen (2022), Econometrics - Graduate econometrics textbook (free online)

Historical/Philosophical

  • Stigler (1986), The History of Statistics - Fascinating history

  • Hacking (2001), An Introduction to Probability and Inductive Logic - Philosophical foundations

  • Mayo (2018), Statistical Inference as Severe Testing - Philosophy of frequentist inference

On Statistical Practice and the P-Value Debate

  • Wasserstein and Lazar (2016), "The ASA's Statement on p-Values: Context, Process, and Purpose," The American Statistician 70(2): 129-133 - The landmark ASA statement

  • Wasserstein, Schirm, and Lazar (2019), "Moving to a World Beyond 'p < 0.05'," The American Statistician 73(sup1): 1-19 - Follow-up with 43 invited commentaries

  • Benjamin et al. (2018), "Redefine Statistical Significance," Nature Human Behaviour 2: 6-10 - Proposal to lower threshold to 0.005

  • McShane et al. (2019), "Abandon Statistical Significance," The American Statistician 73(sup1): 235-245 - The case against bright-line thresholds

Applications

  • Imbens and Rubin (2015), Causal Inference for Statistics, Social, and Biomedical Sciences - Connects statistics to causation

  • Gelman and Hill (2007), Data Analysis Using Regression and Multilevel/Hierarchical Models - Applied Bayesian regression

Specialized Regression Models

  • Papke and Wooldridge (1996), "Econometric Methods for Fractional Response Variables with an Application to 401(k) Plan Participation Rates," Journal of Applied Econometrics 11(6): 619-632 - Fractional logit for proportions

  • Cribari-Neto and Zeileis (2010), "Beta Regression in R," Journal of Statistical Software 34(2): 1-24 - Comprehensive treatment of beta regression with R implementation


Exercises

Conceptual

  1. Explain the difference between a parameter and a statistic. Why is the sample mean a statistic but the population mean a parameter?

  2. A researcher reports a "marginally significant" result with $p = 0.08$. Another reports a "highly significant" result with $p = 0.001$. What can and can't we conclude from these p-values?

  3. Explain the Central Limit Theorem in plain language. Why does it matter for statistical practice?

Applied

  1. Generate 1,000 samples of size n=30 from an exponential distribution. For each sample, compute the sample mean and construct a 95% confidence interval for the population mean. What proportion of intervals contain the true mean? Does this match theory?

  2. Using CPS or similar data:

    • Estimate a Mincer wage regression (log wage on education, experience, experience²)

    • Report standard errors: (a) assuming homoskedasticity, (b) robust

    • Do the conclusions change?

Discussion

  1. Bayesian methods require specifying a prior distribution. Critics argue this makes results subjective. Defenders argue that frequentist methods also embed assumptions (just less explicitly). Who has the stronger argument?

  2. A medical researcher tests whether a new drug reduces blood pressure. The study finds a mean reduction of 2 mmHg with $p = 0.03$ and a 95% CI of $[0.2, 3.8]$. The researcher concludes: "There is a 97% probability that the drug works." A newspaper reports: "Scientists prove new drug lowers blood pressure." Evaluate both statements in light of the ASA's six principles. What would be a more accurate summary of these findings?

  3. Consider these three hypothetical studies:

    • Study A: Tests a new physics theory with strong prior theoretical support. Finds $p = 0.04$ with $n = 50$.

    • Study B: Tests whether birth month affects personality. Finds $p = 0.01$ with $n = 10,000$.

    • Study C: Replicates a well-established finding. Finds $p = 0.08$ with $n = 200$.

    Rank these from most to least convincing evidence, and explain your reasoning. Why might a smaller p-value not always indicate stronger evidence?


Technical Appendix

A. Derivation of OLS Estimator

From the first-order conditions of minimizing $\sum_i (Y_i - X_i'\beta)^2$:

$$\frac{\partial}{\partial \beta}\sum_i (Y_i - X_i'\beta)^2 = -2\sum_i X_i(Y_i - X_i'\beta) = 0$$

$$\Rightarrow X'Y = X'X\beta \Rightarrow \hat{\beta} = (X'X)^{-1}X'Y$$

B. Properties of OLS Under Classical Assumptions

Unbiasedness: $E[\hat{\beta}|X] = E[(X'X)^{-1}X'Y|X] = (X'X)^{-1}X'E[Y|X] = (X'X)^{-1}X'X\beta = \beta$

Variance: $Var(\hat{\beta}|X) = (X'X)^{-1}X'Var(Y|X)X(X'X)^{-1} = \sigma^2(X'X)^{-1}$

Gauss-Markov Theorem: Under assumptions 1-5, OLS is BLUE (Best Linear Unbiased Estimator).

C. Asymptotic Distribution

Under weaker conditions (finite fourth moments, $E[X_i X_i']$ invertible):

$$\sqrt{n}(\hat{\beta} - \beta) \xrightarrow{d} N(0, E[X_i X_i']^{-1}E[\varepsilon_i^2 X_i X_i']E[X_i X_i']^{-1})$$

This is the "sandwich" form that justifies robust standard errors.
