Chapter 10: Experimental Methods

Opening Question

When can we randomize our way to causal knowledge—and when does randomization fall short?


Chapter Overview

Randomized experiments hold a privileged place in causal inference. By randomly assigning treatment, we break the link between treatment and confounders, allowing simple comparisons to reveal causal effects. No other method achieves this so directly.

Yet experiments are not a panacea. They may be infeasible, unethical, or too costly. When feasible, they face threats from attrition, non-compliance, and spillovers. Even a perfect experiment may tell us little about effects in other contexts. The 2019 Nobel Prize in Economics to Banerjee, Duflo, and Kremer recognized the power of field experiments in development—but also sparked renewed debate about their limitations.

This chapter covers the design, analysis, and interpretation of randomized experiments. We explain why randomization works, how to design experiments well, what can go wrong, and when experiments are the right tool. The goal is neither uncritical enthusiasm nor reflexive skepticism, but a clear-eyed understanding of what experiments can and cannot deliver.

What you will learn:

  • Why random assignment identifies causal effects

  • How to design experiments: power analysis, stratification, clustering

  • The distinction between ITT and treatment-on-treated effects

  • Major threats to experimental validity and how to address them

  • When experiments are feasible, ethical, and informative

Prerequisites: Chapter 9 (The Causal Framework)


10.1 Why Randomization Works

The Logic of Random Assignment

Recall from Chapter 9 the fundamental problem: we cannot observe the same unit in both treatment and control states. Causal inference requires assumptions about the unobserved counterfactuals.

Randomization solves this problem elegantly. If we randomly assign units to treatment and control, the two groups are comparable in expectation. Any differences in outcomes can be attributed to treatment—not to pre-existing differences between groups.

Formally, random assignment ensures:

Independence Assumption

(Y(0), Y(1)) \perp D

Potential outcomes are independent of treatment assignment. Treatment is unrelated to who would benefit (or be harmed) by it.

Under this assumption, the selection bias term from Chapter 9 vanishes:

E[Y(0) | D = 1] = E[Y(0) | D = 0] = E[Y(0)]

The treated and control groups have the same expected potential outcomes under control. Any difference in observed outcomes reflects the treatment effect.

Figure 10.1: Covariate Balance in Randomized vs. Observational Studies. Randomization ensures balance on all covariates—observed and unobserved. Observational studies often show systematic imbalance, creating selection bias.

The Simple Difference in Means

With random assignment, causal effects are identified by simple comparisons:

\hat{\tau} = \bar{Y}_1 - \bar{Y}_0 = \frac{1}{n_1}\sum_{D_i=1} Y_i - \frac{1}{n_0}\sum_{D_i=0} Y_i

This difference-in-means estimator is:

  • Unbiased: E[\hat{\tau}] = \text{ATE}

  • Consistent: \hat{\tau} \xrightarrow{p} \text{ATE} as n \to \infty

  • Simple: No modeling assumptions required

The variance is: \text{Var}(\hat{\tau}) = \frac{\sigma_1^2}{n_1} + \frac{\sigma_0^2}{n_0}

where \sigma_1^2 and \sigma_0^2 are the outcome variances in treatment and control.
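A minimal sketch in base R of the difference-in-means estimator and its standard error, using simulated data (all names and parameter values are illustrative):

```r
set.seed(1)

# Simulated experiment: 500 units, half randomly assigned to treatment
n <- 500
d <- sample(rep(c(0, 1), n / 2))     # random assignment
y <- 2 + 1.5 * d + rnorm(n)          # true treatment effect = 1.5

y1 <- y[d == 1]
y0 <- y[d == 0]

# Difference-in-means estimate and its standard error
tau_hat <- mean(y1) - mean(y0)
se_hat  <- sqrt(var(y1) / length(y1) + var(y0) / length(y0))

c(estimate = tau_hat, std_error = se_hat)
```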

Regression Analysis of Experiments

Although simple differences suffice, regression is often used:

Y_i = \alpha + \tau D_i + \varepsilon_i

With random assignment, the OLS estimate \hat{\tau} equals the difference in means. Why use regression?

  1. Convenience: Standard errors and tests come automatically

  2. Covariates: Adding baseline covariates can improve precision

  3. Subgroups: Interactions allow heterogeneity analysis

Adding Covariates in Experiments

Y_i = \alpha + \tau D_i + X_i'\beta + \varepsilon_i

With random assignment, covariates do not change \hat{\tau} in expectation—but they can reduce variance and improve precision. This is because covariates explain some of the outcome variation, reducing residual variance.
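A minimal sketch comparing the unadjusted and covariate-adjusted regressions on simulated data; with random assignment both target the same effect, but the adjusted model typically has a smaller standard error (variable names are illustrative):

```r
set.seed(2)

n <- 500
x <- rnorm(n)                          # baseline covariate that predicts y
d <- sample(rep(c(0, 1), n / 2))       # random assignment, independent of x
y <- 1 + 1.0 * d + 2.0 * x + rnorm(n)  # true treatment effect = 1.0

fit_simple   <- lm(y ~ d)              # equivalent to the difference in means
fit_adjusted <- lm(y ~ d + x)          # covariate-adjusted

# Same estimand; the adjusted regression shows a smaller SE on d
summary(fit_simple)$coefficients["d", c("Estimate", "Std. Error")]
summary(fit_adjusted)$coefficients["d", c("Estimate", "Std. Error")]
```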

What Random Assignment Does Not Do

Random assignment is powerful, but it does not solve all problems:

  1. It does not ensure balance in finite samples: Random assignment balances groups in expectation, but any particular randomization may produce imbalanced groups by chance.

  2. It does not prevent attrition: If subjects drop out differentially by treatment status, balance is compromised.

  3. It does not prevent non-compliance: Subjects may not take the treatment they are assigned.

  4. It does not prevent spillovers: Treatment may affect control group outcomes.

  5. It does not establish external validity: Effects in the experimental sample may not generalize.

The rest of this chapter addresses these challenges.


10.2 Experimental Design

Sample Size and Power

Statistical power is the probability of detecting a true effect. Underpowered experiments waste resources and may produce misleading null results.

Power depends on:

  • Effect size (\tau): Larger effects are easier to detect

  • Sample size (n): More observations increase power

  • Outcome variance (\sigma^2): Less noisy outcomes increase power

  • Significance level (\alpha): Lower \alpha reduces power

The standard power formula for a two-sample t-test:

n = \frac{2(z_{1-\alpha/2} + z_{1-\beta})^2 \sigma^2}{\tau^2}

where:

  • z_{1-\alpha/2} is the critical value for significance level \alpha

  • z_{1-\beta} is the critical value for power 1-\beta

  • \sigma^2 is the outcome variance

  • \tau is the minimum detectable effect

Example: Power Calculation

Suppose we want 80% power (\beta = 0.20) at the 5% level (\alpha = 0.05) to detect an effect of 0.2 standard deviations (so \sigma = 1 in effect-size units).

With z_{0.975} = 1.96 and z_{0.80} = 0.84:

n = \frac{2(1.96 + 0.84)^2}{0.2^2} = \frac{2 \times 7.84}{0.04} = 392

We need about 400 subjects per arm (800 total).
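A minimal sketch of this calculation in R, using both the normal-approximation formula from the text and the built-in power.t.test (which uses the t distribution and gives a slightly larger per-arm n):

```r
alpha <- 0.05
power <- 0.80
tau   <- 0.2   # minimum detectable effect, in SD units
sigma <- 1     # outcome SD

# Normal-approximation formula: sample size per arm
n_formula <- 2 * (qnorm(1 - alpha / 2) + qnorm(power))^2 * sigma^2 / tau^2
n_formula   # about 392 per arm

# Two-sample t-test power calculation (base R, stats package)
power.t.test(delta = tau, sd = sigma, sig.level = alpha, power = power,
             type = "two.sample", alternative = "two.sided")
```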

Figure 10.2: Statistical Power Curves. Power increases with sample size, but the relationship depends strongly on effect size. Detecting small effects (d=0.2) requires much larger samples than detecting large effects (d=0.8). The 80% power threshold is a common target.

Stratification and Blocking

Stratification (or blocking) improves precision by ensuring balance on key covariates:

  1. Divide subjects into strata based on baseline characteristics

  2. Randomize separately within each stratum

  3. Analyze with stratum fixed effects or weights
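A minimal sketch of stratified (blocked) randomization in base R, assuming a single binary stratifying variable and a 50-50 split within each stratum (variable names are illustrative):

```r
set.seed(42)

# Illustrative baseline data: stratify on a binary risk indicator
n      <- 200
strata <- sample(c("high_risk", "low_risk"), n, replace = TRUE)
d      <- integer(n)

# Randomize separately within each stratum: exactly half treated per stratum
for (s in unique(strata)) {
  idx        <- which(strata == s)
  treated    <- sample(idx, size = floor(length(idx) / 2))
  d[treated] <- 1L
}

# Balance on the stratifying variable holds by construction
table(strata, d)
```

At the analysis stage, the stratum indicators would enter as fixed effects, e.g. lm(y ~ d + factor(strata)), consistent with step 3 above.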

Benefits:

  • Guarantees balance on stratifying variables

  • Reduces variance, increasing power

  • Enables subgroup analysis

Practical Box: What to Stratify On

Stratify on variables that:

  • Strongly predict the outcome

  • Might be imbalanced by chance

  • Define subgroups of interest

Common choices: baseline outcome, geographic region, demographics. Don't over-stratify—too many strata with few units per stratum create problems.

Figure 10.3: Stratified vs. Simple Randomization. The left panel shows the distribution of high-risk subjects in the treatment group across 1,000 simple randomizations—substantial variation occurs by chance. The right panel shows stratified randomization, which guarantees exact balance on the stratifying variable in every realization.

Cluster Randomization

Sometimes individual randomization is infeasible or undesirable. Cluster randomization assigns treatment at the group level:

  • Schools, not students

  • Villages, not households

  • Clinics, not patients

Why cluster?

  • Logistical necessity (cannot treat individuals differently within a classroom)

  • Reduce spillovers (if treatment affects nearby individuals)

  • Policy relevance (policies often implemented at group level)

The cost of clustering: Effective sample size is reduced. With m clusters of size k:

n_{eff} = \frac{mk}{1 + (k-1)\rho}

where \rho is the intraclass correlation coefficient (ICC)—the fraction of variance between clusters.

Example: Clustering Dramatically Reduces Power

With 100 clusters of 50 students each (5,000 total) and ICC = 0.10:

n_{eff} = \frac{5000}{1 + 49 \times 0.10} = \frac{5000}{5.9} \approx 847

The effective sample size is only 847, not 5,000. Power calculations must account for this.
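A minimal sketch of the design-effect adjustment in R, reproducing the numbers in this example:

```r
m   <- 100    # number of clusters
k   <- 50     # cluster size
icc <- 0.10   # intraclass correlation coefficient

deff  <- 1 + (k - 1) * icc     # design effect = 5.9
n_eff <- (m * k) / deff        # effective sample size, about 847

c(design_effect = deff, n_effective = round(n_eff))
```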

Randomization Inference

Classical inference assumes random sampling from a population and relies on asymptotic approximations. Randomization inference (also called permutation inference or Fisher's exact test) derives p-values from the randomization itself—valid even with tiny samples.

The Sharp Null Hypothesis: H_0: Y_i(1) = Y_i(0) for all units i

Under this null, treatment has zero effect on every unit. This is stronger than the usual null of zero average effect.

The Logic: Under the sharp null, each unit's outcome is fixed regardless of treatment assignment. The only randomness comes from which units received treatment. We can compute what the test statistic would have been under every possible randomization.

Algorithm: Randomization Inference

  1. Compute the observed test statistic (e.g., the difference in means) under the actual assignment.

  2. Holding outcomes fixed (as the sharp null implies), re-randomize treatment assignment many times using the same procedure as the original design, or enumerate all possible assignments.

  3. Recompute the test statistic under each alternative assignment.

  4. The p-value is the share of re-randomized statistics at least as extreme as the observed statistic.

Implementation in R:
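The following is a minimal sketch for a simple two-group design with simulated data, using base R only; for stratified or clustered designs, the re-randomization step should mirror the actual assignment procedure.

```r
set.seed(3)

# Simulated experiment with a modest sample
n <- 60
d <- sample(rep(c(0, 1), n / 2))
y <- 5 + 0.8 * d + rnorm(n)

# Test statistic: difference in means
stat <- function(y, d) mean(y[d == 1]) - mean(y[d == 0])
obs  <- stat(y, d)

# Re-randomize assignment many times under the sharp null
B <- 5000
perm_stats <- replicate(B, stat(y, sample(d)))

# Two-sided randomization p-value
p_value <- mean(abs(perm_stats) >= abs(obs))
c(observed = obs, p_value = p_value)
```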

Key advantages:

  • Valid in finite samples (no asymptotic approximation)

  • Does not require distributional assumptions

  • Automatically accounts for actual randomization procedure (clustering, stratification, blocking)

  • Particularly valuable when n is small or distributions are skewed

Limitation: Tests the sharp null of zero effect for everyone, not just zero average effect. Rejection means treatment affected at least some units, not necessarily that the average effect is nonzero.

Optimal and Adaptive Designs

Standard experiments assign treatment with fixed probability (often 50-50). More sophisticated designs can improve efficiency or serve multiple goals simultaneously.

Optimal treatment allocation

When groups have different outcome variances or sizes, equal allocation is suboptimal. The variance-minimizing allocation assigns more treated units to groups with higher outcome variance:

p^*_g \propto \sigma_g

Example: In a wage experiment, if wage variance is higher for college graduates than non-graduates, allocate more treatment to college graduates.

Neyman Allocation for Stratified Experiments

With strata g = 1, \ldots, G, the optimal within-stratum treatment proportion minimizes the variance of the overall ATE estimator. This allocates more observations to strata with higher outcome variance or larger population share.
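A minimal sketch of Neyman-style allocation across strata, assuming known stratum population shares and outcome standard deviations (all numbers and stratum names are purely illustrative):

```r
# Illustrative strata: population shares and outcome SDs
share <- c(young = 0.3, middle = 0.5, older = 0.2)
sigma <- c(young = 4.0, middle = 2.0, older = 1.0)

# Neyman allocation: sample share proportional to (population share x SD)
weights    <- share * sigma
allocation <- weights / sum(weights)

n_total <- 1000
round(allocation * n_total)   # observations allocated to each stratum
```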

Multi-arm experiments

When comparing multiple treatments (not just treatment vs. control), power calculations must account for multiple comparisons. Common designs:

  • Factorial designs: Test combinations of factors efficiently (e.g., price × messaging)

  • Balanced incomplete block designs: Not everyone sees every treatment, but comparisons are balanced

Adaptive experiments (bandits)

In some settings, we want to learn while deciding—assigning treatment based on accumulating evidence:

Box: Multi-Armed Bandits and Adaptive Assignment

In a bandit problem, assignment probabilities adjust based on observed outcomes. Arms (treatments) that appear more effective get assigned more often.

Thompson Sampling: Maintain a posterior distribution over each arm's effectiveness; sample from posteriors and assign the arm with the highest sampled value.

Explore-exploit tradeoff: Early in the experiment, explore all arms to learn their effects. Later, exploit by assigning the best-performing arm more often.
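A minimal sketch of Thompson sampling for binary outcomes, assuming Beta-Bernoulli arms with uniform priors; the true success rates are illustrative and would be unknown in practice:

```r
set.seed(4)

true_rates <- c(arm_A = 0.10, arm_B = 0.15, arm_C = 0.12)  # unknown in practice
n_arms     <- length(true_rates)
successes  <- rep(0, n_arms)
failures   <- rep(0, n_arms)

for (t in 1:2000) {
  # Draw once from each arm's Beta posterior (uniform prior)
  draws <- rbeta(n_arms, successes + 1, failures + 1)
  arm   <- which.max(draws)                 # assign the arm with the highest draw
  y     <- rbinom(1, 1, true_rates[arm])    # observe the binary outcome
  successes[arm] <- successes[arm] + y
  failures[arm]  <- failures[arm] + (1 - y)
}

# Better-performing arms accumulate more assignments over time
successes + failures
```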

When to use adaptive designs:

  • Ethical imperative to minimize harm (clinical trials)

  • Objective is welfare during the experiment, not just learning

  • Large-scale tech experiments where regret is costly

Limitations for causal inference:

  • Inference is more complex (assignment depends on previous outcomes)

  • May not have enough data on some arms for precise estimates

  • Standard confidence intervals don't apply without adjustment

Key references: Russo et al. (2018) for bandits; Hadad et al. (2021) for inference in adaptive experiments.


10.3 Types of Experiments

Laboratory Experiments

Lab experiments bring subjects to a controlled environment:

Advantages:

  • Maximum control over environment and treatment

  • Can implement complex treatments and measure fine-grained outcomes

  • Relatively cheap per observation

Limitations:

  • Artificial setting may not reflect real-world behavior

  • Subject pools (often students) may not be representative

  • Demand effects: subjects may behave to please the experimenter

  • Hawthorne effects: behavior changes from being observed

Common applications:

  • Behavioral economics (risk, time, social preferences)

  • Psychology (cognition, decision-making)

  • Game theory (strategic behavior)

Field Experiments

Field experiments randomize in real-world settings:

Advantages:

  • Natural environment and real stakes

  • Outcomes are actual behavior, not stated intentions

  • Policy-relevant by design

Limitations:

  • Less control over implementation

  • More expensive and time-consuming

  • Ethical and logistical constraints

  • May still have limited external validity

Examples:

  • Microfinance RCTs across developing countries

  • Get-out-the-vote experiments

  • Resume audit studies for discrimination

Running Example: Microfinance

Banerjee et al. (2015) conducted randomized evaluations of microfinance in six countries. Villages were randomly assigned to receive microfinance access or not. The studies found modest effects on business outcomes, no transformation of poverty, and considerable heterogeneity across contexts.

This example illustrates both the power of experiments (clean identification of average effects) and their limitations (effects varied across sites; mechanisms remained unclear).

Meager (2019) later showed how to synthesize these experiments using Bayesian hierarchical models. Her analysis revealed that roughly 60% of the observed cross-study variation was sampling error; the studies were actually more consistent than they appeared. The framework also enabled prediction of effects in new sites, directly addressing external validity. This demonstrates how multiple experiments can be combined to learn more than any single study could reveal (see Chapter 24).

Natural Field Experiments

Some studies embed randomization in naturally occurring processes without subjects knowing they are in an experiment:

  • Audit studies (send resumes with randomly varied characteristics)

  • Direct mail experiments (randomly vary marketing messages)

  • A/B testing on platforms

These combine the control of experiments with the naturalness of observational settings.

Online Experiments and A/B Testing

Digital platforms enable massive-scale experimentation:

Advantages:

  • Huge sample sizes at low cost

  • Rapid iteration

  • Precise measurement of digital outcomes

Limitations:

  • Outcomes limited to platform behavior

  • User populations may not be representative

  • Ethical concerns about manipulation without consent

  • Multiple simultaneous tests create inference problems

Example: A/B Testing at Scale

Tech companies run thousands of experiments yearly. A typical test might randomize users to different website layouts and measure click-through rates. Sample sizes can be millions, detecting effects of 0.1% with high precision.

But: Effect sizes in user experience are often tiny. Statistically significant effects may be practically meaningless.

Survey Experiments

Survey experiments randomize within questionnaires:

  • Randomly vary question wording, order, or information provided

  • Measure how responses change

Applications:

  • Measuring sensitive attitudes (list experiments, endorsement experiments)

  • Testing framing effects

  • Conjoint analysis for preferences

Advantages: Cheap, fast, large samples

Limitations: Measures stated preferences, not behavior


10.4 Threats to Validity

Internal Validity Threats

Internal validity asks: Did the experiment correctly estimate the causal effect for the study population?

Figure 10.4: Threats to Internal and External Validity. Internal validity threats (left) compromise the causal estimate within the study—attrition, non-compliance, spillovers, and experimenter effects. External validity threats (right) limit generalization beyond the study—site selection, volunteer bias, and context dependence. Good experiments address internal threats through design and acknowledge external validity limitations.

Attrition

Subjects who drop out may differ from those who remain. If attrition relates to treatment, balance is compromised.

Diagnosis:

  • Compare attrition rates across arms

  • Compare baseline characteristics of attriters vs. completers

  • Test whether attrition predicts treatment

Solutions:

  • Minimize attrition through design (incentives, tracking)

  • Report bounds under different assumptions about missing outcomes

  • Lee bounds: Trim the group with less attrition to restore balance

Non-Compliance

Subjects may not take the treatment they are assigned:

  • One-sided: Some assigned to treatment don't take it

  • Two-sided: Some assigned to control take treatment anyway

With non-compliance, we distinguish:

Intent-to-Treat (ITT)

\text{ITT} = E[Y | Z = 1] - E[Y | Z = 0]

The effect of assignment to treatment, regardless of actual treatment.

Treatment-on-Treated (TOT) / LATE

\text{TOT} = \frac{\text{ITT}}{\text{Compliance Rate}} = \frac{E[Y | Z = 1] - E[Y | Z = 0]}{E[D | Z = 1] - E[D | Z = 0]}

The effect for those who comply with their assignment. This is the LATE from Chapter 12.

ITT is always identified. TOT requires assuming no defiers (monotonicity) and interprets the effect as applying to compliers only.
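A minimal sketch of the ITT and TOT/LATE calculations on simulated data with one-sided non-compliance (names and numbers are illustrative):

```r
set.seed(5)

n <- 2000
z <- sample(rep(c(0, 1), n / 2))        # random assignment (the offer)
complier <- rbinom(n, 1, 0.6)           # 60% would take up treatment if offered
d <- z * complier                       # one-sided non-compliance
y <- 10 + 2.0 * d + rnorm(n)            # effect = 2 for those who take treatment

itt         <- mean(y[z == 1]) - mean(y[z == 0])
first_stage <- mean(d[z == 1]) - mean(d[z == 0])   # compliance rate
tot         <- itt / first_stage                   # Wald / LATE estimate

c(ITT = itt, compliance = first_stage, TOT = tot)
```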

When to Report What

  • ITT: Always report. It is the effect of the policy (offering treatment).

  • TOT: Report when the effect of actually receiving treatment is of interest. Be clear about the complier population.

Figure 10.5: Compliance Types and ITT vs. TOT. The left panel shows the four compliance types: compliers follow their assignment, always-takers take treatment regardless, never-takers refuse treatment regardless, and defiers do the opposite of their assignment. The right panel illustrates how ITT (the reduced-form effect) and TOT (the effect scaled by the compliance rate) relate—TOT equals ITT divided by the first stage.

Spillovers and Interference

The Stable Unit Treatment Value Assumption (SUTVA) requires that one unit's treatment doesn't affect another's outcome. Spillovers violate SUTVA:

  • Vaccination reduces disease for unvaccinated (positive spillover)

  • Job training helps trainees but may hurt non-trainees competing for jobs (negative spillover)

  • Information interventions spread through networks

Diagnosis:

  • Test whether control outcomes vary with treatment density nearby

  • Measure outcomes for indirect contacts

Solutions:

  • Cluster randomization at a level where spillovers are contained

  • Design studies to estimate spillover effects directly

  • Partial population experiments: randomize treatment intensity

Hawthorne and Experimenter Effects

Being in an experiment may change behavior:

  • Hawthorne effect: Subjects behave differently because they're observed

  • Experimenter demand: Subjects try to confirm what they think the experimenter wants

Mitigation:

  • Blind subjects to treatment when possible (placebo controls)

  • Minimize salience of observation

  • Natural field experiments where subjects don't know they're in a study

External Validity Threats

External validity asks: Do findings generalize beyond the study?

Sample Selection

Experimental samples are often unrepresentative:

  • Volunteers differ from non-volunteers

  • Sites are chosen for feasibility, not representativeness

  • Developing country field experiments may not inform developed country policy

Site Selection

Researchers may choose sites where effects are likely large:

  • Programs implemented by capable NGOs

  • Motivated local partners

  • Contexts where treatment should work

Results may not replicate under routine implementation.

Equilibrium Effects

Small-scale experiments cannot capture market-wide effects:

  • A job training program helps trainees but would depress wages if scaled

  • A school voucher experiment cannot reveal effects of universal vouchers

The Deaton Critique

Angus Deaton (2010) articulated several concerns about the RCT movement:

  1. External validity: Experiments identify effects in specific contexts. Without theory, we cannot extrapolate.

  2. Mechanisms: Experiments show that something works, not why. Without understanding mechanisms, we cannot generalize or improve.

  3. Ethical constraints: We cannot randomize many important treatments. The set of randomizable questions may not include the most important ones.

  4. Local effects: LATE applies to compliers. Policy effects on a different population may differ.

  5. Implementation at scale: Experimental effects under careful implementation may vanish under routine delivery.

Deaton's Challenge

"The RCT is a useful tool, but I think that it is a mistake to put method ahead of substance. Suppose we had had RCTs of the effects of the demographic transition on economic development, or of democracy on development, or of the effect of the welfare state on poverty. While such RCTs might be attractive in principle, it is hard to imagine what they would look like in practice... we have learned a great deal about these topics without RCTs."

— Deaton (2010)

The response is not to abandon experiments but to use them wisely—combining experimental evidence with theory, mechanism, and judgment about external validity.

Running Example Connection: China's Growth

Deaton's critique resonates powerfully with our China running example. What explains China's post-1978 economic growth? We cannot randomly assign reform packages to countries, randomize WTO accession, or experimentally vary initial conditions. The question is simply not amenable to experimental methods—yet it may be the most important economic question of the past half-century. As Chapter 1 discusses, such questions require combining description, quasi-experimental evidence (like comparing Special Economic Zones), time series analysis, and theoretical reasoning. Experiments are powerful where they work, but they cannot answer every important question.


10.5 Practical Guidance

When Are Experiments Feasible?

  • Treatment: more feasible for a discrete, deliverable intervention; less feasible for structural change or a long-term process

  • Scale: more feasible for small groups and defined populations; less feasible for entire economies or political systems

  • Timeline: more feasible for short-term outcomes; less feasible for effects unfolding over decades

  • Ethics: more feasible when there is clear equipoise; less feasible when it would mean withholding known benefits

  • Control: more feasible when the researcher can assign treatment; less feasible when assignment is determined by politics or individual choice

When Are Experiments Ethical?

Ethical experiments require:

Equipoise: Genuine uncertainty about which treatment is better. If we knew treatment was beneficial, withholding it would be unethical.

Informed consent: Subjects understand the study and agree to participate. Waived only when risk is minimal and consent would compromise validity.

Proportionality: Expected benefits (knowledge gained) justify risks to subjects.

Fair subject selection: Burdens and benefits distributed equitably, not exploiting vulnerable populations.

Practical Box: Ethics Checklist

  • Is there genuine equipoise (real uncertainty about which treatment is better)?

  • Have subjects given informed consent, or is a waiver justified by minimal risk and validity concerns?

  • Do the expected benefits of the knowledge gained justify the risks to subjects?

  • Are burdens and benefits distributed fairly, without exploiting vulnerable populations?

Pre-Analysis Plans

Pre-analysis plans (PAPs) specify hypotheses, outcomes, and analysis methods before examining data:

Benefits:

  • Distinguish confirmatory from exploratory analysis

  • Prevent p-hacking and selective reporting

  • Increase credibility

Costs:

  • May discourage useful exploration

  • Difficult to anticipate all analyses

  • Not always feasible (secondary data)

Best practices:

  • Pre-specify primary outcomes and main analysis

  • Register plan publicly (AEA registry, OSF)

  • Report deviations transparently

  • Clearly label exploratory analyses

Common Pitfalls

Pitfall 1: Underpowered Studies

Many experiments lack power to detect plausible effects. A null result from an underpowered study tells us little.

How to avoid: Conduct power analysis before collecting data. Be realistic about minimum detectable effects.

Pitfall 2: Ignoring Clustering

Analyzing clustered data as if individually randomized inflates precision and produces misleading inference.

How to avoid: Cluster standard errors at the level of randomization. Account for clustering in power calculations.

Pitfall 3: Multiple Hypothesis Testing

Testing many outcomes without adjustment inflates false positive rates.

How to avoid: Pre-specify primary outcomes. Adjust for multiple testing (Bonferroni, FDR). Distinguish primary from exploratory.
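A minimal sketch of standard multiple-testing adjustments in R, applying p.adjust to a vector of illustrative p-values:

```r
# Illustrative p-values from tests on several secondary outcomes
p_raw <- c(0.004, 0.020, 0.041, 0.090, 0.300)

# Family-wise error rate control (Bonferroni) and false discovery rate (BH)
p_bonf <- p.adjust(p_raw, method = "bonferroni")
p_fdr  <- p.adjust(p_raw, method = "BH")

cbind(raw = p_raw, bonferroni = p_bonf, fdr_bh = p_fdr)
```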

Pitfall 4: Overgeneralizing

Concluding that an effect found in one context will hold everywhere.

How to avoid: Characterize the study population. Discuss mechanisms. Acknowledge limitations on external validity.

Pitfall 5: Balance Tests (t-Tests on Baseline Covariates)

A common but conceptually flawed practice: running t-tests on baseline covariates to "verify randomization worked." If randomization was conducted properly, it guarantees balance in expectation. Any statistically significant difference is, by construction, a Type I error—you will find "significant imbalance" 5% of the time by chance alone.

The problem:

  • With 20 baseline covariates, you expect 1 "significant" imbalance at \alpha = 0.05, even with perfect randomization

  • P-values from balance tests are uniformly distributed under the null (randomization was valid)

  • Conditioning analysis on balance test results introduces bias

What to do instead:

  • Report standardized differences (e.g., (|\bar{X}_1 - \bar{X}_0|)/\sigma) to assess magnitude of imbalance, not statistical significance

  • Use randomization inference to verify the randomization procedure was implemented correctly

  • If substantial imbalance exists, include the imbalanced covariate as a control variable—this is always valid in an experiment

Bottom line: Significant balance test results should not lead you to question your randomization (if properly implemented) or exclude the data. They're expected by chance.
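A minimal sketch of reporting a standardized difference rather than a balance-test p-value, using simulated baseline data (variable names are illustrative):

```r
set.seed(6)

n   <- 400
d   <- sample(rep(c(0, 1), n / 2))
age <- rnorm(n, mean = 35, sd = 10)    # a baseline covariate

# Standardized difference: mean gap scaled by the pooled SD
std_diff <- function(x, d) {
  m1 <- mean(x[d == 1]); m0 <- mean(x[d == 0])
  s  <- sqrt((var(x[d == 1]) + var(x[d == 0])) / 2)
  abs(m1 - m0) / s
}

std_diff(age, d)   # values below roughly 0.1 are usually treated as negligible
```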

Implementation Checklist

Before the experiment:

  • Conduct a power analysis using a realistic minimum detectable effect

  • Choose the level of randomization (individual or cluster) and a stratification plan

  • Write and publicly register a pre-analysis plan specifying primary outcomes

During implementation:

  • Monitor compliance and document how treatment is actually delivered

  • Track subjects to minimize attrition, and record who drops out and why

  • Watch for spillovers between treated and control units

Analysis:

  • Report the ITT effect; report TOT/LATE where receipt of treatment is of interest

  • Cluster standard errors at the level of randomization

  • Adjust for multiple testing and clearly label exploratory analyses

  • Report attrition and, where necessary, bounds (e.g., Lee bounds)


Qualitative Bridge

What Experiments Cannot Tell Us

Experiments establish that something works. They typically do not tell us:

  • Why it works (mechanism)

  • How it works in practice (implementation)

  • For whom it works best (without pre-specified subgroups)

  • Whether it will work elsewhere (external validity)

Qualitative methods can address these gaps.

Combining Experiments with Qualitative Methods

Process evaluation: Qualitative research alongside an experiment to understand implementation:

  • Were treatments delivered as planned?

  • How did subjects experience the intervention?

  • What barriers and facilitators emerged?

Mechanism investigation: Qualitative work to understand how effects occur:

  • Interviews with participants about behavior change

  • Observation of how interventions are used

  • Theory development to explain findings

Context understanding: Qualitative research to interpret findings:

  • Why did effects differ across sites?

  • What local factors shaped implementation?

  • How might results transfer to other contexts?

Example: Combining Methods in Microfinance Research

The microfinance RCTs found modest average effects but substantial heterogeneity. Qualitative research helped explain:

  • Some borrowers used loans productively; others for consumption smoothing

  • Local economic conditions shaped opportunities

  • Social dynamics affected repayment and borrowing decisions

  • Implementation quality varied across sites

This mixed-methods understanding is richer than either approach alone.


Integration Note

Connections to Other Chapters

  • Ch. 9 (Causal Framework): Experiments as the "gold standard" that other methods approximate

  • Ch. 11 (Selection on Observables): Experiments provide benchmarks for validating observational methods (LaLonde); when randomization is infeasible, SOO is an alternative

  • Ch. 12 (IV): Non-compliance creates IV interpretation; ITT/LATE distinction

  • Ch. 19 (Mechanisms): Experiments establish effects; mechanism analysis explains them

  • Ch. 20 (Heterogeneity): Experimental analysis of treatment effect heterogeneity

  • Ch. 24 (Evidence Synthesis): Meta-analysis aggregates experimental evidence

Triangulation with Non-Experimental Methods

Experimental evidence is strongest when triangulated with:

  1. Quasi-experimental evidence: Do natural experiments yield similar estimates?

  2. Observational evidence: Do selection-on-observables analyses align?

  3. Theoretical predictions: Is the effect size consistent with theory?

  4. Mechanism evidence: Do proposed mechanisms explain the findings?

Agreement across methods strengthens confidence. Disagreement prompts investigation.

Box: Experiments as Methodological Benchmarks

Beyond estimating treatment effects, experiments serve a second important function: validating nonexperimental methods. When experimental and nonexperimental estimates exist for the same question, comparing them reveals whether observational methods work in that setting.

The landmark example is LaLonde (1986). The National Supported Work demonstration randomly assigned participants to a job training program, providing an experimental benchmark. LaLonde then asked: Can nonexperimental methods—applied to the same treated individuals but using comparison groups from surveys—replicate the experimental estimate?

His answer was discouraging: nonexperimental estimates varied wildly and often had the wrong sign. This finding helped catalyze the credibility revolution's emphasis on research design over statistical adjustment.

Four decades of subsequent research have refined this picture (Imbens & Xu 2025):

  • Overlap is essential: With comparable treated and control groups, modern methods (matching, IPW, AIPW) yield stable estimates. Without overlap, estimates depend heavily on extrapolation.

  • Stable estimates ≠ valid estimates: Several modern methods produce ATT estimates close to the LaLonde experimental benchmark—seemingly vindicating propensity score methods. But placebo tests (estimating "effects" on pre-treatment outcomes) fail badly, suggesting unconfoundedness does not hold.

  • The sobering lesson: Methods that robustly estimate the statistical estimand may still fail to recover the causal estimand. Passing the benchmark test in one sample doesn't guarantee validity in another.

This "within-study comparison" design—using experiments to test nonexperimental methods—has become a methodological research program of its own. It offers the field honest feedback about what our methods can and cannot do.

See Chapter 11 for full discussion of selection on observables and the LaLonde benchmark.


Summary

Key takeaways:

  1. Random assignment breaks confounding: By making treatment independent of potential outcomes, experiments enable simple comparisons to identify causal effects.

  2. Design matters: Power analysis, stratification, and proper clustering are essential for valid and efficient experiments.

  3. Threats require attention: Attrition, non-compliance, and spillovers can compromise even randomized experiments. ITT is always identified; TOT requires additional assumptions.

  4. External validity is a challenge: Experimental effects in specific contexts may not generalize. Theory, replication, and mechanism understanding help bridge this gap.

  5. Experiments are one tool among many: Powerful for some questions, infeasible or uninformative for others. The best research uses experiments where appropriate and complements them with other evidence.

Returning to the opening question: We can randomize our way to causal knowledge when experiments are feasible, ethical, and properly designed. But randomization does not solve all problems: we must still address attrition, compliance, spillovers, and external validity. The credibility of experimental evidence comes not from the method alone, but from careful design, transparent reporting, and thoughtful interpretation of what experiments can and cannot tell us.


Further Reading

Essential

  • Duflo, Glennerster & Kremer (2007). "Using Randomization in Development Economics Research." Handbook of Development Economics. The field experiment toolkit.

  • Gerber & Green (2012). Field Experiments: Design, Analysis, and Interpretation. Comprehensive textbook.

For Deeper Understanding

  • Athey & Imbens (2017). "The Econometrics of Randomized Experiments." Handbook of Economic Field Experiments.

  • Bruhn & McKenzie (2009). "In Pursuit of Balance." American Economic Journal: Applied.

Critical Perspectives

  • Deaton (2010). "Instruments, Randomization, and Learning about Development." JEL. The critique.

  • Heckman & Smith (1995). "Assessing the Case for Social Experiments." JEP. Limitations of experiments.

  • Banerjee et al. (2017). "A Theory of Experimenters." Response to criticism.

Applications

  • Banerjee et al. (2015). "The Miracle of Microfinance?" Six-country study.

  • Finkelstein et al. (2012). "The Oregon Health Insurance Experiment." Health insurance effects.

  • Chetty, Friedman & Rockoff (2014). "Measuring the Impacts of Teachers." Value-added with experimental validation.

Experiments as Methodological Benchmarks

  • Imbens & Xu (2025). "Comparing Experimental and Nonexperimental Methods: What Lessons Have We Learned Four Decades after LaLonde (1986)?" Journal of Economic Perspectives 39(4): 173–202. Essential retrospective on using experiments to validate observational methods.

  • LaLonde (1986). "Evaluating the Econometric Evaluations of Training Programs with Experimental Data." AER. The original within-study comparison.

  • Cook, Shadish & Wong (2008). "Three Conditions Under Which Experiments and Observational Studies Produce Comparable Causal Estimates." Journal of Policy Analysis and Management. When do methods agree?


Exercises

Conceptual

  1. Explain why random assignment ensures that E[Y(0) | D = 1] = E[Y(0) | D = 0]. What would violate this equality even with random assignment?

  2. A researcher conducts an experiment with 50% compliance in the treatment arm and 10% "contamination" (control subjects obtaining treatment).

    • What is the ITT effect if mean outcomes are 100 in the treatment arm and 90 in control?

    • What is the TOT/LATE estimate?

    • What assumption is required for the TOT interpretation?

Applied

  1. Design a field experiment to test whether text message reminders increase savings among low-income households.

    • What is the treatment and control?

    • Would you randomize at the individual or household level? Why?

    • What outcomes would you measure?

    • Conduct a power calculation assuming 20% outcome standard deviation, 5% minimum detectable effect, 80% power, and ICC of 0.05 if clustering.

  2. A job training experiment finds large positive effects (20% earnings increase) in a pilot with motivated volunteers at high-performing training centers. Discuss five reasons why effects might differ when the program is scaled nationally.

Discussion

  1. "Randomized experiments are the gold standard for causal inference." Critically evaluate this claim. Under what circumstances might non-experimental evidence be more valuable than experimental evidence for informing policy?


Technical Appendix

A.1 Variance of Difference-in-Means Estimator

For the estimator \hat{\tau} = \bar{Y}_1 - \bar{Y}_0:

\text{Var}(\hat{\tau}) = \frac{\sigma_1^2}{n_1} + \frac{\sigma_0^2}{n_0}

Under homoskedasticity (\sigma_1^2 = \sigma_0^2 = \sigma^2) and equal allocation (n_1 = n_0 = n/2):

\text{Var}(\hat{\tau}) = \frac{4\sigma^2}{n}

A.2 Design Effect for Clustering

The design effect (DEFF) measures variance inflation from clustering:

\text{DEFF} = 1 + (k - 1)\rho

where k is the average cluster size and \rho is the ICC.

The effective sample size is:

n_{eff} = \frac{n}{\text{DEFF}} = \frac{mk}{1 + (k-1)\rho}

A.3 Lee Bounds for Attrition

If attrition is monotonic (treatment only increases or only decreases attrition), Lee bounds trim the group with lower attrition to equalize attrition rates.

Let p be the proportion to trim. Lower and upper bounds on the treatment effect are:

[\hat{\tau}_L, \hat{\tau}_U] = \left[\bar{Y}_1^{trimmed,low} - \bar{Y}_0,\; \bar{Y}_1^{trimmed,high} - \bar{Y}_0\right]

where the trimmed means remove the top or bottom p fraction of treatment-group outcomes.
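A minimal sketch of the Lee-bounds trimming step, assuming the treatment group has less attrition and outcomes are observed only for non-attriters (all values are simulated and illustrative):

```r
set.seed(7)

# Simulated observed (non-attrited) outcomes by arm
y1 <- rnorm(450, mean = 1.2)   # treatment arm: 90% of 500 retained
y0 <- rnorm(400, mean = 1.0)   # control arm:   80% of 500 retained

retain1 <- 450 / 500
retain0 <- 400 / 500
p <- (retain1 - retain0) / retain1   # fraction of treated outcomes to trim

# Trim the arm with less attrition (here, treatment) from above and below
y1_low  <- y1[y1 <= quantile(y1, 1 - p)]   # drop top p fraction -> lower bound
y1_high <- y1[y1 >= quantile(y1, p)]       # drop bottom p fraction -> upper bound

c(lower = mean(y1_low) - mean(y0),
  upper = mean(y1_high) - mean(y0))
```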
