Chapter 10: Experimental Methods
Opening Question
When can we randomize our way to causal knowledge—and when does randomization fall short?
Chapter Overview
Randomized experiments hold a privileged place in causal inference. By randomly assigning treatment, we break the link between treatment and confounders, allowing simple comparisons to reveal causal effects. No other method achieves this so directly.
Yet experiments are not a panacea. They may be infeasible, unethical, or too costly. When feasible, they face threats from attrition, non-compliance, and spillovers. Even a perfect experiment may tell us little about effects in other contexts. The 2019 Nobel Prize in Economics to Banerjee, Duflo, and Kremer recognized the power of field experiments in development—but also sparked renewed debate about their limitations.
This chapter covers the design, analysis, and interpretation of randomized experiments. We explain why randomization works, how to design experiments well, what can go wrong, and when experiments are the right tool. The goal is neither uncritical enthusiasm nor reflexive skepticism, but a clear-eyed understanding of what experiments can and cannot deliver.
What you will learn:
Why random assignment identifies causal effects
How to design experiments: power analysis, stratification, clustering
The distinction between ITT and treatment-on-treated effects
Major threats to experimental validity and how to address them
When experiments are feasible, ethical, and informative
Prerequisites: Chapter 9 (The Causal Framework)
10.1 Why Randomization Works
The Logic of Random Assignment
Recall from Chapter 9 the fundamental problem: we cannot observe the same unit in both treatment and control states. Causal inference requires assumptions about the unobserved counterfactuals.
Randomization solves this problem elegantly. If we randomly assign units to treatment and control, the two groups are comparable in expectation. Any differences in outcomes can be attributed to treatment—not to pre-existing differences between groups.
Formally, random assignment ensures:
Independence Assumption
$$(Y(0), Y(1)) \perp D$$
Potential outcomes are independent of treatment assignment. Treatment is unrelated to who would benefit (or be harmed) by it.
Under this assumption, the selection bias term from Chapter 9 vanishes:
$$E[Y(0) \mid D=1] = E[Y(0) \mid D=0] = E[Y(0)]$$
The treated and control groups have the same expected potential outcomes under control. Any difference in observed outcomes reflects the treatment effect.

The Simple Difference in Means
With random assignment, causal effects are identified by simple comparisons:
$$\hat{\tau} = \bar{Y}_1 - \bar{Y}_0 = \frac{1}{n_1}\sum_{i:\,D_i=1} Y_i - \frac{1}{n_0}\sum_{i:\,D_i=0} Y_i$$
This difference-in-means estimator is:
Unbiased: $E[\hat{\tau}] = \text{ATE}$
Consistent: $\hat{\tau} \overset{p}{\to} \text{ATE}$ as $n \to \infty$
Simple: No modeling assumptions required
The variance is: $\text{Var}(\hat{\tau}) = \dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_0^2}{n_0}$
where $\sigma_1^2$ and $\sigma_0^2$ are the outcome variances in treatment and control.
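To make this concrete, here is a minimal R sketch with simulated data (the data-generating process and variable names are illustrative, not from the chapter):
```r
# Difference in means and its standard error in a simulated experiment
set.seed(42)
n <- 1000
d <- rbinom(n, 1, 0.5)            # random assignment
y <- 2 + 1.5 * d + rnorm(n)       # outcome with a true effect of 1.5

tau_hat <- mean(y[d == 1]) - mean(y[d == 0])
se_hat  <- sqrt(var(y[d == 1]) / sum(d == 1) + var(y[d == 0]) / sum(d == 0))
c(estimate = tau_hat, std_error = se_hat)
```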
Regression Analysis of Experiments
Although simple differences suffice, regression is often used:
$$Y_i = \alpha + \tau D_i + \varepsilon_i$$
With random assignment, the OLS estimate $\hat{\tau}$ equals the difference in means. Why use regression?
Convenience: Standard errors and tests come automatically
Covariates: Adding baseline covariates can improve precision
Subgroups: Interactions allow heterogeneity analysis
Adding Covariates in Experiments
$$Y_i = \alpha + \tau D_i + X_i'\beta + \varepsilon_i$$
With random assignment, covariates do not change τ^ in expectation—but they can reduce variance and improve precision. This is because covariates explain some of the outcome variation, reducing residual variance.
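A small simulated example (illustrative parameters) shows the precision gain: both regressions recover the same treatment effect, but the covariate-adjusted regression has a smaller standard error.
```r
# Covariate adjustment in an experiment: same estimand, more precision
set.seed(1)
n <- 1000
x <- rnorm(n)                      # baseline covariate that predicts the outcome
d <- rbinom(n, 1, 0.5)             # random assignment, independent of x
y <- 1 + 0.5 * d + 2 * x + rnorm(n)

fit_simple   <- lm(y ~ d)          # difference in means via OLS
fit_adjusted <- lm(y ~ d + x)      # adds the baseline covariate

summary(fit_simple)$coefficients["d", ]
summary(fit_adjusted)$coefficients["d", ]   # similar estimate, smaller SE
```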
What Random Assignment Does Not Do
Random assignment is powerful, but it does not solve all problems:
It does not ensure balance in finite samples: Random assignment balances groups in expectation, but any particular randomization may produce imbalanced groups by chance.
It does not prevent attrition: If subjects drop out differentially by treatment status, balance is compromised.
It does not prevent non-compliance: Subjects may not take the treatment they are assigned.
It does not prevent spillovers: Treatment may affect control group outcomes.
It does not establish external validity: Effects in the experimental sample may not generalize.
The rest of this chapter addresses these challenges.
10.2 Experimental Design
Sample Size and Power
Statistical power is the probability of detecting a true effect. Underpowered experiments waste resources and may produce misleading null results.
Power depends on:
Effect size (τ): Larger effects are easier to detect
Sample size (n): More observations increase power
Outcome variance (σ2): Less noisy outcomes increase power
Significance level (α): Lower α reduces power
The standard power formula for a two-sample t-test:
$$n = \frac{2(z_{1-\alpha/2} + z_{1-\beta})^2 \sigma^2}{\tau^2}$$
where:
$z_{1-\alpha/2}$ is the critical value for significance level $\alpha$
$z_{1-\beta}$ is the critical value for power $1-\beta$
$\sigma^2$ is the outcome variance
$\tau$ is the minimum detectable effect
Example: Power Calculation
Suppose we want 80% power ($\beta = 0.20$) at the 5% level ($\alpha = 0.05$) to detect an effect of 0.2 standard deviations (so $\sigma = 1$ and $\tau = 0.2$).
With $z_{0.975} = 1.96$ and $z_{0.80} = 0.84$:
$$n = \frac{2(1.96 + 0.84)^2}{0.2^2} = \frac{2 \times 7.84}{0.04} = 392$$
We need about 400 subjects per arm (800 total).
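The same calculation can be run in R. The built-in power.t.test uses the t distribution, so it returns a slightly larger n than the normal-approximation formula above; the values below correspond to the example's assumptions.
```r
# Power calculation for a two-sample t-test (effect of 0.2 SD, 5% level, 80% power)
power.t.test(delta = 0.2, sd = 1, sig.level = 0.05, power = 0.80)  # about 393 per arm

# The normal-approximation formula used above
n_per_arm <- 2 * (qnorm(0.975) + qnorm(0.80))^2 * 1 / 0.2^2
n_per_arm  # about 392
```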

Stratification and Blocking
Stratification (or blocking) improves precision by ensuring balance on key covariates:
Divide subjects into strata based on baseline characteristics
Randomize separately within each stratum
Analyze with stratum fixed effects or weights
Benefits:
Guarantees balance on stratifying variables
Reduces variance, increasing power
Enables subgroup analysis
Practical Box: What to Stratify On
Stratify on variables that:
Strongly predict the outcome
Might be imbalanced by chance
Define subgroups of interest
Common choices: baseline outcome, geographic region, demographics. Don't over-stratify—too many strata with few units per stratum creates problems.
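A minimal sketch of block randomization in base R, using illustrative strata (region crossed with a median split of the baseline outcome):
```r
# Block (stratified) randomization: randomize separately within each stratum
set.seed(7)
df <- data.frame(id = 1:200,
                 region = sample(c("north", "south"), 200, replace = TRUE),
                 baseline = rnorm(200))

# Strata formed from region and a median split of the baseline outcome
df$stratum <- interaction(df$region, df$baseline > median(df$baseline))

# Assign (roughly) half of each stratum to treatment
assign_half <- function(n) sample(rep(c(0, 1), length.out = n))
df$treat <- ave(rep(0, nrow(df)), df$stratum,
                FUN = function(x) assign_half(length(x)))

table(df$stratum, df$treat)   # balanced on the stratifying variables by construction
```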

Cluster Randomization
Sometimes individual randomization is infeasible or undesirable. Cluster randomization assigns treatment at the group level:
Schools, not students
Villages, not households
Clinics, not patients
Why cluster?
Logistical necessity (cannot treat individuals differently within a classroom)
Reduce spillovers (if treatment affects nearby individuals)
Policy relevance (policies often implemented at group level)
The cost of clustering: Effective sample size is reduced. With m clusters of size k:
$$n_{\text{eff}} = \frac{mk}{1 + (k-1)\rho}$$
where ρ is the intraclass correlation coefficient (ICC)—the fraction of variance between clusters.
Example: Clustering Dramatically Reduces Power
With 100 clusters of 50 students each (5,000 total) and ICC = 0.10:
$$n_{\text{eff}} = \frac{5000}{1 + 49 \times 0.10} = \frac{5000}{5.9} \approx 847$$
The effective sample size is only 847, not 5,000. Power calculations must account for this.
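A small helper function (illustrative sketch) makes this adjustment easy to reuse in power calculations:
```r
# Effective sample size under cluster randomization
eff_n <- function(m, k, icc) {
  (m * k) / (1 + (k - 1) * icc)   # m clusters of size k; icc is the intraclass correlation
}
eff_n(m = 100, k = 50, icc = 0.10)   # about 847, not 5000
```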
Randomization Inference
Classical inference assumes random sampling from a population and relies on asymptotic approximations. Randomization inference (also called permutation inference or Fisher's exact test) derives p-values from the randomization itself—valid even with tiny samples.
The Sharp Null Hypothesis: H0:Yi(1)=Yi(0) for all units i
Under this null, treatment has zero effect on every unit. This is stronger than the usual null of zero average effect.
The Logic: Under the sharp null, each unit's outcome is fixed regardless of treatment assignment. The only randomness comes from which units received treatment. We can compute what the test statistic would have been under every possible randomization.
Algorithm: Randomization Inference
1. Compute the test statistic (e.g., the difference in means) under the actual assignment.
2. Impose the sharp null: each unit's observed outcome is taken as its outcome under either assignment.
3. Re-randomize assignment many times (or enumerate all possible assignments), following the original randomization procedure, and recompute the test statistic each time.
4. The p-value is the share of re-randomized statistics at least as extreme as the observed one.
Implementation in R: the sketch below assumes a completely randomized design; with stratification or clustering, the re-randomization in step 3 should mimic the actual procedure.
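```r
# Randomization inference: a minimal sketch for a completely randomized design
ri_pvalue <- function(y, d, n_perm = 10000) {
  obs <- mean(y[d == 1]) - mean(y[d == 0])         # observed difference in means
  perm_stats <- replicate(n_perm, {
    d_perm <- sample(d)                            # re-randomize assignment labels
    mean(y[d_perm == 1]) - mean(y[d_perm == 0])    # statistic under the sharp null
  })
  mean(abs(perm_stats) >= abs(obs))                # two-sided permutation p-value
}

# Illustrative data: a small sample where asymptotic approximations are suspect
set.seed(3)
d <- rbinom(40, 1, 0.5)
y <- 1 + 0.8 * d + rnorm(40)
ri_pvalue(y, d)
```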
Key advantages:
Valid in finite samples (no asymptotic approximation)
Does not require distributional assumptions
Automatically accounts for actual randomization procedure (clustering, stratification, blocking)
Particularly valuable when n is small or distributions are skewed
Limitation: Tests the sharp null of zero effect for everyone, not just zero average effect. Rejection means treatment affected at least some units, not necessarily that the average effect is nonzero.
Optimal and Adaptive Designs
Standard experiments assign treatment with fixed probability (often 50-50). More sophisticated designs can improve efficiency or serve multiple goals simultaneously.
Optimal treatment allocation
When groups have different outcome variances or sizes, equal allocation is suboptimal. The variance-minimizing allocation assigns more treated units to groups with higher outcome variance:
$$p_g^* \propto \sigma_g$$
Example: In a wage experiment, if wage variance is higher for college graduates than non-graduates, allocate more treatment to college graduates.
Neyman Allocation for Stratified Experiments
With strata g=1,…,G, the optimal within-stratum treatment proportion minimizes variance of the overall ATE estimator. This allocates more observations to strata with higher outcome variance or larger population share.
Multi-arm experiments
When comparing multiple treatments (not just treatment vs. control), power calculations must account for multiple comparisons. Common designs:
Factorial designs: Test combinations of factors efficiently (e.g., price × messaging)
Balanced incomplete block designs: Not everyone sees every treatment, but comparisons are balanced
Adaptive experiments (bandits)
In some settings, we want to learn while deciding—assigning treatment based on accumulating evidence:
Box: Multi-Armed Bandits and Adaptive Assignment
In a bandit problem, assignment probabilities adjust based on observed outcomes. Arms (treatments) that appear more effective get assigned more often.
Thompson Sampling: Maintain a posterior distribution over each arm's effectiveness; sample from posteriors and assign the arm with the highest sampled value.
Explore-exploit tradeoff: Early in the experiment, explore all arms to learn their effects. Later, exploit by assigning the best-performing arm more often.
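A minimal sketch of Thompson sampling for two arms with binary outcomes (a Beta-Bernoulli model; the success probabilities below are illustrative):
```r
# Thompson sampling: assign each arm in proportion to its posterior chance of being best
set.seed(11)
true_rates <- c(0.10, 0.15)          # unknown success probabilities of the two arms
successes <- failures <- c(0, 0)     # Beta(1, 1) priors

for (t in 1:1000) {
  draws <- rbeta(2, successes + 1, failures + 1)   # one draw from each arm's posterior
  arm <- which.max(draws)                          # assign the arm with the highest draw
  outcome <- rbinom(1, 1, true_rates[arm])
  successes[arm] <- successes[arm] + outcome
  failures[arm]  <- failures[arm] + (1 - outcome)
}

successes + failures   # assignments per arm: the better arm dominates over time
```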
When to use adaptive designs:
Ethical imperative to minimize harm (clinical trials)
Objective is welfare during the experiment, not just learning
Large-scale tech experiments where regret is costly
Limitations for causal inference:
Inference is more complex (assignment depends on previous outcomes)
May not have enough data on some arms for precise estimates
Standard confidence intervals don't apply without adjustment
Key references: Russo et al. (2018) for bandits; Hadad et al. (2021) for inference in adaptive experiments.
10.3 Types of Experiments
Laboratory Experiments
Lab experiments bring subjects to a controlled environment:
Advantages:
Maximum control over environment and treatment
Can implement complex treatments and measure fine-grained outcomes
Relatively cheap per observation
Limitations:
Artificial setting may not reflect real-world behavior
Subject pools (often students) may not be representative
Demand effects: subjects may behave to please the experimenter
Hawthorne effects: behavior changes from being observed
Common applications:
Behavioral economics (risk, time, social preferences)
Psychology (cognition, decision-making)
Game theory (strategic behavior)
Field Experiments
Field experiments randomize in real-world settings:
Advantages:
Natural environment and real stakes
Outcomes are actual behavior, not stated intentions
Policy-relevant by design
Limitations:
Less control over implementation
More expensive and time-consuming
Ethical and logistical constraints
May still have limited external validity
Examples:
Microfinance RCTs across developing countries
Get-out-the-vote experiments
Resume audit studies for discrimination
Running Example: Microfinance
Banerjee et al. (2015) conducted randomized evaluations of microfinance in six countries. Villages were randomly assigned to receive microfinance access or not. The studies found modest effects on business outcomes, no transformation of poverty, and considerable heterogeneity across contexts.
This example illustrates both the power of experiments (clean identification of average effects) and their limitations (effects varied across sites; mechanisms remained unclear).
Meager (2019) later showed how to synthesize these experiments using Bayesian hierarchical models. Her analysis revealed that roughly 60% of the observed cross-study variation was sampling error—the studies were actually more consistent than they appeared. The framework also enabled prediction of effects in new sites, directly addressing external validity. This demonstrates how multiple experiments can be combined to learn more than any single study could reveal (see Chapter 24).
Natural Field Experiments
Some studies embed randomization in naturally occurring processes without subjects knowing they are in an experiment:
Audit studies (send resumes with randomly varied characteristics)
Direct mail experiments (randomly vary marketing messages)
A/B testing on platforms
These combine the control of experiments with the naturalness of observational settings.
Online Experiments and A/B Testing
Digital platforms enable massive-scale experimentation:
Advantages:
Huge sample sizes at low cost
Rapid iteration
Precise measurement of digital outcomes
Limitations:
Outcomes limited to platform behavior
User populations may not be representative
Ethical concerns about manipulation without consent
Multiple simultaneous tests create inference problems
Example: A/B Testing at Scale
Tech companies run thousands of experiments yearly. A typical test might randomize users to different website layouts and measure click-through rates. Sample sizes can be millions, detecting effects of 0.1% with high precision.
But: Effect sizes in user experience are often tiny. Statistically significant effects may be practically meaningless.
Survey Experiments
Survey experiments randomize within questionnaires:
Randomly vary question wording, order, or information provided
Measure how responses change
Applications:
Measuring sensitive attitudes (list experiments, endorsement experiments)
Testing framing effects
Conjoint analysis for preferences
Advantages: Cheap, fast, large samples
Limitations: Measures stated preferences, not behavior
10.4 Threats to Validity
Internal Validity Threats
Internal validity asks: Did the experiment correctly estimate the causal effect for the study population?

Attrition
Subjects who drop out may differ from those who remain. If attrition relates to treatment, balance is compromised.
Diagnosis:
Compare attrition rates across arms
Compare baseline characteristics of attriters vs. completers
Test whether attrition predicts treatment
Solutions:
Minimize attrition through design (incentives, tracking)
Report bounds under different assumptions about missing outcomes
Lee bounds: Trim the group with less attrition to restore balance
Non-Compliance
Subjects may not take the treatment they are assigned:
One-sided: Some assigned to treatment don't take it
Two-sided: Some assigned to control take treatment anyway
With non-compliance, we distinguish:
Intent-to-Treat (ITT)
$$\text{ITT} = E[Y \mid Z=1] - E[Y \mid Z=0]$$
The effect of assignment to treatment, regardless of actual treatment.
Treatment-on-Treated (TOT) / LATE
$$\text{TOT} = \frac{\text{ITT}}{\text{Compliance Rate}} = \frac{E[Y \mid Z=1] - E[Y \mid Z=0]}{E[D \mid Z=1] - E[D \mid Z=0]}$$
The effect for those who comply with their assignment. This is the LATE from Chapter 12.
ITT is always identified. TOT requires additional assumptions: monotonicity (no defiers) and the exclusion restriction (assignment affects outcomes only through treatment receipt). The resulting effect applies to compliers only.
When to Report What
ITT: Always report. It is the effect of the policy (offering treatment).
TOT: Report when the effect of actually receiving treatment is of interest. Be clear about the complier population.
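The simulation below (illustrative parameters, one-sided non-compliance) shows the ITT and the TOT/Wald calculation side by side:
```r
# ITT and TOT (Wald) estimates under one-sided non-compliance
set.seed(5)
n <- 5000
z <- rbinom(n, 1, 0.5)            # random assignment
complier <- rbinom(n, 1, 0.6)     # 60% would take up treatment if offered
d <- z * complier                 # actual treatment received
y <- 1 + 2 * d + rnorm(n)         # effect of 2 for those actually treated

itt <- mean(y[z == 1]) - mean(y[z == 0])
compliance <- mean(d[z == 1]) - mean(d[z == 0])
tot <- itt / compliance
c(ITT = itt, compliance = compliance, TOT = tot)   # TOT is roughly 2
```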

Spillovers and Interference
The Stable Unit Treatment Value Assumption (SUTVA) requires that one unit's treatment doesn't affect another's outcome. Spillovers violate SUTVA:
Vaccination reduces disease for unvaccinated (positive spillover)
Job training helps trainees but may hurt non-trainees competing for jobs (negative spillover)
Information interventions spread through networks
Diagnosis:
Test whether control outcomes vary with treatment density nearby
Measure outcomes for indirect contacts
Solutions:
Cluster randomization at a level where spillovers are contained
Design studies to estimate spillover effects directly
Partial population experiments: randomize treatment intensity
Hawthorne and Experimenter Effects
Being in an experiment may change behavior:
Hawthorne effect: Subjects behave differently because they're observed
Experimenter demand: Subjects try to confirm what they think the experimenter wants
Mitigation:
Blind subjects to treatment when possible (placebo controls)
Minimize salience of observation
Natural field experiments where subjects don't know they're in a study
External Validity Threats
External validity asks: Do findings generalize beyond the study?
Sample Selection
Experimental samples are often unrepresentative:
Volunteers differ from non-volunteers
Sites are chosen for feasibility, not representativeness
Developing country field experiments may not inform developed country policy
Site Selection
Researchers may choose sites where effects are likely large:
Programs implemented by capable NGOs
Motivated local partners
Contexts where treatment should work
Results may not replicate under routine implementation.
Equilibrium Effects
Small-scale experiments cannot capture market-wide effects:
A job training program helps trainees but would depress wages if scaled
A school voucher experiment cannot reveal effects of universal vouchers
The Deaton Critique
Angus Deaton (2010) articulated several concerns about the RCT movement:
External validity: Experiments identify effects in specific contexts. Without theory, we cannot extrapolate.
Mechanisms: Experiments show that something works, not why. Without understanding mechanisms, we cannot generalize or improve.
Ethical constraints: We cannot randomize many important treatments. The set of randomizable questions may not include the most important ones.
Local effects: LATE applies to compliers. Policy effects on a different population may differ.
Implementation at scale: Experimental effects under careful implementation may vanish under routine delivery.
Deaton's Challenge
"The RCT is a useful tool, but I think that it is a mistake to put method ahead of substance. Suppose we had had RCTs of the effects of the demographic transition on economic development, or of democracy on development, or of the effect of the welfare state on poverty. While such RCTs might be attractive in principle, it is hard to imagine what they would look like in practice... we have learned a great deal about these topics without RCTs."
— Deaton (2010)
The response is not to abandon experiments but to use them wisely—combining experimental evidence with theory, mechanism, and judgment about external validity.
Running Example Connection: China's Growth
Deaton's critique resonates powerfully with our China running example. What explains China's post-1978 economic growth? We cannot randomly assign reform packages to countries, randomize WTO accession, or experimentally vary initial conditions. The question is simply not amenable to experimental methods—yet it may be the most important economic question of the past half-century. As Chapter 1 discusses, such questions require combining description, quasi-experimental evidence (like comparing Special Economic Zones), time series analysis, and theoretical reasoning. Experiments are powerful where they work, but they cannot answer every important question.
10.5 Practical Guidance
When Are Experiments Feasible?
| Dimension | Experiments feasible | Experiments difficult or infeasible |
| --- | --- | --- |
| Treatment | Discrete, deliverable intervention | Structural change, long-term process |
| Scale | Small groups, defined populations | Entire economies, political systems |
| Timeline | Short-term outcomes | Effects over decades |
| Ethics | Clear equipoise | Withholding known benefits |
| Control | Researcher can assign | Determined by politics, choice |
When Are Experiments Ethical?
Ethical experiments require:
Equipoise: Genuine uncertainty about which treatment is better. If we knew treatment was beneficial, withholding it would be unethical.
Informed consent: Subjects understand the study and agree to participate. Waived only when risk is minimal and consent would compromise validity.
Proportionality: Expected benefits (knowledge gained) justify risks to subjects.
Fair subject selection: Burdens and benefits distributed equitably, not exploiting vulnerable populations.
Practical Box: Ethics Checklist
Is there genuine uncertainty (equipoise) about which arm is better?
Have subjects given informed consent, or is a waiver justified by minimal risk and necessity for validity?
Do the expected benefits of the knowledge gained justify the risks to subjects?
Are burdens and benefits distributed fairly, without exploiting vulnerable populations?
Pre-Analysis Plans
Pre-analysis plans (PAPs) specify hypotheses, outcomes, and analysis methods before examining data:
Benefits:
Distinguish confirmatory from exploratory analysis
Prevent p-hacking and selective reporting
Increase credibility
Costs:
May discourage useful exploration
Difficult to anticipate all analyses
Not always feasible (secondary data)
Best practices:
Pre-specify primary outcomes and main analysis
Register plan publicly (AEA registry, OSF)
Report deviations transparently
Clearly label exploratory analyses
Common Pitfalls
Pitfall 1: Underpowered Studies
Many experiments lack power to detect plausible effects. A null result from an underpowered study tells us little.
How to avoid: Conduct power analysis before collecting data. Be realistic about minimum detectable effects.
Pitfall 2: Ignoring Clustering
Analyzing clustered data as if individually randomized inflates precision and produces misleading inference.
How to avoid: Cluster standard errors at the level of randomization. Account for clustering in power calculations.
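A sketch of the fix in R, assuming the sandwich and lmtest packages are available (estimatr::lm_robust is an alternative):
```r
# Cluster-robust standard errors at the level of randomization
library(sandwich)
library(lmtest)

set.seed(9)
m <- 40; k <- 25                                  # 40 clusters of 25 units each
cluster <- rep(1:m, each = k)
treat   <- rep(rbinom(m, 1, 0.5), each = k)       # treatment assigned at the cluster level
y <- 1 + 0.3 * treat + rep(rnorm(m), each = k) + rnorm(m * k)

fit <- lm(y ~ treat)
coeftest(fit)                                          # naive SEs overstate precision
coeftest(fit, vcov = vcovCL(fit, cluster = cluster))   # clustered SEs, typically much larger
```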
Pitfall 3: Multiple Hypothesis Testing
Testing many outcomes without adjustment inflates false positive rates.
How to avoid: Pre-specify primary outcomes. Adjust for multiple testing (Bonferroni, FDR). Distinguish primary from exploratory.
Pitfall 4: Overgeneralizing
Concluding that an effect found in one context will hold everywhere.
How to avoid: Characterize the study population. Discuss mechanisms. Acknowledge limitations on external validity.
Pitfall 5: Balance Tests (t-Tests on Baseline Covariates)
A common but conceptually flawed practice: running t-tests on baseline covariates to "verify randomization worked." If randomization was conducted properly, it guarantees balance in expectation. Any statistically significant difference is, by construction, a Type I error—you will find "significant imbalance" 5% of the time by chance alone.
The problem:
With 20 baseline covariates, you expect 1 "significant" imbalance at α=0.05, even with perfect randomization
P-values from balance tests are uniformly distributed under the null (randomization was valid)
Conditioning analysis on balance test results introduces bias
What to do instead:
Report standardized differences (e.g., $|\bar{X}_1 - \bar{X}_0| / \sigma$, where $\sigma$ is a pooled standard deviation) to assess the magnitude of imbalance, not statistical significance (see the code sketch below)
Use randomization inference to verify the randomization procedure was implemented correctly
If substantial imbalance exists, include the imbalanced covariate as a control variable—this is always valid in an experiment
Bottom line: Significant balance test results should not lead you to question your randomization (if properly implemented) or exclude the data. They're expected by chance.
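A sketch of the standardized-difference calculation (this version uses a pooled standard deviation in the denominator, one common convention):
```r
# Standardized differences for baseline covariates: magnitude, not significance
std_diff <- function(x, d) {
  abs(mean(x[d == 1]) - mean(x[d == 0])) /
    sqrt((var(x[d == 1]) + var(x[d == 0])) / 2)
}

set.seed(2)
d     <- rbinom(500, 1, 0.5)
age   <- rnorm(500, mean = 40, sd = 12)
score <- rnorm(500)
sapply(list(age = age, score = score), std_diff, d = d)
# values below about 0.1 are usually considered negligible imbalance
```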
Implementation Checklist
Before the experiment: conduct a power analysis; choose the level of randomization; stratify on strong predictors of the outcome; register a pre-analysis plan; secure ethical approval.
During implementation: monitor compliance, attrition, and fidelity of treatment delivery; watch for spillovers; document deviations from the protocol.
Analysis: report the ITT estimate; cluster standard errors at the level of randomization; adjust for multiple hypothesis testing; report attrition and bounds where relevant; label exploratory analyses clearly.
Qualitative Bridge
What Experiments Cannot Tell Us
Experiments establish that something works. They typically do not tell us:
Why it works (mechanism)
How it works in practice (implementation)
For whom it works best (without pre-specified subgroups)
Whether it will work elsewhere (external validity)
Qualitative methods can address these gaps.
Combining Experiments with Qualitative Methods
Process evaluation: Qualitative research alongside an experiment to understand implementation:
Were treatments delivered as planned?
How did subjects experience the intervention?
What barriers and facilitators emerged?
Mechanism investigation: Qualitative work to understand how effects occur:
Interviews with participants about behavior change
Observation of how interventions are used
Theory development to explain findings
Context understanding: Qualitative research to interpret findings:
Why did effects differ across sites?
What local factors shaped implementation?
How might results transfer to other contexts?
Example: Combining Methods in Microfinance Research
The microfinance RCTs found modest average effects but substantial heterogeneity. Qualitative research helped explain:
Some borrowers used loans productively; others for consumption smoothing
Local economic conditions shaped opportunities
Social dynamics affected repayment and borrowing decisions
Implementation quality varied across sites
This mixed-methods understanding is richer than either approach alone.
Integration Note
Connections to Other Chapters
| Chapter | Connection |
| --- | --- |
| Ch. 9 (Causal Framework) | Experiments as the "gold standard" that other methods approximate |
| Ch. 11 (Selection on Observables) | Experiments provide benchmarks for validating observational methods (LaLonde); when randomization is infeasible, SOO is an alternative |
| Ch. 12 (IV) | Non-compliance creates IV interpretation; ITT/LATE distinction |
| Ch. 19 (Mechanisms) | Experiments establish effects; mechanism analysis explains them |
| Ch. 20 (Heterogeneity) | Experimental analysis of treatment effect heterogeneity |
| Ch. 24 (Evidence Synthesis) | Meta-analysis aggregates experimental evidence |
Triangulation with Non-Experimental Methods
Experimental evidence is strongest when triangulated with:
Quasi-experimental evidence: Do natural experiments yield similar estimates?
Observational evidence: Do selection-on-observables analyses align?
Theoretical predictions: Is the effect size consistent with theory?
Mechanism evidence: Do proposed mechanisms explain the findings?
Agreement across methods strengthens confidence. Disagreement prompts investigation.
Box: Experiments as Methodological Benchmarks
Beyond estimating treatment effects, experiments serve a second important function: validating nonexperimental methods. When experimental and nonexperimental estimates exist for the same question, comparing them reveals whether observational methods work in that setting.
The landmark example is LaLonde (1986). The National Supported Work demonstration randomly assigned participants to a job training program, providing an experimental benchmark. LaLonde then asked: Can nonexperimental methods—applied to the same treated individuals but using comparison groups from surveys—replicate the experimental estimate?
His answer was discouraging: nonexperimental estimates varied wildly and often had the wrong sign. This finding helped catalyze the credibility revolution's emphasis on research design over statistical adjustment.
Four decades of subsequent research have refined this picture (Imbens & Xu 2025):
Overlap is essential: With comparable treated and control groups, modern methods (matching, IPW, AIPW) yield stable estimates. Without overlap, estimates depend heavily on extrapolation.
Stable estimates ≠ valid estimates: Several modern methods produce ATT estimates close to the LaLonde experimental benchmark—seemingly vindicating propensity score methods. But placebo tests (estimating "effects" on pre-treatment outcomes) fail badly, suggesting unconfoundedness does not hold.
The sobering lesson: Methods that robustly estimate the statistical estimand may still fail to recover the causal estimand. Passing the benchmark test in one sample doesn't guarantee validity in another.
This "within-study comparison" design—using experiments to test nonexperimental methods—has become a methodological research program of its own. It offers the field honest feedback about what our methods can and cannot do.
See Chapter 11 for full discussion of selection on observables and the LaLonde benchmark.
Summary
Key takeaways:
Random assignment breaks confounding: By making treatment independent of potential outcomes, experiments enable simple comparisons to identify causal effects.
Design matters: Power analysis, stratification, and proper clustering are essential for valid and efficient experiments.
Threats require attention: Attrition, non-compliance, and spillovers can compromise even randomized experiments. ITT is always identified; TOT requires additional assumptions.
External validity is a challenge: Experimental effects in specific contexts may not generalize. Theory, replication, and mechanism understanding help bridge this gap.
Experiments are one tool among many: Powerful for some questions, infeasible or uninformative for others. The best research uses experiments where appropriate and complements them with other evidence.
Returning to the opening question: We can randomize our way to causal knowledge when experiments are feasible, ethical, and properly designed. But randomization does not solve all problems: we must still address attrition, compliance, spillovers, and external validity. The credibility of experimental evidence comes not from the method alone, but from careful design, transparent reporting, and thoughtful interpretation of what experiments can and cannot tell us.
Further Reading
Essential
Duflo, Glennerster & Kremer (2007). "Using Randomization in Development Economics Research." Handbook of Development Economics. The field experiment toolkit.
Gerber & Green (2012). Field Experiments: Design, Analysis, and Interpretation. Comprehensive textbook.
For Deeper Understanding
Athey & Imbens (2017). "The Econometrics of Randomized Experiments." Handbook of Economic Field Experiments.
Bruhn & McKenzie (2009). "In Pursuit of Balance." American Economic Journal: Applied.
Critical Perspectives
Deaton (2010). "Instruments, Randomization, and Learning about Development." JEL. The critique.
Heckman & Smith (1995). "Assessing the Case for Social Experiments." JEP. Limitations of experiments.
Banerjee et al. (2017). "A Theory of Experimenters." Response to criticism.
Applications
Banerjee et al. (2015). "The Miracle of Microfinance?" Six-country study.
Finkelstein et al. (2012). "The Oregon Health Insurance Experiment." Health insurance effects.
Chetty, Friedman & Rockoff (2014). "Measuring the Impacts of Teachers." Value-added with experimental validation.
Experiments as Methodological Benchmarks
Imbens & Xu (2025). "Comparing Experimental and Nonexperimental Methods: What Lessons Have We Learned Four Decades after LaLonde (1986)?" Journal of Economic Perspectives 39(4): 173–202. Essential retrospective on using experiments to validate observational methods.
LaLonde (1986). "Evaluating the Econometric Evaluations of Training Programs with Experimental Data." AER. The original within-study comparison.
Cook, Shadish & Wong (2008). "Three Conditions Under Which Experiments and Observational Studies Produce Comparable Causal Estimates." Journal of Policy Analysis and Management. When do methods agree?
Exercises
Conceptual
Explain why random assignment ensures that $E[Y(0) \mid D=1] = E[Y(0) \mid D=0]$. What would violate this equality even with random assignment?
A researcher conducts an experiment with 50% compliance in the treatment arm and 10% "contamination" (control subjects obtaining treatment).
What is the ITT effect if mean outcomes are 100 in the treatment arm and 90 in control?
What is the TOT/LATE estimate?
What assumption is required for the TOT interpretation?
Applied
Design a field experiment to test whether text message reminders increase savings among low-income households.
What is the treatment and control?
Would you randomize at the individual or household level? Why?
What outcomes would you measure?
Conduct a power calculation assuming 20% outcome standard deviation, 5% minimum detectable effect, 80% power, and ICC of 0.05 if clustering.
A job training experiment finds large positive effects (20% earnings increase) in a pilot with motivated volunteers at high-performing training centers. Discuss five reasons why effects might differ when the program is scaled nationally.
Discussion
"Randomized experiments are the gold standard for causal inference." Critically evaluate this claim. Under what circumstances might non-experimental evidence be more valuable than experimental evidence for informing policy?
Technical Appendix
A.1 Variance of Difference-in-Means Estimator
For the estimator $\hat{\tau} = \bar{Y}_1 - \bar{Y}_0$:
$$\text{Var}(\hat{\tau}) = \frac{\sigma_1^2}{n_1} + \frac{\sigma_0^2}{n_0}$$
Under homoskedasticity ($\sigma_1^2 = \sigma_0^2 = \sigma^2$) and equal allocation ($n_1 = n_0 = n/2$):
$$\text{Var}(\hat{\tau}) = \frac{4\sigma^2}{n}$$
A.2 Design Effect for Clustering
The design effect (DEFF) measures variance inflation from clustering:
$$\text{DEFF} = 1 + (k-1)\rho$$
where $k$ is the average cluster size and $\rho$ is the ICC.
The effective sample size is:
$$n_{\text{eff}} = \frac{n}{\text{DEFF}} = \frac{mk}{1 + (k-1)\rho}$$
A.3 Lee Bounds for Attrition
If attrition is monotonic (treatment only increases or only decreases attrition), Lee bounds trim the group with lower attrition to equalize attrition rates.
Let p be the proportion to trim. Lower and upper bounds on the treatment effect are:
$$[\hat{\tau}_L, \hat{\tau}_U] = \left[\,\bar{Y}_1^{\text{trim,low}} - \bar{Y}_0,\; \bar{Y}_1^{\text{trim,high}} - \bar{Y}_0\,\right]$$
where trimmed values remove the top or bottom p fraction of treatment group outcomes.
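A minimal sketch of the trimming procedure, assuming attrition is higher in the control arm so that the treatment arm is trimmed (function and variable names are illustrative):
```r
# Lee bounds under monotonic, one-sided differential attrition
lee_bounds <- function(y, d, observed) {
  p1 <- mean(observed[d == 1])                 # retention rate, treatment arm
  p0 <- mean(observed[d == 0])                 # retention rate, control arm
  trim <- (p1 - p0) / p1                       # share of observed treated outcomes to trim
  y1 <- sort(y[d == 1 & observed])
  n_keep <- length(y1) - floor(trim * length(y1))
  y0_mean <- mean(y[d == 0 & observed])
  c(lower = mean(head(y1, n_keep)) - y0_mean,  # trim the largest treated outcomes
    upper = mean(tail(y1, n_keep)) - y0_mean)  # trim the smallest treated outcomes
}
# The bounds apply to units whose outcomes would be observed under either assignment.
```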