Chapter 20: Heterogeneity and Generalization

Opening Question

If we know the average treatment effect, what more do we need to know—and what can we know—about for whom it works and where else it would work?


Chapter Overview

The average treatment effect is an average. It tells us what happens on average when a population receives treatment. But averages conceal variation. A drug with no average effect might cure some patients and harm others. A policy that "works" on average might benefit the privileged while hurting the disadvantaged. Understanding heterogeneity—variation in treatment effects across individuals, groups, or contexts—is essential for targeting interventions and translating findings across settings.

This chapter addresses two related questions. First, for whom does the treatment work? This is the heterogeneity question, which we address through subgroup analysis and modern machine learning methods. Second, where else would it work? This is the generalization question, concerning external validity and transportability.

What you will learn:

  • The risks of subgroup analysis and how to do it responsibly

  • Machine learning methods for discovering heterogeneous treatment effects (causal forests)

  • The external validity problem and when findings generalize

  • Transportability methods for extrapolating to new populations

  • Multi-site trials and meta-analytic approaches to heterogeneity

Prerequisites: Chapter 9 (Causal Framework), Chapter 10 (Experimental Methods), Chapter 21 (ML for Causal Inference—may be read concurrently)


20.1 Why Heterogeneity Matters

Beyond Average Effects

Consider a job training program evaluated by RCT. The average treatment effect is a 5 percentage point increase in employment. This is policy-relevant, but incomplete:

  • Targeting: Should everyone receive training, or only those who benefit most? If effects are concentrated among high school dropouts, targeting them is more efficient.

  • Equity: Does the program reduce or exacerbate inequality? If it helps only the already-advantaged, equity concerns arise.

  • Mechanism: Why does it work? Heterogeneity by baseline characteristics may reveal mechanisms (Chapter 19).

  • Scaling: Will effects persist at scale? If early adopters differ from later participants, average effects may not generalize.

Types of Heterogeneity

Effect heterogeneity by observable characteristics: Effects vary by gender, age, education, baseline outcomes, etc.

Effect heterogeneity by context: Effects vary by location, time period, implementing organization, etc.

Essential heterogeneity: Effects vary by unobserved characteristics that also influence selection into treatment.

Distributional effects: Effects on different quantiles of the outcome distribution (e.g., minimum wage affects low-wage workers differently than high-wage).


20.2 The Conditional Average Treatment Effect

Definition

Definition 20.1 (Conditional Average Treatment Effect, CATE): The CATE is the average treatment effect for a subgroup defined by covariates X:

\tau(x) = E[Y(1) - Y(0) | X = x]

The ATE is the average of the CATE over the population distribution of X:

ATE = E[\tau(X)] = \int \tau(x) \, dF(x)

Figure 20.1: Distribution of Treatment Effects

Figure 20.1: The same ATE can arise from very different distributions of individual treatment effects. With homogeneous effects (left), the ATE is representative of most individuals. With heterogeneous effects (right), the ATE masks important variation—some individuals may be harmed while others greatly benefit.

Identification

If we have identified the ATE (via RCT, DiD, IV, etc.), we can often identify CATE by:

  1. Stratification: Estimate effects separately within subgroups

  2. Interaction models: Include treatment × covariate interactions in regression

  3. Machine learning: Use flexible methods to estimate \tau(x) as a function

Example: In an RCT, the CATE is identified by: \tau(x) = E[Y | D=1, X=x] - E[Y | D=0, X=x]

The same selection-on-observables, DiD, or IV assumptions that identify the ATE also identify the CATE if imposed conditional on X.


20.3 Subgroup Analysis: Risks and Best Practices

The Dangers of Subgroup Analysis

Subgroup analysis is notoriously unreliable. Classic problems:

1. Multiple testing

With 20 subgroups, we expect one false positive at \alpha = 0.05 even if no true heterogeneity exists. Researchers often test many subgroups and report only significant findings.

2. Specification searching

The definition of subgroups (age cutoffs, category groupings) can be manipulated to achieve significance.

3. Low power

Subgroups are smaller than the full sample. Effects that are significant overall may be insignificant within subgroups, leading to false conclusions of "no effect" for some groups.

4. Regression to the mean

Subgroups selected for having extreme outcomes in one period tend to have less extreme outcomes in another.

Sobering Evidence

The ISIS-2 trial tested aspirin for heart attack patients. In pre-specified subgroups, effects were consistent. But when analyzed by astrological sign, Geminis and Libras showed no benefit—a "finding" reflecting noise, not biology.

This illustrates: with enough subgroups, some will show spurious heterogeneity.

Best Practices

Principle 20.1 (Subgroup Analysis Guidelines):

  1. Pre-specify subgroups before seeing outcomes

  2. Limit the number of subgroups examined

  3. Test for interaction, not just significance within subgroups

  4. Adjust for multiple comparisons

  5. Report all pre-specified subgroup analyses, not just significant ones

  6. Treat exploratory subgroup analysis as hypothesis-generating, not confirmatory

Pre-specification: Register subgroup analyses in a pre-analysis plan. This prevents cherry-picking.

Interaction tests: Don't just report "the effect is significant for women but not for men." Test whether the effects differ significantly:

Y_i = \beta_0 + \beta_1 D_i + \beta_2 Female_i + \beta_3 (D_i \times Female_i) + \varepsilon_i

\beta_3 tests heterogeneity; \beta_1 + \beta_3 gives the female-specific effect.

Multiple comparison adjustments: Bonferroni, Benjamini-Hochberg, or other corrections for multiple testing.
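A minimal R sketch of the interaction test and a Benjamini-Hochberg adjustment, using simulated data (all names are illustrative):

```r
# Interaction test: does the treatment effect differ by subgroup?
set.seed(1)
n <- 1000
female <- rbinom(n, 1, 0.5)
D <- rbinom(n, 1, 0.5)
y <- 0.2 * D + 0.3 * D * female + rnorm(n)   # true effect larger for women

fit <- lm(y ~ D * female)
summary(fit)   # the D:female coefficient is the interaction test

# With several pre-specified interaction tests, adjust the p-values:
pvals <- c(0.012, 0.034, 0.210, 0.048)   # illustrative p-values
p.adjust(pvals, method = "BH")           # Benjamini-Hochberg correction
```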


20.4 Machine Learning for Heterogeneous Treatment Effects

The Promise of ML

Machine learning offers an alternative to pre-specified subgroup analysis: let the algorithm find heterogeneity.

Advantages:

  • Discovers interactions and nonlinearities humans wouldn't specify

  • Handles high-dimensional covariates

  • Provides principled variable importance measures

Challenges:

  • May overfit, finding spurious heterogeneity

  • "Black box" methods are hard to interpret

  • Standard ML optimizes prediction, not causal inference

Causal Forests (Generalized Random Forests)

Wager and Athey (2018) and Athey, Tibshirani, and Wager (2019) develop causal forests—random forests adapted for treatment effect estimation.

Key idea: Standard random forests predict E[Y|X]. Causal forests estimate \tau(x) = E[Y(1) - Y(0) | X = x] by:

  1. Growing trees that partition the covariate space

  2. At each leaf, estimating the treatment effect using observations in that leaf

  3. Aggregating across trees (forest)

Splitting criterion: Instead of minimizing prediction error, splits maximize heterogeneity in treatment effects between child nodes.

Honest Estimation

A key innovation is honesty: the same data should not be used to both determine splits and estimate effects within leaves.

Definition 20.2 (Honest Estimation): An estimation procedure is honest if it uses different data for determining the model structure (splits) and estimating parameters (leaf effects).

Implementation: Split the sample. Use one portion to grow trees (determine structure). Use another portion to estimate treatment effects within leaves.

Why does this matter? Without honesty, trees overfit—they find splits that look like heterogeneity in sample but don't generalize.

The GRF Package

The grf package (R) implements generalized random forests.
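A minimal sketch on simulated data (all names are illustrative; grf uses honest splitting by default):

```r
library(grf)

set.seed(1)
n <- 2000; p <- 10
X <- matrix(rnorm(n * p), n, p)
W <- rbinom(n, 1, 0.5)              # randomized binary treatment
tau <- pmax(X[, 1], 0)              # true CATE varies with the first covariate
Y <- tau * W + X[, 2] + rnorm(n)

cf <- causal_forest(X, Y, W)

tau.hat <- predict(cf, estimate.variance = TRUE)
head(tau.hat$predictions)                # per-observation CATE estimates
head(sqrt(tau.hat$variance.estimates))   # standard errors

average_treatment_effect(cf)             # doubly robust ATE estimate
variable_importance(cf)                  # which covariates drive splits
```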

What you get:

  • CATE estimates \hat{\tau}(x_i) for each observation

  • Confidence intervals (via jackknife)

  • Variable importance scores

Interpreting Results

Causal forests produce individual-level effect estimates. How to summarize?

Best Linear Predictor (BLP): Regress \hat{\tau}(X_i) on X_i to find which variables drive heterogeneity.

Calibration test: Check whether \hat{\tau}(X) actually predicts variation in effects:

Y_i = \alpha + \beta \cdot \hat{\tau}(X_i) \cdot D_i + \gamma D_i + \delta X_i + \varepsilon_i

If \beta = 1, the CATE estimates are well-calibrated.

Targeting: Use \hat{\tau}(X) to identify high-benefit groups for treatment targeting.
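Both summaries are available in grf; a hedged sketch, assuming the forest cf and covariate matrix X from the example above:

```r
best_linear_projection(cf, X)   # BLP of tau(X) on the covariates
test_calibration(cf)            # omnibus calibration test: coefficients near 1
                                # on the mean and differential forest predictions
                                # indicate a correct average effect and real,
                                # well-calibrated heterogeneity
```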

Box: Heterogeneous Effects Are Harder to Recover Than Average Effects

A sobering finding from the LaLonde literature: even when modern methods successfully recover average treatment effects, they may fail to recover heterogeneous effects.

Imbens & Xu (2025) demonstrate this vividly. Using LaLonde data, they estimate both the ATT (average treatment effect on the treated) and CATTs (conditional average treatment effects on the treated)—effects for subgroups defined by baseline characteristics.

For ATT: Several modern methods (matching, IPW, AIPW, causal forests) produce estimates close to the experimental benchmark of roughly $1,800, especially after trimming for overlap.

For CATTs: The picture is starkly different. Scatter plots of experimental vs. nonexperimental CATT estimates show weak correlation. Subgroups where the experimental estimate is positive often have negative nonexperimental estimates, and vice versa.

Why the divergence? Average effects allow errors to cancel: overestimates for some subgroups offset underestimates for others. Conditional effects require getting each subgroup right separately—a much harder task. Small amounts of unobserved confounding that wash out in the average can create large errors in subgroup-specific estimates.

Implications for practice:

  • Successful ATT estimation doesn't validate CATE estimation

  • Be cautious about targeting interventions based on estimated CATEs from observational data

  • Validation (placebo tests, held-out data, experimental benchmarks) becomes even more important for heterogeneous effects

  • When possible, estimate heterogeneity within experimental data, not by extending observational methods


20.5 Other ML Approaches

Meta-Learners

Meta-learners use any ML algorithm as a base learner and construct treatment effect estimates.

T-learner:

  1. Fit \hat{\mu}_1(x) = E[Y | X=x, D=1] on treated observations

  2. Fit \hat{\mu}_0(x) = E[Y | X=x, D=0] on control observations

  3. \hat{\tau}(x) = \hat{\mu}_1(x) - \hat{\mu}_0(x) (see the sketch after this list)

S-learner:

  1. Fit \hat{\mu}(x,d) = E[Y | X=x, D=d] including treatment as a covariate

  2. \hat{\tau}(x) = \hat{\mu}(x,1) - \hat{\mu}(x,0)

X-learner (Künzel et al. 2019):

  1. Fit T-learner models

  2. Impute treatment effects for each observation

  3. Fit models for imputed effects

  4. Weight by propensity score
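A minimal T-learner sketch with linear base learners (any ML regressor could be substituted; data are simulated):

```r
set.seed(1)
n <- 2000
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$D <- rbinom(n, 1, 0.5)
dat$Y <- 1 + dat$x1 + dat$D * (0.5 + dat$x2) + rnorm(n)   # true CATE = 0.5 + x2

m1 <- lm(Y ~ x1 + x2, data = dat[dat$D == 1, ])   # treated-arm outcome model
m0 <- lm(Y ~ x1 + x2, data = dat[dat$D == 0, ])   # control-arm outcome model
tau.hat <- predict(m1, newdata = dat) - predict(m0, newdata = dat)

cor(tau.hat, 0.5 + dat$x2)   # how well the estimates track the true CATE
```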

Double Machine Learning

Double ML (Chapter 21) can estimate heterogeneous effects by:

  1. Using ML to estimate nuisance functions E[Y|X] and E[D|X]

  2. Estimating CATE in a second stage using the orthogonalized residuals
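A stylized partialling-out sketch in this spirit: linear models stand in for the cross-fitted ML nuisance estimators of Chapter 21, and the CATE is assumed linear in one covariate (all names are illustrative):

```r
set.seed(1)
n <- 2000
x1 <- rnorm(n); x2 <- rnorm(n)
D <- rbinom(n, 1, 0.5)
Y <- 1 + x1 + D * (0.5 + x2) + rnorm(n)   # true CATE = 0.5 + x2

e.Y <- resid(lm(Y ~ x1 + x2))   # residualized outcome, Y - E[Y|X]
e.D <- resid(lm(D ~ x1 + x2))   # residualized treatment, D - E[D|X]

fit <- lm(e.Y ~ 0 + e.D + e.D:x2)   # models tau(x) = b1 + b2 * x2
coef(fit)                           # approx. 0.5 and 1
```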

Bayesian Approaches

BART (Bayesian Additive Regression Trees) provides:

  • Posterior distributions for treatment effects

  • Natural uncertainty quantification

  • Handles heterogeneity through tree ensemble


20.6 Policy Learning: From CATE to Optimal Treatment

Estimating treatment effect heterogeneity is valuable, but the ultimate question is: Who should we treat? Policy learning translates CATE estimates into actionable treatment assignment rules.

The Policy Learning Problem

Given estimated \hat{\tau}(x), we want to find a treatment rule \pi(x) \in \{0, 1\} that maximizes expected welfare:

\pi^* = \arg\max_\pi E[\pi(X) \cdot \tau(X)]

Subject to constraints:

  • Budget: E[\pi(X)] \leq B (can only treat fraction B)

  • Fairness: Treatment cannot depend on protected characteristics

  • Interpretability: Rule must be explainable

Simple Policy Rules

Threshold rules: Treat if \hat{\tau}(X) > c

For a budget constraint treating fraction B (see the sketch after these steps):

  1. Estimate \hat{\tau}(X_i) for all units

  2. Find c^* such that P(\hat{\tau}(X) > c^*) = B

  3. Treat units with \hat{\tau}(X_i) > c^*
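A minimal sketch of the thresholding step; tau.hat stands in for CATE estimates from any of the methods above:

```r
set.seed(1)
tau.hat <- rnorm(1000, mean = 0.3, sd = 0.5)   # placeholder CATE estimates
B <- 0.25                                      # budget: can treat 25% of units
c.star <- quantile(tau.hat, probs = 1 - B)     # threshold c*
treat <- tau.hat > c.star                      # the treatment rule pi(x)
mean(treat)                                    # approx. B by construction
```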

Welfare Analysis

The grf package provides formal welfare comparisons:
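One hedged way to sketch such a comparison uses doubly robust scores (here via the policytree helper double_robust_scores), assuming the causal forest cf from the Section 20.4 example:

```r
library(policytree)

Gamma <- double_robust_scores(cf)   # n x 2 matrix: DR score for each action
rule <- as.integer(predict(cf)$predictions > 0) + 1   # 1 = control, 2 = treat

welfare <- c(
  rule       = mean(Gamma[cbind(seq_len(nrow(Gamma)), rule)]),
  treat_all  = mean(Gamma[, 2]),
  treat_none = mean(Gamma[, 1])
)
welfare   # in practice, evaluate the rule on held-out scores to avoid optimism
```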

Policy Trees

For interpretable rules, policy trees learn optimal treatment assignment as a decision tree:
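A minimal sketch with the policytree package, assuming X and Gamma from the sketches above; shallow trees keep the rule interpretable:

```r
library(policytree)

pt <- policy_tree(X, Gamma, depth = 2)   # depth-2 assignment rule
print(pt)                                # splits and the action in each leaf
pi.hat <- predict(pt, X)                 # 1 = control, 2 = treat
table(pi.hat)                            # how many units the rule treats
```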

Advantages of policy trees:

  • Human-interpretable (can explain to stakeholders/regulators)

  • Transparent about which characteristics drive targeting

  • Can incorporate fairness constraints directly

Figure 20.2: Policy Tree for Treatment Assignment

Figure 20.2: A policy tree translates CATE estimates into actionable treatment rules. The tree identifies subgroups with different treatment effects: young workers with low education benefit most from job training (CATE = +0.8), while older workers with prior employment may be harmed (CATE = -0.1). Such trees are interpretable and can be explained to stakeholders.

When Policy Learning Makes Sense

  • Treatment is costly, effects vary: Yes - targeting saves resources

  • Universal treatment is feasible/desirable: Maybe - equity may trump efficiency

  • Fairness constraints are important: Yes - can build in constraints

  • Need interpretable rules for implementation: Use policy trees

  • Just want to understand who benefits: CATE estimation may suffice

Cautions

Warning: Estimation Error Propagates

Policy learning uses the estimated \hat{\tau}(X), which carries estimation error. High-variance CATE estimates lead to noisy treatment rules. Always:

  • Report confidence intervals on expected welfare

  • Compare to simpler policies (treat all, treat none)

  • Consider robustness to CATE estimation method

Ethical considerations:

  • Targeting based on race, gender, or other protected characteristics may be illegal or unethical even if "optimal"

  • Algorithmic assignment may lack transparency—stakeholders may object

  • Trade-off between efficiency (treat those who benefit most) and equity (treat those most in need)


20.7 The External Validity Problem

What Is External Validity?

Definition 20.3 (External Validity): A finding has external validity if the causal relationship identified in the study generalizes to other populations, settings, treatments, or outcomes.

Internal validity: Did we correctly identify the effect in this study? External validity: Does that effect apply elsewhere?

A study can have perfect internal validity (well-identified effect in sample) but poor external validity (effect doesn't generalize).

Sources of External Validity Failure

1. Population differences: The study sample differs from the target population.

  • RCTs often use convenience samples (college students, one clinic's patients)

  • Treatment effects may vary with characteristics uncommon in the study

2. Setting differences: Context matters.

  • A tutoring program works in well-resourced schools; would it work in under-resourced schools?

  • Labor market programs depend on labor market conditions

3. Treatment differences: The studied treatment differs from the deployed treatment.

  • Efficacy trials (ideal conditions) vs. effectiveness (real-world conditions)

  • Small-scale pilots vs. at-scale implementation

4. Time differences: Effects may change over time.

  • Technology evolves

  • Populations adapt

  • Equilibrium effects emerge at scale

The LATE Problem Revisited

Instrumental variables identify the Local Average Treatment Effect (LATE)—the effect for compliers. But compliers may differ from:

  • Always-takers: Who would take treatment regardless

  • Never-takers: Who would never take treatment

  • The policy target population: Who would be affected by a proposed policy

The Heckman critique emphasizes that different instruments identify different LATEs, and none may be policy-relevant.

Box: LATE Weights and Heterogeneity

When treatment effects are heterogeneous, IV estimates are weighted averages of individual treatment effects, with weights determined by how much each individual's treatment status responds to the instrument.

The weighting formula (Angrist & Imbens 1995, Mogstad & Torgovitsky 2018):

\beta_{IV} = \frac{E[\omega_i \cdot \tau_i]}{E[\omega_i]}

where:

  • \tau_i is individual i's treatment effect

  • \omega_i is i's weight, proportional to how strongly the instrument shifts their treatment

Implications:

1. Different instruments, different estimates. A quarter-of-birth instrument weights compliers who respond to compulsory schooling laws. A distance-to-college instrument weights compliers who respond to college proximity. These are different people, so estimates differ even if both instruments are valid.

2. Marginal vs. inframarginal. IV weights those at the margin of treatment. If a job training program serves the most motivated (always-takers) and the least motivated are never-takers, IV identifies effects for the middle—those who would enroll if slots are available. This may or may not be policy-relevant.

3. Negative weights are possible. With multiple treatments or complex designs, some individuals can receive negative weights—their treatment effects are subtracted from the weighted average. This can produce estimates outside the range of individual effects.

Connecting to CATE estimation: If you've estimated \hat{\tau}(X) via causal forests (Section 20.4), you can:

  1. Identify likely compliers based on characteristics

  2. Compare \hat{\tau}(X) for compliers vs. always/never-takers

  3. Assess whether IV-identified effects plausibly generalize

Practical guidance: When reporting IV estimates, characterize the complier population. Who are these marginal individuals? Are they representative of those a policy would target? If not, acknowledge the limitation.


20.8 Transportability

Formalizing Generalization

Transportability theory (Pearl and Bareinboim 2014) formalizes when and how findings generalize across populations.

Setting: We have data from a study population SS and want to estimate effects in a target population TT.

Key question: Under what conditions can we "transport" the causal effect from S to T?

Figure 20.3: Transportability from Source to Target Population

Figure 20.3: The transportability challenge. Findings from a source population (e.g., an RCT sample) may not apply to a target population (e.g., the policy-relevant population) if the populations differ on effect modifiers. The key questions are: what determines selection into each population, and do those factors modify the treatment effect?

When Can We Transport?

Case 1: Random sampling

If the study is a random sample from the target population, study findings apply directly (subject to sampling error).

Case 2: Selection on observables

If the study differs from the target on observed covariates X, but X captures all relevant effect modifiers:

\tau^T = \int \tau^S(x) \, dF^T(x)

We reweight the study effects by the target population distribution of X.

Case 3: Selection on unobservables

If the study differs on unobserved effect modifiers, transportability fails without additional assumptions.

Practical Approaches

1. Reweighting: Estimate \tau(x) in the study, then reweight to the target:

\hat{\tau}^T = \frac{1}{N^T} \sum_{i \in S} \hat{\tau}(x_i) \cdot w_i

where w_i adjusts for the difference between study and target distributions.

2. Modeling heterogeneity: Estimate CATE as a function of effect modifiers; apply to target population characteristics.

3. Sensitivity analysis: Assess how much unobserved effect modification would be needed to change conclusions.

Worked Example: Transporting a Job Training Effect

A job training RCT was conducted in urban centers. We want to predict the effect in rural areas.

Study population (urban):

  • 60% high school graduates, 40% college graduates

  • Average baseline earnings: $35,000

Target population (rural):

  • 80% high school graduates, 20% college graduates

  • Average baseline earnings: $28,000

Estimated CATEs from study (by education):

  • High school: \hat{\tau}_{HS} = \$2{,}500 (SE = \$400)

  • College: \hat{\tau}_{College} = \$1{,}200 (SE = \$350)

Step 1: Check if education is an effect modifier

The effects differ by education (p < 0.05 for difference), so we should transport using stratified effects.

Step 2: Reweight to target population

\hat{\tau}^{Rural} = 0.80 \times \$2{,}500 + 0.20 \times \$1{,}200 = \$2{,}240

Compare to the naive study average: \hat{\tau}^{Urban} = 0.60 \times \$2{,}500 + 0.40 \times \$1{,}200 = \$1{,}980

Step 3: Propagate uncertainty

SE(\hat{\tau}^{Rural}) = \sqrt{0.80^2 \times 400^2 + 0.20^2 \times 350^2} \approx \$328

95% CI: [\$1{,}598, \$2{,}882]
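A quick R check of the arithmetic above:

```r
tau <- c(HS = 2500, College = 1200)   # estimated CATEs by education
se  <- c(HS = 400,  College = 350)    # their standard errors
w.rural <- c(0.80, 0.20)              # target (rural) shares
w.urban <- c(0.60, 0.40)              # study (urban) shares

sum(w.rural * tau)                                # transported effect: 2240
sum(w.urban * tau)                                # study average: 1980
se.rural <- sqrt(sum(w.rural^2 * se^2))           # ~328
sum(w.rural * tau) + c(-1, 1) * 1.96 * se.rural   # 95% CI ~ [1598, 2882]
```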

Step 4: Sensitivity to unobserved modifiers

We assumed education captures all relevant differences. But rural workers might differ in:

  • Labor market conditions (fewer employers)

  • Job types available

  • Transportation constraints

If these factors reduce effectiveness by 20%, the transported effect would be \$2{,}240 \times 0.80 = \$1{,}792—still positive but smaller.

Conclusion: The transported estimate ($2,240) is higher than the urban estimate because rural areas have more high school graduates, who benefit more from training. But this relies on education fully capturing heterogeneity—an assumption we should probe with sensitivity analysis.


20.9 Multi-Site Trials and Meta-Analysis

Learning from Multiple Studies

When multiple studies (sites, replications) address the same question, we can:

  1. Pool results for more precise average effects

  2. Examine cross-site heterogeneity

  3. Identify effect moderators

Multi-Site RCTs

Multi-site trials randomize treatment at multiple locations, then examine:

  • Average effect: Pooled across sites

  • Heterogeneity: Variation in site-specific effects

  • Moderators: Site characteristics that predict larger/smaller effects

Example: The Tennessee STAR experiment randomized students and teachers to small or regular classes within participating schools. Site-specific effects varied, with larger effects in schools serving disadvantaged students.

Meta-Analysis for Heterogeneity

Meta-analysis pools estimates across studies. Random-effects models allow for heterogeneity:

\hat{\tau}_j \sim N(\tau_j, s_j^2), \qquad \tau_j \sim N(\mu, \tau^2)

where:

  • \hat{\tau}_j is the estimate from study j

  • \tau_j is the true effect in study j

  • \mu is the overall average effect

  • \tau^2 is the between-study variance (heterogeneity)

I^2 statistic: The proportion of total variance due to between-study heterogeneity:

I^2 = \frac{\tau^2}{\tau^2 + \bar{s}^2}

High I^2 (above 50%) indicates substantial heterogeneity.
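A random-effects fit takes a few lines in R; this sketch uses the metafor package with illustrative numbers:

```r
library(metafor)

tau.hat <- c(0.12, 0.35, 0.08, 0.50, 0.22)   # study-level effect estimates
se      <- c(0.10, 0.12, 0.09, 0.20, 0.15)   # their standard errors

fit <- rma(yi = tau.hat, sei = se, method = "REML")
summary(fit)   # reports the pooled effect, tau^2, and I^2
```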

Meta-Regression

If we observe study characteristics Z_j, we can estimate:

\tau_j = \beta_0 + \beta_1 Z_j + u_j

This identifies study-level moderators: which study features predict larger effects?
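Continuing the metafor sketch above with a hypothetical study-level moderator Z:

```r
Z <- c(0, 1, 0, 1, 1)   # e.g., an indicator for a study characteristic
fit.mr <- rma(yi = tau.hat, sei = se, mods = ~ Z, method = "REML")
summary(fit.mr)         # the coefficient on Z: does it predict larger effects?
```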


20.10 Running Example: Returns to Education

Heterogeneity in Returns

The average return to education masks substantial variation:

By education level: Returns to completing high school may differ from returns to college.

By ability: High-ability students may have higher returns (complementarity) or lower returns (already would have succeeded).

By family background: Returns may be higher for disadvantaged students (marginal students).

By time period: Returns have increased over recent decades.

By country: Returns vary with labor market institutions and skill premiums.

Evidence

Causal forest approaches (Bertrand et al., various contexts):

  • Find that returns are higher for individuals from disadvantaged backgrounds

  • This suggests compulsory schooling (which affects marginal students) has above-average returns

LATE vs. ATE:

  • IV estimates (using compulsory schooling) identify effects for compliers—students at the margin

  • These may exceed OLS estimates if returns are higher for marginal students

Cross-country variation:

  • Psacharopoulos and Patrinos compile returns across countries

  • Returns are generally higher in developing countries

  • But heterogeneity within countries is substantial

Implications for Generalization

Findings from one context may not transport:

  • U.S. returns may not apply to European labor markets

  • Returns for college-educated may not apply to high school

  • Historical returns may not predict future returns as education expands

Understanding heterogeneity helps assess when findings generalize.


Practical Guidance

When to Investigate Heterogeneity

  • Policy requires targeting: Estimate CATE; identify high-benefit groups

  • Equity concerns: Examine effects by disadvantaged status

  • Scaling planned: Assess heterogeneity by early/late adopter characteristics

  • Generalization needed: Estimate moderators; model transportability

  • Exploratory analysis: Use ML methods; treat as hypothesis-generating

Common Pitfalls

Pitfall 1: Subgroup fishing. Searching through subgroups until finding significance, then presenting the result as confirmatory.

How to avoid: Pre-register subgroup analyses; adjust for multiple comparisons; report all analyses.

Pitfall 2: Confusing "not significant in subgroup" with "no effect." Small subgroups have low power; failure to reject the null doesn't mean no effect.

How to avoid: Test for interaction (difference between subgroups), not just within-subgroup significance.

Pitfall 3: Over-interpreting ML heterogeneity. Causal forests find heterogeneity; some of it is noise.

How to avoid: Use honest estimation; conduct calibration tests; replicate in held-out data.

Pitfall 4: Ignoring external validity. A precisely estimated effect in one context may not apply elsewhere.

How to avoid: Discuss what effect modifiers might differ; conduct sensitivity analysis; seek multi-site evidence.

Implementation Checklist

  • Pre-specify subgroups and register the analysis plan

  • Test for interactions, not just within-subgroup significance

  • Adjust for multiple comparisons; report all pre-specified analyses

  • For ML-based CATEs, use honest estimation and run calibration tests

  • Validate heterogeneity in held-out or experimental data where possible

  • For generalization, identify effect modifiers, reweight to the target population, and probe sensitivity to unobserved modifiers


Integration Note

Connections to Other Methods

  • Mechanisms (Ch. 19): Heterogeneity may reveal mechanisms

  • Experiments (Ch. 10): Multi-arm RCTs can test heterogeneity

  • ML for Causal Inference (Ch. 21): Causal forests, DML for CATE

  • Evidence Synthesis (Ch. 24): Meta-analysis pools heterogeneous estimates

Triangulation Strategies

Evidence for heterogeneity is stronger when:

  1. Multiple methods agree: Subgroup analysis, causal forests, interaction tests

  2. Theoretical prediction: Heterogeneity aligns with theory

  3. Pre-specification: Analysis was planned before seeing results

  4. Replication: Heterogeneity pattern replicates in new data

  5. Mechanism evidence: Heterogeneity explained by plausible mechanisms


Summary

Key takeaways:

  1. Heterogeneity matters for targeting, equity, mechanism, and generalization. Average effects conceal important variation.

  2. Subgroup analysis is dangerous due to multiple testing and specification searching. Pre-specify, test for interactions, and adjust for multiplicity.

  3. Machine learning methods (causal forests, meta-learners) can discover heterogeneity, but require honest estimation and careful interpretation.

  4. External validity concerns whether findings generalize. Population, setting, treatment, and time differences all threaten generalization.

  5. Transportability formalizes when we can extrapolate. Reweighting and effect modifier modeling help, but sensitivity analysis is essential.

  6. Multi-site trials and meta-analysis provide direct evidence on heterogeneity across contexts.

Returning to the opening question: Knowing the average treatment effect leaves important questions unanswered. We need to know for whom effects are larger or smaller (heterogeneity) and whether findings apply elsewhere (generalization). Modern methods—from pre-specified subgroup analysis to causal forests to transportability theory—provide tools for these questions, but none eliminates the fundamental uncertainty about how effects vary and generalize.


Further Reading

Essential

  • Athey and Imbens (2017), "The State of Applied Econometrics: Causality and Policy Evaluation" - Heterogeneity section overview

  • Wager and Athey (2018), "Estimation and Inference of Heterogeneous Treatment Effects using Random Forests" - Causal forests

For Deeper Understanding

  • Heckman and Vytlacil (2005), "Structural Equations, Treatment Effects, and Econometric Policy Evaluation" - MTE framework

  • Pearl and Bareinboim (2014), "External Validity: From Do-Calculus to Transportability Across Populations" - Transportability theory

  • Künzel et al. (2019), "Metalearners for Estimating Heterogeneous Treatment Effects" - Meta-learner comparison

Advanced/Specialized

  • Athey, Tibshirani, and Wager (2019), "Generalized Random Forests" - GRF methodology

  • Stuart et al. (2011), "The Use of Propensity Scores to Assess the Generalizability of Results" - Transportability methods

  • Meager (2019), "Understanding the Average Impact of Microcredit Expansions" - Hierarchical models for heterogeneity

Applications

  • Chetty et al. (2014), "Where is the Land of Opportunity?" - Geographic heterogeneity in mobility

  • Burke et al. (2019), "Sustainable Climate Policy" - Heterogeneous climate impacts

  • Banerjee et al. (2019), "A Multi-Faceted Program Causes Lasting Progress" - Multi-site development RCT

Validating Heterogeneity Estimates

  • Imbens and Xu (2025), "Comparing Experimental and Nonexperimental Methods: What Lessons Have We Learned Four Decades after LaLonde (1986)?" JEP. Shows CATE recovery is harder than ATE recovery; observational CATT estimates diverge from experimental benchmarks even when ATT matches.

  • Chernozhukov et al. (2018), "Generic Machine Learning Inference on Heterogeneous Treatment Effects" - Best linear predictor and calibration tests for CATEs.


Exercises

Conceptual

  1. Explain the difference between testing for heterogeneity (interaction) and testing for effects within subgroups. Why does this distinction matter for inference?

  2. What is "honest estimation" in the context of causal forests? Why is it necessary? What happens without it?

  3. When does LATE (for compliers) fail to answer the policy-relevant question? Provide an example.

Applied

  1. Using data from an RCT with rich baseline covariates:

    • Conduct pre-specified subgroup analysis for 3-5 theory-motivated subgroups

    • Estimate a causal forest and identify important effect modifiers

    • Compare the two approaches: do they agree?

  2. You have an RCT result from urban schools. The policy will be implemented in rural schools. Describe how you would assess generalizability. What data would you need?

Discussion

  1. Some argue that heterogeneity analysis should be the primary focus of empirical work—average effects are uninformative for policy. Others argue that we rarely have power to detect heterogeneity and should focus on well-identified average effects. Where do you stand?
