Chapter 20: Heterogeneity and Generalization
Opening Question
If we know the average treatment effect, what more do we need to know, and what can we know, about for whom the treatment works and where else it would work?
Chapter Overview
The average treatment effect is an average. It tells us what happens on average when a population receives treatment. But averages conceal variation. A drug with no average effect might cure some patients and harm others. A policy that "works" on average might benefit the privileged while hurting the disadvantaged. Understanding heterogeneity—variation in treatment effects across individuals, groups, or contexts—is essential for targeting interventions and translating findings across settings.
This chapter addresses two related questions. First, for whom does the treatment work? This is the heterogeneity question, which we address through subgroup analysis and modern machine learning methods. Second, where else would it work? This is the generalization question, concerning external validity and transportability.
What you will learn:
The risks of subgroup analysis and how to do it responsibly
Machine learning methods for discovering heterogeneous treatment effects (causal forests)
The external validity problem and when findings generalize
Transportability methods for extrapolating to new populations
Multi-site trials and meta-analytic approaches to heterogeneity
Prerequisites: Chapter 9 (Causal Framework), Chapter 10 (Experimental Methods), Chapter 21 (ML for Causal Inference—may be read concurrently)
20.1 Why Heterogeneity Matters
Beyond Average Effects
Consider a job training program evaluated by RCT. The average treatment effect is a 5 percentage point increase in employment. This is policy-relevant, but incomplete:
Targeting: Should everyone receive training, or only those who benefit most? If effects are concentrated among high school dropouts, targeting them is more efficient.
Equity: Does the program reduce or exacerbate inequality? If it helps only the already-advantaged, equity concerns arise.
Mechanism: Why does it work? Heterogeneity by baseline characteristics may reveal mechanisms (Chapter 19).
Scaling: Will effects persist at scale? If early adopters differ from later participants, average effects may not generalize.
Types of Heterogeneity
Effect heterogeneity by observable characteristics: Effects vary by gender, age, education, baseline outcomes, etc.
Effect heterogeneity by context: Effects vary by location, time period, implementing organization, etc.
Essential heterogeneity: Effects vary by unobserved characteristics that also influence selection into treatment.
Distributional effects: Effects on different quantiles of the outcome distribution (e.g., minimum wage affects low-wage workers differently than high-wage).
20.2 The Conditional Average Treatment Effect
Definition
Definition 20.1 (Conditional Average Treatment Effect, CATE): The CATE is the average treatment effect for a subgroup defined by covariates X: τ(x)=E[Y(1)−Y(0)∣X=x]
The ATE is the average of CATE over the population distribution of X: ATE=E[τ(X)]=∫τ(x)dF(x)

Figure 20.1: The same ATE can arise from very different distributions of individual treatment effects. With homogeneous effects (left), the ATE is representative of most individuals. With heterogeneous effects (right), the ATE masks important variation—some individuals may be harmed while others greatly benefit.
Identification
If we have identified the ATE (via RCT, DiD, IV, etc.), we can often identify CATE by:
Stratification: Estimate effects separately within subgroups
Interaction models: Include treatment × covariate interactions in regression
Machine learning: Use flexible methods to estimate τ(x) as a function
Example: In an RCT, CATE is identified by: τ(x)=E[Y∣D=1,X=x]−E[Y∣D=0,X=x]
The same selection-on-observables, DiD, or IV assumptions that identify ATE also identify CATE if imposed conditional on X.
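A minimal sketch of the stratification approach in an RCT, using simulated data; the covariate, effect sizes, and sample size are hypothetical:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 5_000

# Simulated RCT where the effect is larger for the low-education subgroup.
educ = rng.choice(["hs", "college"], size=n)     # covariate X
d = rng.integers(0, 2, size=n)                   # randomized treatment
tau = np.where(educ == "hs", 2.0, 0.5)           # true CATEs
y = 1.0 + tau * d + rng.normal(size=n)           # outcome

df = pd.DataFrame({"y": y, "d": d, "educ": educ})

# CATE by stratification: difference in means within each stratum of X.
cate = (df.groupby(["educ", "d"])["y"].mean()
          .unstack("d")
          .pipe(lambda m: m[1] - m[0]))
print(cate)   # close to the true values: 2.0 for hs, 0.5 for college
```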
20.3 Subgroup Analysis: Risks and Best Practices
The Dangers of Subgroup Analysis
Subgroup analysis is notoriously unreliable. Classic problems:
1. Multiple testing
With 20 subgroups, we expect one false positive at α=0.05 even if no true heterogeneity exists. Researchers often test many subgroups and report only significant findings.
2. Specification searching
The definition of subgroups (age cutoffs, category groupings) can be manipulated to achieve significance.
3. Low power
Subgroups are smaller than the full sample. Effects that are significant overall may be insignificant within subgroups, leading to false conclusions of "no effect" for some groups.
4. Regression to the mean
Subgroups selected for having extreme outcomes in one period tend to have less extreme outcomes in another.
Sobering Evidence
The ISIS-2 trial tested aspirin for heart attack patients. In pre-specified subgroups, effects were consistent. But when analyzed by astrological sign, Geminis and Libras showed no benefit—a "finding" reflecting noise, not biology.
This illustrates: with enough subgroups, some will show spurious heterogeneity.
Best Practices
Principle 20.1 (Subgroup Analysis Guidelines):
Pre-specify subgroups before seeing outcomes
Limit the number of subgroups examined
Test for interaction, not just significance within subgroups
Adjust for multiple comparisons
Report all pre-specified subgroup analyses, not just significant ones
Treat exploratory subgroup analysis as hypothesis-generating, not confirmatory
Pre-specification: Register subgroup analyses in a pre-analysis plan. This prevents cherry-picking.
Interaction tests: Don't just report "the effect is significant for women but not for men." Test whether the effects differ significantly:
Yi=β0+β1Di+β2Femalei+β3(Di×Femalei)+εi
β3 tests heterogeneity; β1+β3 gives the female-specific effect.
Multiple comparison adjustments: Bonferroni, Benjamini-Hochberg, or other corrections for multiple testing.
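A sketch of the interaction test and a Benjamini-Hochberg adjustment using statsmodels; the data are simulated and the additional p-values are hypothetical placeholders standing in for other pre-specified subgroup tests:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(1)
n = 4_000
female = rng.integers(0, 2, size=n)
d = rng.integers(0, 2, size=n)
y = 0.5 * d + 0.3 * d * female + rng.normal(size=n)   # true interaction = 0.3
df = pd.DataFrame({"y": y, "d": d, "female": female})

# Interaction test: the coefficient on d:female is the heterogeneity test (beta_3).
fit = smf.ols("y ~ d * female", data=df).fit()
print(fit.params["d:female"], fit.pvalues["d:female"])

# Female-specific effect: beta_1 + beta_3, with its own test.
print(fit.t_test("d + d:female = 0"))

# With several pre-specified subgroup tests, adjust the p-values (here BH).
pvals = [fit.pvalues["d:female"], 0.04, 0.20]   # other entries: hypothetical tests
rejected, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
print(p_adj)
```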
20.4 Machine Learning for Heterogeneous Treatment Effects
The Promise of ML
Machine learning offers an alternative to pre-specified subgroup analysis: let the algorithm find heterogeneity.
Advantages:
Discovers interactions and nonlinearities humans wouldn't specify
Handles high-dimensional covariates
Provides principled variable importance measures
Challenges:
May overfit, finding spurious heterogeneity
"Black box" methods are hard to interpret
Standard ML optimizes prediction, not causal inference
Causal Forests (Generalized Random Forests)
Wager and Athey (2018) develop causal forests, random forests adapted for treatment effect estimation; Athey, Tibshirani, and Wager (2019) generalize the approach to a broader class of causal parameters.
Key idea: Standard random forests predict E[Y∣X]. Causal forests estimate τ(x)=E[Y(1)−Y(0)∣X=x] by:
Growing trees that partition the covariate space
At each leaf, estimating the treatment effect using observations in that leaf
Aggregating across trees (forest)
Splitting criterion: Instead of minimizing prediction error, splits maximize heterogeneity in treatment effects between child nodes.
Honest Estimation
A key innovation is honesty: the same data should not be used to both determine splits and estimate effects within leaves.
Definition 20.2 (Honest Estimation): An estimation procedure is honest if it uses different data for determining the model structure (splits) and estimating parameters (leaf effects).
Implementation: Split the sample. Use one portion to grow trees (determine structure). Use another portion to estimate treatment effects within leaves.
Why does this matter? Without honesty, trees overfit—they find splits that look like heterogeneity in sample but don't generalize.
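A minimal illustration of honesty, assuming a randomized treatment and simulated data. A single tree is grown on a transformed outcome as a crude stand-in for the causal splitting criterion; real causal forests split to maximize effect heterogeneity and aggregate many honest trees:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(2)
n = 4_000
X = rng.normal(size=(n, 3))
d = rng.integers(0, 2, size=n)
tau = np.where(X[:, 0] > 0, 2.0, 0.0)            # true CATE depends on X[0]
y = tau * d + rng.normal(size=n)

# Honesty: one half determines the tree structure, the other half
# estimates treatment effects within the resulting leaves.
half = n // 2
Xa, da, ya = X[:half], d[:half], y[:half]        # structure sample
Xb, db, yb = X[half:], d[half:], y[half:]        # estimation sample

# Transformed outcome: under randomization, E[pseudo | X] = tau(X).
p = da.mean()
pseudo = ya * (da - p) / (p * (1 - p))
tree = DecisionTreeRegressor(max_leaf_nodes=8, min_samples_leaf=200).fit(Xa, pseudo)

# Honest leaf estimates: difference in means within each leaf, sample B only.
leaves = tree.apply(Xb)
for leaf in np.unique(leaves):
    m = leaves == leaf
    est = yb[m & (db == 1)].mean() - yb[m & (db == 0)].mean()
    print(f"leaf {leaf}: tau_hat = {est:.2f}")
```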
The GRF Package
The grf package (R) implements generalized random forests; a Python analogue is sketched after the list below.
What you get:
CATE estimates τ^(xi) for each observation
Confidence intervals (via jackknife)
Variable importance scores
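The chapter's reference implementation is R's grf. For Python readers, one analogue is econml's CausalForestDML; this package choice is our assumption, not the chapter's, and the sketch below uses simulated data:

```python
# A Python analogue of R's grf (library choice is our assumption).
import numpy as np
from econml.dml import CausalForestDML
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

rng = np.random.default_rng(3)
n = 2_000
X = rng.normal(size=(n, 5))
d = rng.integers(0, 2, size=n)
y = np.where(X[:, 0] > 0, 2.0, 0.0) * d + rng.normal(size=n)

cf = CausalForestDML(
    model_y=RandomForestRegressor(min_samples_leaf=20),
    model_t=RandomForestClassifier(min_samples_leaf=20),
    discrete_treatment=True,
    n_estimators=500,
    random_state=0,
)
cf.fit(y, d, X=X)

tau_hat = cf.effect(X)                        # CATE estimate per observation
lo, hi = cf.effect_interval(X, alpha=0.05)    # pointwise confidence intervals
print(tau_hat[:5], lo[:5], hi[:5])
```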
Interpreting Results
Causal forests produce individual-level effect estimates. How to summarize?
Best Linear Predictor (BLP): Regress τ^(Xi) on Xi to find which variables drive heterogeneity.
Calibration test: Check whether τ^(X) actually predicts variation in effects (a sketch follows this list): Yi=α+β⋅τ^(Xi)⋅Di+γDi+δXi+εi
If β=1, CATE estimates are well-calibrated.
Targeting: Use τ^(X) to identify high-benefit groups for treatment targeting.
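A standalone sketch of the calibration regression above and a simple BLP, assuming CATE estimates tau_hat are already in hand (here simulated as noisy versions of the truth):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(9)
n = 4_000
X = rng.normal(size=(n, 3))
d = rng.integers(0, 2, size=n)
tau = 1.0 + X[:, 0]                                  # true CATE
y = tau * d + X[:, 1] + rng.normal(size=n)
tau_hat = tau + rng.normal(0, 0.3, size=n)           # stand-in for forest CATEs

# Calibration regression from the text: Y = a + b*tau_hat*D + g*D + d'X + e.
Z = sm.add_constant(np.column_stack([tau_hat * d, d, X]))
fit = sm.OLS(y, Z).fit(cov_type="HC1")
print(f"beta = {fit.params[1]:.2f}")                 # well-calibrated => near 1
print(fit.t_test("x1 = 1"))                          # formal test of beta = 1

# BLP heuristic: which covariates drive the estimated heterogeneity?
blp = sm.OLS(tau_hat, sm.add_constant(X)).fit(cov_type="HC1")
print(blp.params)                                    # loads on X[0] by construction
```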
Box: Heterogeneous Effects Are Harder to Recover Than Average Effects
A sobering finding from the LaLonde literature: even when modern methods successfully recover average treatment effects, they may fail to recover heterogeneous effects.
Imbens & Xu (2025) demonstrate this vividly. Using LaLonde data, they estimate both the ATT (average treatment effect on the treated) and CATTs (conditional average treatment effects on the treated)—effects for subgroups defined by baseline characteristics.
For ATT: Several modern methods (matching, IPW, AIPW, causal forests) produce estimates close to the experimental benchmark of roughly $1,800, especially after trimming for overlap.
For CATTs: The picture is starkly different. Scatter plots of experimental vs. nonexperimental CATT estimates show weak correlation. Subgroups where the experimental estimate is positive often have negative nonexperimental estimates, and vice versa.
Why the divergence? Average effects allow errors to cancel: overestimates for some subgroups offset underestimates for others. Conditional effects require getting each subgroup right separately—a much harder task. Small amounts of unobserved confounding that wash out in the average can create large errors in subgroup-specific estimates.
Implications for practice:
Successful ATT estimation doesn't validate CATE estimation
Be cautious about targeting interventions based on estimated CATEs from observational data
Validation (placebo tests, held-out data, experimental benchmarks) becomes even more important for heterogeneous effects
When possible, estimate heterogeneity within experimental data, not by extending observational methods
20.5 Other ML Approaches
Meta-Learners
Meta-learners use any ML algorithm as a base learner and construct treatment effect estimates from its predictions; a minimal sketch follows the list below.
T-learner:
Fit μ^1(x)=E[Y∣X=x,D=1] on treated observations
Fit μ^0(x)=E[Y∣X=x,D=0] on control observations
τ^(x)=μ^1(x)−μ^0(x)
S-learner:
Fit μ^(x,d)=E[Y∣X=x,D=d] including treatment as a covariate
τ^(x)=μ^(x,1)−μ^(x,0)
X-learner (Künzel et al. 2019):
Fit T-learner models
Impute treatment effects for each observation
Fit models for imputed effects
Weight by propensity score
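A minimal sklearn sketch of the T- and S-learners, with simulated data and hypothetical effect sizes:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(4)
n = 4_000
X = rng.normal(size=(n, 4))
d = rng.integers(0, 2, size=n)
y = (1.0 + X[:, 0]) * d + X[:, 1] + rng.normal(size=n)   # true CATE = 1 + X[0]

# T-learner: separate outcome models by arm.
mu1 = GradientBoostingRegressor().fit(X[d == 1], y[d == 1])
mu0 = GradientBoostingRegressor().fit(X[d == 0], y[d == 0])
tau_t = mu1.predict(X) - mu0.predict(X)

# S-learner: one model with treatment included as a feature.
mu = GradientBoostingRegressor().fit(np.column_stack([X, d]), y)
tau_s = (mu.predict(np.column_stack([X, np.ones(n)]))
         - mu.predict(np.column_stack([X, np.zeros(n)])))

# Both should correlate with the true CATE, 1 + X[0].
print(np.corrcoef(tau_t, 1 + X[:, 0])[0, 1],
      np.corrcoef(tau_s, 1 + X[:, 0])[0, 1])
```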
Double Machine Learning
Double ML (Chapter 21) can estimate heterogeneous effects, as sketched after this list, by:
Using ML to estimate nuisance functions E[Y∣X] and E[D∣X]
Estimating CATE in a second stage using the orthogonalized residuals
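One concrete version of this two-stage idea, in the spirit of the R-learner (our construction, not necessarily the chapter's exact estimator): cross-fit the nuisances, then regress the outcome residual on the treatment residual interacted with covariates:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(5)
n = 4_000
X = rng.normal(size=(n, 4))
e = 1 / (1 + np.exp(-X[:, 1]))                 # propensity depends on X
d = rng.binomial(1, e)
y = (1.0 + X[:, 0]) * d + X[:, 1] + rng.normal(size=n)   # true CATE = 1 + X[0]

# Step 1: cross-fitted nuisance estimates of E[Y|X] and E[D|X].
m_hat = cross_val_predict(RandomForestRegressor(min_samples_leaf=20), X, y, cv=5)
e_hat = cross_val_predict(RandomForestRegressor(min_samples_leaf=20), X, d, cv=5)

# Step 2: regress the Y residual on (D residual) x [1, X]; the coefficients
# describe how the treatment effect varies with X.
ry, rd = y - m_hat, d - e_hat
Z = rd[:, None] * np.column_stack([np.ones(n), X])
fit = LinearRegression(fit_intercept=False).fit(Z, ry)
print(fit.coef_)   # approximately [1, 1, 0, 0, 0]: tau(x) ~ 1 + x0
```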
Bayesian Approaches
BART (Bayesian Additive Regression Trees) provides:
Posterior distributions for treatment effects
Natural uncertainty quantification
Handles heterogeneity through tree ensemble
20.6 Policy Learning: From CATE to Optimal Treatment
Estimating treatment effect heterogeneity is valuable, but the ultimate question is: Who should we treat? Policy learning translates CATE estimates into actionable treatment assignment rules.
The Policy Learning Problem
Given estimated τ^(x), we want to find a treatment rule π(x)∈{0,1} that maximizes expected welfare:
π∗=argmaxπE[π(X)⋅τ(X)]
Subject to constraints:
Budget: E[π(X)]≤B (can only treat fraction B)
Fairness: Treatment cannot depend on protected characteristics
Interpretability: Rule must be explainable
Simple Policy Rules
Threshold rules: Treat if τ^(X)>c
For a budget constraint treating fraction B (a sketch follows these steps):
Estimate τ^(Xi) for all units
Find c∗ such that P(τ^(X)>c∗)=B
Treat units with τ^(Xi)>c∗
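A numpy sketch of the budgeted threshold rule; the CATE estimates are simulated, and treating them as the truth in the welfare comparison is deliberately optimistic:

```python
import numpy as np

rng = np.random.default_rng(6)
tau_hat = rng.normal(1.0, 2.0, size=10_000)   # hypothetical CATE estimates

B = 0.30                                      # budget: treat 30% of units
c_star = np.quantile(tau_hat, 1 - B)          # threshold with P(tau_hat > c*) = B
treat = tau_hat > c_star
print(f"c* = {c_star:.2f}, share treated = {treat.mean():.2f}")

# Naive welfare comparison (optimistic: treats tau_hat as if it were the truth).
print(f"targeted: {np.where(treat, tau_hat, 0.0).mean():.2f} "
      f"vs treat-all: {tau_hat.mean():.2f}")
```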
Welfare Analysis
The grf package also provides tools for formal welfare comparisons between candidate targeting policies.
Policy Trees
For interpretable rules, policy trees learn an optimal treatment assignment rule in the form of a shallow decision tree (a rough illustration follows Figure 20.2 below).
Advantages of policy trees:
Human-interpretable (can explain to stakeholders/regulators)
Transparent about which characteristics drive targeting
Can incorporate fairness constraints directly

Figure 20.2: A policy tree translates CATE estimates into actionable treatment rules. The tree identifies subgroups with different treatment effects: young workers with low education benefit most from job training (CATE = +0.8), while older workers with prior employment may be harmed (CATE = -0.1). Such trees are interpretable and can be explained to stakeholders.
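Policy trees are implemented in the R policytree package and related software. As a rough illustration only, the sketch below fits a shallow classification tree to the sign of the estimated CATE; this mimics the flavor of a policy tree but is not the empirical-welfare optimization those packages perform. The data and feature names are hypothetical:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(7)
n = 5_000
X = rng.normal(size=(n, 3))
# Hypothetical CATE estimates that vary with the first and third covariates.
tau_hat = 2.0 * (X[:, 0] > 0) - 0.5 * (X[:, 2] > 1) + rng.normal(0, 0.3, size=n)

# Depth-2 tree approximating "treat if estimated CATE is positive".
rule = DecisionTreeClassifier(max_depth=2).fit(X, (tau_hat > 0).astype(int))
print(export_text(rule, feature_names=["age", "education", "prior_employment"]))
```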
When Policy Learning Makes Sense
| Situation | Use policy learning? |
| --- | --- |
| Treatment is costly, effects vary | Yes: targeting saves resources |
| Universal treatment is feasible/desirable | Maybe: equity may trump efficiency |
| Fairness constraints are important | Yes: constraints can be built in directly |
| Need interpretable rules for implementation | Yes: use policy trees |
| Just want to understand who benefits | CATE estimation may suffice |
Cautions
Warning: Estimation Error Propagates
Policy learning uses estimated τ^(X), which has error. High-variance CATE estimates lead to noisy treatment rules. Always:
Report confidence intervals on expected welfare
Compare to simpler policies (treat all, treat none)
Consider robustness to CATE estimation method
Ethical considerations:
Targeting based on race, gender, or other protected characteristics may be illegal or unethical even if "optimal"
Algorithmic assignment may lack transparency—stakeholders may object
Trade-off between efficiency (treat those who benefit most) and equity (treat those most in need)
20.7 The External Validity Problem
What Is External Validity?
Definition 20.3 (External Validity): A finding has external validity if the causal relationship identified in the study generalizes to other populations, settings, treatments, or outcomes.
Internal validity: Did we correctly identify the effect in this study? External validity: Does that effect apply elsewhere?
A study can have perfect internal validity (well-identified effect in sample) but poor external validity (effect doesn't generalize).
Sources of External Validity Failure
1. Population differences: The study sample differs from the target population.
RCTs often use convenience samples (college students, one clinic's patients)
Treatment effects may vary with characteristics uncommon in the study
2. Setting differences: Context matters.
A tutoring program works in well-resourced schools; would it work in under-resourced schools?
Labor market programs depend on labor market conditions
3. Treatment differences: The studied treatment differs from the deployed treatment.
Efficacy trials (ideal conditions) vs. effectiveness (real-world conditions)
Small-scale pilots vs. at-scale implementation
4. Time differences: Effects may change over time.
Technology evolves
Populations adapt
Equilibrium effects emerge at scale
The LATE Problem Revisited
Instrumental variables identify the Local Average Treatment Effect (LATE)—the effect for compliers. But compliers may differ from:
Always-takers: Who would take treatment regardless
Never-takers: Who would never take treatment
The policy target population: Who would be affected by a proposed policy
The Heckman critique emphasizes that different instruments identify different LATEs, and none may be policy-relevant.
Box: LATE Weights and Heterogeneity
When treatment effects are heterogeneous, IV estimates are weighted averages of individual treatment effects, with weights determined by how much each individual's treatment status responds to the instrument.
The weighting formula (Angrist & Imbens 1995, Mogstad & Torgovitsky 2018):
βIV = E[ωi⋅τi] / E[ωi]
where:
τi is individual i's treatment effect
ωi is i's weight, proportional to how strongly the instrument shifts their treatment
Implications:
1. Different instruments, different estimates A quarter-of-birth instrument weights compliers who respond to compulsory schooling laws. A distance-to-college instrument weights compliers who respond to college proximity. These are different people, so estimates differ even if both instruments are valid.
2. Marginal vs. inframarginal IV weights those at the margin of treatment. If a job training program serves the most motivated (always-takers) and the least motivated are never-takers, IV identifies effects for the middle—those who would enroll if slots are available. This may or may not be policy-relevant.
3. Negative weights are possible With multiple treatments or complex designs, some individuals can receive negative weights—their treatment effects are subtracted from the weighted average. This can produce estimates outside the range of individual effects.
Connecting to CATE estimation: If you've estimated τ^(X) via causal forests (Section 20.4), you can:
Identify likely compliers based on characteristics
Compare τ^(X) for compliers vs. always/never-takers
Assess whether IV-identified effects plausibly generalize
Practical guidance: When reporting IV estimates, characterize the complier population. Who are these marginal individuals? Are they representative of those a policy would target? If not, acknowledge the limitation.
20.8 Transportability
Formalizing Generalization
Transportability theory (Pearl and Bareinboim 2014) formalizes when and how findings generalize across populations.
Setting: We have data from a study population S and want to estimate effects in a target population T.
Key question: Under what conditions can we "transport" the causal effect from S to T?

Figure 20.3: The transportability challenge. Findings from a source population (e.g., an RCT sample) may not apply to a target population (e.g., the policy-relevant population) if the populations differ on effect modifiers. The key questions are: what determines selection into each population, and do those factors modify the treatment effect?
When Can We Transport?
Case 1: Random sampling
If the study is a random sample from the target population, study findings apply directly (subject to sampling error).
Case 2: Selection on observables
If the study differs from the target on observed covariates X, but X captures all relevant effect modifiers: τT=∫τS(x)dFT(x)
We reweight the study effects by the target population distribution of X.
Case 3: Selection on unobservables
If the study differs on unobserved effect modifiers, transportability fails without additional assumptions.
Practical Approaches
1. Reweighting: Estimate τ(x) in the study, then reweight to the target (a numeric sketch follows this list):
τ^T = (1/NT) ∑i∈S τ^(xi)⋅wi
where wi adjusts for the difference between the study and target covariate distributions.
2. Modeling heterogeneity: Estimate CATE as a function of effect modifiers; apply to target population characteristics.
3. Sensitivity analysis: Assess how much unobserved effect modification would be needed to change conclusions.
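A numeric sketch of reweighting with a single discrete effect modifier; the shares and CATEs are the hypothetical values used in the worked example below:

```python
import numpy as np

rng = np.random.default_rng(8)
n = 5_000
# Study sample S: education shares 60/40; hypothetical CATE estimates by group.
hs = rng.random(n) < 0.60                        # 1 = high school, 0 = college
tau_hat = np.where(hs, 2500.0, 1200.0)

# Target T has shares 80/20, so w(x) = f_T(x) / f_S(x).
w = np.where(hs, 0.80 / 0.60, 0.20 / 0.40)
print(np.mean(w * tau_hat), tau_hat.mean())      # ~2240 (transported) vs ~1980
```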
Worked Example: Transporting a Job Training Effect
A job training RCT was conducted in urban centers. We want to predict the effect in rural areas.
Study population (urban):
60% high school graduates, 40% college graduates
Average baseline earnings: $35,000
Target population (rural):
80% high school graduates, 20% college graduates
Average baseline earnings: $28,000
Estimated CATEs from study (by education):
High school: τ^HS=$2,500 (SE = $400)
College: τ^College=$1,200 (SE = $350)
Step 1: Check if education is an effect modifier
The effects differ by education (p < 0.05 for difference), so we should transport using stratified effects.
Step 2: Reweight to target population
τ^Rural=0.80×$2,500+0.20×$1,200=$2,240
Compare to naive study average: τ^Urban=0.60×$2,500+0.40×$1,200=$1,980
Step 3: Propagate uncertainty
SE(τ^Rural) = √(0.80²×400² + 0.20²×350²) ≈ $328
95% CI: approximately [$1,598, $2,882]
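A quick numeric check of Steps 2 and 3, assuming the two subgroup CATE estimates are independent:

```python
import numpy as np

shares = np.array([0.80, 0.20])        # rural shares: high school, college
cate = np.array([2500.0, 1200.0])      # study CATEs by education
se = np.array([400.0, 350.0])          # standard errors of the CATEs

tau_rural = shares @ cate                           # Step 2: 2240.0
se_rural = np.sqrt(np.sum(shares**2 * se**2))       # Step 3: ~327.6
print(tau_rural, se_rural,
      tau_rural + np.array([-1.96, 1.96]) * se_rural)   # ~[1598, 2882]
```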
Step 4: Sensitivity to unobserved modifiers
We assumed education captures all relevant differences. But rural workers might differ in:
Labor market conditions (fewer employers)
Job types available
Transportation constraints
If these factors reduce effectiveness by 20%, the transported effect would be $2,240×0.80=$1,792—still positive but smaller.
Conclusion: The transported estimate ($2,240) is higher than the urban estimate because rural areas have more high school graduates, who benefit more from training. But this relies on education fully capturing heterogeneity—an assumption we should probe with sensitivity analysis.
20.9 Multi-Site Trials and Meta-Analysis
Learning from Multiple Studies
When multiple studies (sites, replications) address the same question, we can:
Pool results for more precise average effects
Examine cross-site heterogeneity
Identify effect moderators
Multi-Site RCTs
Multi-site trials randomize treatment at multiple locations, then examine:
Average effect: Pooled across sites
Heterogeneity: Variation in site-specific effects
Moderators: Site characteristics that predict larger/smaller effects
Example: The STAR class size experiment randomized class sizes across Tennessee schools. Site-specific effects varied, with larger effects in schools serving disadvantaged students.
Meta-Analysis for Heterogeneity
Meta-analysis pools estimates across studies. Random-effects models allow for heterogeneity:
τ^j ∼ N(τj, sj²)
τj ∼ N(μ, τ²)
where:
τ^j is the estimate from study j
τj is the true effect in study j
μ is the overall average effect
τ2 is between-study variance (heterogeneity)
I² statistic: Proportion of total variance due to between-study heterogeneity: I² = τ² / (τ² + s̄²), where s̄² is the typical within-study sampling variance.
High I2 (>50%) indicates substantial heterogeneity.
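A minimal sketch of random-effects pooling via the DerSimonian-Laird estimator, with hypothetical study estimates:

```python
import numpy as np

# Hypothetical study estimates and standard errors.
tau_hat = np.array([0.20, 0.35, 0.10, 0.50, 0.25])
se = np.array([0.08, 0.10, 0.07, 0.12, 0.09])

# DerSimonian-Laird estimate of between-study variance tau^2.
w_fe = 1 / se**2
mu_fe = np.sum(w_fe * tau_hat) / np.sum(w_fe)
Q = np.sum(w_fe * (tau_hat - mu_fe) ** 2)
k = len(tau_hat)
c = np.sum(w_fe) - np.sum(w_fe**2) / np.sum(w_fe)
tau2 = max(0.0, (Q - (k - 1)) / c)

# Random-effects pooled estimate and I^2.
w_re = 1 / (se**2 + tau2)
mu_re = np.sum(w_re * tau_hat) / np.sum(w_re)
I2 = max(0.0, (Q - (k - 1)) / Q)
print(f"mu = {mu_re:.3f}, tau^2 = {tau2:.4f}, I^2 = {I2:.0%}")
```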
Meta-Regression
If we observe study characteristics Zj, we can estimate:
τj=β0+β1Zj+uj
This identifies study-level moderators: which study features predict larger effects?
20.10 Running Example: Returns to Education
Heterogeneity in Returns
The average return to education masks substantial variation:
By education level: Returns to completing high school may differ from returns to college.
By ability: High-ability students may have higher returns (complementarity) or lower returns (already would have succeeded).
By family background: Returns may be higher for disadvantaged students (marginal students).
By time period: Returns have increased over recent decades.
By country: Returns vary with labor market institutions and skill premiums.
Evidence
Causal forest approaches (Bertrand et al., various contexts):
Find that returns are higher for individuals from disadvantaged backgrounds
This suggests compulsory schooling (which affects marginal students) has above-average returns
LATE vs. ATE:
IV estimates (using compulsory schooling) identify effects for compliers—students at the margin
These may exceed OLS estimates if returns are higher for marginal students
Cross-country variation:
Psacharopoulos and Patrinos compile returns across countries
Returns are generally higher in developing countries
But heterogeneity within countries is substantial
Implications for Generalization
Findings from one context may not transport:
U.S. returns may not apply to European labor markets
Returns for college-educated may not apply to high school
Historical returns may not predict future returns as education expands
Understanding heterogeneity helps assess when findings generalize.
Practical Guidance
When to Investigate Heterogeneity
| Situation | Approach |
| --- | --- |
| Policy requires targeting | Estimate CATE; identify high-benefit groups |
| Equity concerns | Examine effects by disadvantaged status |
| Scaling planned | Assess heterogeneity by early/late adopter characteristics |
| Generalization needed | Estimate moderators; model transportability |
| Exploratory analysis | Use ML methods; treat as hypothesis-generating |
Common Pitfalls
Pitfall 1: Subgroup fishing. Searching through subgroups until finding significance, then presenting the result as confirmatory.
How to avoid: Pre-register subgroup analyses; adjust for multiple comparisons; report all analyses.
Pitfall 2: Confusing "not significant in subgroup" with "no effect". Small subgroups have low power; failure to reject the null does not mean there is no effect.
How to avoid: Test for interaction (the difference between subgroups), not just within-subgroup significance.
Pitfall 3: Over-interpreting ML heterogeneity. Causal forests will find heterogeneity; some of it is noise.
How to avoid: Use honest estimation; conduct calibration tests; replicate in held-out data.
Pitfall 4: Ignoring external validity. A precisely estimated effect in one context may not apply elsewhere.
How to avoid: Discuss which effect modifiers might differ across contexts; conduct sensitivity analysis; seek multi-site evidence.
Implementation Checklist
Pre-specify subgroups and register the analysis plan before seeing outcomes
Test for interactions, not just within-subgroup significance; adjust for multiple comparisons
Use honest estimation for ML-based CATEs; run calibration tests
Validate heterogeneity estimates in held-out or experimental data
Characterize the population the estimates apply to (study sample, compliers)
For generalization, identify effect modifiers, reweight to the target, and report sensitivity analyses
Integration Note
Connections to Other Methods
| Topic | Connection | Chapter |
| --- | --- | --- |
| Mechanisms | Heterogeneity may reveal mechanisms | Ch. 19 |
| Experiments | Multi-arm RCTs can test heterogeneity | Ch. 10 |
| ML for Causal Inference | Causal forests, DML for CATE | Ch. 21 |
| Evidence Synthesis | Meta-analysis pools heterogeneous estimates | Ch. 24 |
Triangulation Strategies
Evidence for heterogeneity is stronger when:
Multiple methods agree: Subgroup analysis, causal forests, interaction tests
Theoretical prediction: Heterogeneity aligns with theory
Pre-specification: Analysis was planned before seeing results
Replication: Heterogeneity pattern replicates in new data
Mechanism evidence: Heterogeneity explained by plausible mechanisms
Summary
Key takeaways:
Heterogeneity matters for targeting, equity, mechanism, and generalization. Average effects conceal important variation.
Subgroup analysis is dangerous due to multiple testing and specification searching. Pre-specify, test for interactions, and adjust for multiplicity.
Machine learning methods (causal forests, meta-learners) can discover heterogeneity, but require honest estimation and careful interpretation.
External validity concerns whether findings generalize. Population, setting, treatment, and time differences all threaten generalization.
Transportability formalizes when we can extrapolate. Reweighting and effect modifier modeling help, but sensitivity analysis is essential.
Multi-site trials and meta-analysis provide direct evidence on heterogeneity across contexts.
Returning to the opening question: Knowing the average treatment effect leaves important questions unanswered. We need to know for whom effects are larger or smaller (heterogeneity) and whether findings apply elsewhere (generalization). Modern methods—from pre-specified subgroup analysis to causal forests to transportability theory—provide tools for these questions, but none eliminates the fundamental uncertainty about how effects vary and generalize.
Further Reading
Essential
Athey and Imbens (2017), "The State of Applied Econometrics: Causality and Policy Evaluation" - Heterogeneity section overview
Wager and Athey (2018), "Estimation and Inference of Heterogeneous Treatment Effects using Random Forests" - Causal forests
For Deeper Understanding
Heckman and Vytlacil (2005), "Structural Equations, Treatment Effects, and Econometric Policy Evaluation" - MTE framework
Pearl and Bareinboim (2014), "External Validity: From Do-Calculus to Transportability Across Populations" - Transportability theory
Künzel et al. (2019), "Metalearners for Estimating Heterogeneous Treatment Effects" - Meta-learner comparison
Advanced/Specialized
Athey, Tibshirani, and Wager (2019), "Generalized Random Forests" - GRF methodology
Stuart et al. (2011), "The Use of Propensity Scores to Assess the Generalizability of Results" - Transportability methods
Meager (2019), "Understanding the Average Impact of Microcredit Expansions" - Hierarchical models for heterogeneity
Applications
Chetty et al. (2014), "Where is the Land of Opportunity?" - Geographic heterogeneity in mobility
Burke et al. (2019), "Sustainable Climate Policy" - Heterogeneous climate impacts
Banerjee et al. (2019), "A Multi-Faceted Program Causes Lasting Progress" - Multi-site development RCT
Validating Heterogeneity Estimates
Imbens & Xu (2025), "Comparing Experimental and Nonexperimental Methods: What Lessons Have We Learned Four Decades after LaLonde (1986)?" JEP. Shows CATE recovery is harder than ATE recovery; observational CATT estimates diverge from experimental benchmarks even when ATT matches.
Chernozhukov et al. (2018), "Generic Machine Learning Inference on Heterogeneous Treatment Effects" - Best linear predictor and calibration tests for CATEs.
Exercises
Conceptual
Explain the difference between testing for heterogeneity (interaction) and testing for effects within subgroups. Why does this distinction matter for inference?
What is "honest estimation" in the context of causal forests? Why is it necessary? What happens without it?
When does LATE (for compliers) fail to answer the policy-relevant question? Provide an example.
Applied
Using data from an RCT with rich baseline covariates:
Conduct pre-specified subgroup analysis for 3-5 theory-motivated subgroups
Estimate a causal forest and identify important effect modifiers
Compare the two approaches: do they agree?
You have an RCT result from urban schools. The policy will be implemented in rural schools. Describe how you would assess generalizability. What data would you need?
Discussion
Some argue that heterogeneity analysis should be the primary focus of empirical work—average effects are uninformative for policy. Others argue that we rarely have power to detect heterogeneity and should focus on well-identified average effects. Where do you stand?