Chapter 21: Machine Learning for Causal Inference
Opening Question
Machine learning excels at prediction—can it also help us answer causal questions, and if so, how?
Chapter Overview
Machine learning has transformed prediction. From image recognition to language translation to demand forecasting, ML methods achieve unprecedented accuracy by learning complex patterns from data. But prediction and causation are fundamentally different. Predicting who will default on a loan is not the same as knowing whether denying them credit would change their behavior. Predicting which patients will die is not the same as knowing whether a treatment would save them.
This chapter explores the intersection of machine learning and causal inference. The key insight is that ML methods can help with causal inference—but only in specific ways and under specific conditions. We'll see how ML can estimate nuisance functions (propensity scores, outcome predictions) that are then used for causal estimation, how causal forests discover heterogeneous treatment effects, and when prediction itself solves the policy problem without requiring causal knowledge.
What you will learn:
Why prediction and causation require different approaches
ML fundamentals relevant to causal inference (regularization, cross-validation, ensembles)
Double/debiased machine learning for treatment effect estimation
Causal forests for heterogeneous effects (extending Chapter 20)
Targeted learning (TMLE) and its advantages
When prediction problems substitute for causal questions
Prerequisites: Chapter 11 (Selection on Observables), Chapter 20 (Heterogeneity), basic familiarity with regression and machine learning concepts
21.1 Prediction vs. Causation
The Fundamental Difference
Prediction asks: Given what I observe about this unit, what outcome do I expect? E[Y∣X=x]
Causation asks: If I intervene to change treatment, what happens to outcomes? E[Y∣do(D=1)]−E[Y∣do(D=0)]
These are different questions with different answers.
Example: Hospital ICU admission
Prediction: Patients in the ICU have higher mortality. Seeing someone in the ICU predicts death.
Causation: ICU admission probably reduces mortality—that's why we admit people. The ICU is correlated with death because sick people go there, not because it kills them.
A model that predicts "ICU → death" is correct for prediction but wrong for the policy question "should we expand ICU capacity?"
Why Standard ML Fails for Causation
Standard ML minimizes prediction error: $\min_f \, E[(Y - f(X,D))^2]$
This learns associations, including those driven by confounding. If D is correlated with Y through confounders, ML learns that correlation—useful for prediction, misleading for causation.
Regularization makes it worse: Regularization (LASSO, Ridge) shrinks coefficients toward zero based on predictive value, not causal importance. A treatment effect that's real but noisy gets shrunk; a confounder that's predictive but non-causal gets kept.
Where ML Can Help
ML excels at learning complex functions from data. In causal inference, we often need to estimate:
Propensity scores: P(D=1∣X)—the probability of treatment given covariates
Outcome models: E[Y∣X,D]—expected outcomes conditional on covariates and treatment
Heterogeneous effects: τ(x)=E[Y(1)−Y(0)∣X=x]—how effects vary with covariates
These are prediction problems! ML can estimate them well, and the estimates feed into causal inference methods.
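Both nuisance functions are ordinary supervised learning tasks. A minimal sketch in Python (the simulated data and the choice of gradient boosting are illustrative, not prescriptive):

```python
# Nuisance functions as prediction problems: a sketch on simulated data.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 5))
D = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))   # treatment depends on X
Y = 2.0 * D + X[:, 0] + rng.normal(size=n)        # outcome confounded by X

# Propensity score P(D=1|X): a classification problem
m_hat = GradientBoostingClassifier().fit(X, D).predict_proba(X)[:, 1]

# Outcome model E[Y|X,D]: a regression problem
XD = np.column_stack([X, D])
g_hat = GradientBoostingRegressor().fit(XD, Y).predict(XD)
```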
21.2 ML Essentials for Causal Inference
Regularization
High-dimensional covariates create overfitting risk. Regularization penalizes model complexity.
LASSO (L1 penalty): $\min_\beta \sum_i (Y_i - X_i'\beta)^2 + \lambda \sum_j |\beta_j|$
Shrinks coefficients; sets some to exactly zero
Performs variable selection
Useful when many covariates, few are important
Ridge (L2 penalty): $\min_\beta \sum_i (Y_i - X_i'\beta)^2 + \lambda \sum_j \beta_j^2$
Shrinks all coefficients toward zero
Keeps all variables; none exactly zero
Handles multicollinearity well
Elastic Net (combined): $\min_\beta \sum_i (Y_i - X_i'\beta)^2 + \lambda_1 \sum_j |\beta_j| + \lambda_2 \sum_j \beta_j^2$
Figure 21.1: Regularization shrinks treatment effects toward zero. As the penalty parameter λ increases (moving right), both LASSO and Ridge shrink the treatment coefficient away from its true value. The shaded region shows the bias introduced by regularization. This is why naive application of penalized regression to causal inference produces biased estimates.
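A short simulation makes the figure's point concrete. The data-generating process below is hypothetical; the true treatment coefficient is 1.0, and increasing the LASSO penalty shrinks the estimate away from it:

```python
# Sketch of Figure 21.1's point: the L1 penalty shrinks a real treatment
# effect toward zero. Simulated data; the true effect on D is 1.0.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, p = 500, 50
X = rng.normal(size=(n, p))
D = (X[:, 0] + rng.normal(size=n) > 0).astype(float)   # confounded treatment
Y = 1.0 * D + X[:, 0] + rng.normal(size=n)

XD = np.column_stack([D, X])                           # D is the first column
for lam in [0.001, 0.01, 0.1, 0.5]:
    coef_D = Lasso(alpha=lam).fit(XD, Y).coef_[0]
    print(f"lambda={lam:>5}: treatment coefficient = {coef_D:.3f}")
# As lambda grows, the coefficient on D is shrunk below the true value 1.0.
```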
Cross-Validation
How do we choose tuning parameters (λ, tree depth, etc.)?
K-fold cross-validation:
Split data into K folds
For each fold k: train on other K-1 folds, evaluate on fold k
Average performance across folds
Choose parameters that minimize average error
This prevents overfitting to the training data.
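For example, sklearn's LassoCV runs exactly this procedure over a grid of penalty values (a sketch; the data are simulated):

```python
# Choosing lambda by 5-fold cross-validation with sklearn's LassoCV.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 20))
Y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=500)

lasso = LassoCV(cv=5).fit(X, Y)        # alpha_ is the CV-selected penalty
print("lambda chosen by 5-fold CV:", lasso.alpha_)
```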
Tree-Based Methods
Decision trees: Recursively partition covariate space; predict constant within each partition (leaf).
Random forests: Average predictions across many trees, each trained on bootstrap samples with random feature subsets. Reduces variance, handles interactions and nonlinearities.
Gradient boosting (XGBoost, LightGBM): Sequentially fit trees to residuals, building up an ensemble. Often achieves best predictive performance.
Why These Methods Matter for Causal Inference
Causal inference methods often require estimating nuisance functions:
Propensity scores in IPW
Outcome models in AIPW
Both in double ML
ML methods estimate these flexibly without imposing restrictive functional forms. This reduces bias from model misspecification.
21.3 Double/Debiased Machine Learning
The Problem with Naive ML
Suppose we want to estimate the effect of D on Y controlling for X. Naive approach:
Use ML to predict Y from (X,D)
Read off the coefficient on D
This fails because:
Regularization biases the treatment coefficient
Variable selection may drop D if it's noisy
ML optimizes prediction, not causal identification
The Orthogonalization Insight
Chernozhukov et al. (2018) develop Double/Debiased Machine Learning (DML). The key insight: make the treatment effect estimator orthogonal to errors in nuisance function estimation.
Frisch-Waugh-Lovell intuition: In linear regression, the coefficient on D can be recovered in three steps:
Regress D on X, get residuals $\tilde{D}$
Regress Y on X, get residuals $\tilde{Y}$
Regress $\tilde{Y}$ on $\tilde{D}$
The coefficient on $\tilde{D}$ is the treatment effect, purged of X's influence.
DML extends this to ML:
Use ML to predict D from X: $\hat{m}(X) = E[D \mid X]$
Use ML to predict Y from X: $\hat{g}(X) = E[Y \mid X]$
Compute residuals: $\tilde{D} = D - \hat{m}(X)$, $\tilde{Y} = Y - \hat{g}(X)$
Estimate treatment effect from residuals
Sample Splitting
A key innovation: use different samples to estimate nuisance functions and treatment effects.
Why? If we use the same data for both:
Overfitting in $\hat{m}(X)$ or $\hat{g}(X)$ biases the treatment effect
Even with orthogonalization, bias persists
Cross-fitting procedure:
Split sample into K folds
For each fold k:
Train $\hat{m}(X)$ and $\hat{g}(X)$ on the other folds
Compute residuals on fold k
Estimate treatment effect using all residuals
This eliminates overfitting bias while using all data efficiently.
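The orthogonalization and cross-fitting steps combine into a short script. Here is a minimal cross-fitted DML sketch for the partially linear model $Y = \theta D + g(X) + \varepsilon$; the data-generating process and model choices are illustrative, so plug in your own (Y, D, X):

```python
# Cross-fitted DML for the partially linear model, following the
# residual-on-residual recipe above. True theta = 0.5.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

rng = np.random.default_rng(3)
n = 2000
X = rng.normal(size=(n, 10))
D = X[:, 0] + rng.normal(size=n)               # treatment confounded by X
Y = 0.5 * D + np.sin(X[:, 0]) + rng.normal(size=n)

D_res, Y_res = np.zeros(n), np.zeros(n)
for train, test in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    # Nuisance functions are trained on the other folds (cross-fitting)
    m_hat = RandomForestRegressor().fit(X[train], D[train])
    g_hat = RandomForestRegressor().fit(X[train], Y[train])
    D_res[test] = D[test] - m_hat.predict(X[test])
    Y_res[test] = Y[test] - g_hat.predict(X[test])

# Final stage: regress residualized outcome on residualized treatment
theta_hat = np.sum(D_res * Y_res) / np.sum(D_res ** 2)
print("DML estimate of theta:", theta_hat)     # close to the true 0.5
```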
The DML Estimator
Definition 21.1 (Double Machine Learning Estimator): Under unconfoundedness given X, the DML estimator of the ATE is:
$$\hat{\tau}_{DML} = \frac{1}{n}\sum_i \left[ \hat{g}_1(X_i) - \hat{g}_0(X_i) + \frac{D_i \left(Y_i - \hat{g}_1(X_i)\right)}{\hat{m}(X_i)} - \frac{(1-D_i)\left(Y_i - \hat{g}_0(X_i)\right)}{1 - \hat{m}(X_i)} \right]$$
where $\hat{g}_d(X) = E[Y \mid X, D = d]$ and $\hat{m}(X) = P(D=1 \mid X)$ are estimated by ML with cross-fitting.
Intuition: This is the augmented inverse probability weighted (AIPW) estimator from Chapter 11, with nuisance functions estimated by ML.
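As a function, the estimator of Definition 21.1 is a few lines, assuming the cross-fitted nuisance estimates are already in hand. The clipping of $\hat{m}$ below is a common practical safeguard against extreme propensity scores, not part of the definition:

```python
# The AIPW/DML formula of Definition 21.1, given cross-fitted nuisances.
import numpy as np

def aipw_ate(Y, D, g1, g0, m, clip=0.01):
    """ATE as the mean of the doubly robust score; m clipped for overlap."""
    m = np.clip(m, clip, 1 - clip)
    score = (g1 - g0
             + D * (Y - g1) / m
             - (1 - D) * (Y - g0) / (1 - m))
    tau = score.mean()
    se = score.std(ddof=1) / np.sqrt(len(Y))   # influence-function-based SE
    return tau, se
```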
Properties
Double robustness: Consistent if either $\hat{m}(X)$ or $\hat{g}(X)$ is consistent (but not necessarily both).
Root-n consistency: Under regularity conditions, $\hat{\tau}_{DML}$ is $\sqrt{n}$-consistent and asymptotically normal, even though the nuisance functions converge at slower (ML) rates.
Valid inference: Standard errors and confidence intervals are valid despite ML estimation of nuisances.
When to Use DML
Good candidates:
High-dimensional covariates (many potential confounders)
Complex relationships between covariates and outcomes
Selection on observables is credible
Standard parametric models seem too restrictive
Poor candidates:
Few covariates (standard regression suffices)
Unconfoundedness is not credible (need IV, DiD, etc.)
Sample size is small (ML needs data)
21.4 Causal Forests
From Chapter 20 to Here
Chapter 20 introduced causal forests for heterogeneous treatment effect estimation. Here we develop the methodology more fully.
The Causal Forest Algorithm
Goal: Estimate τ(x)=E[Y(1)−Y(0)∣X=x] for any x.
Approach: Adapt random forests to target treatment effects rather than outcomes.
Key modifications:
Splitting criterion: Instead of minimizing outcome variance, maximize heterogeneity in treatment effects between child nodes.
Honest estimation: Use separate samples for determining splits and estimating leaf effects.
Local centering: Remove E[Y∣X] and E[D∣X] before estimation (orthogonalization).
The Generalized Random Forest Framework
Athey, Tibshirani, and Wager (2019) develop Generalized Random Forests (GRF), which nest causal forests as a special case.
General setup: Estimate a parameter θ(x) defined by a local moment condition: E[ψ(Y,D,θ(x))∣X=x]=0
For treatment effects: $\psi(Y, D, \theta) = (Y - D\theta - \mu_0)(D - p)$, where $\mu_0 = E[Y(0) \mid X]$ and $p = P(D=1 \mid X)$.
Forest weighting: The forest produces weights $\alpha_i(x)$ indicating how much observation i contributes to estimating $\theta(x)$; $\hat{\theta}(x)$ solves the weighted moment condition $\sum_i \alpha_i(x)\, \psi(Y_i, D_i, \theta) = 0$.
Observations in the same leaf as x get high weight; distant observations get low weight.
Inference
Asymptotic normality: Under regularity conditions: $\frac{\hat{\tau}(x) - \tau(x)}{\hat{\sigma}(x)} \xrightarrow{d} N(0,1)$
Variance estimation: GRF provides variance estimates via infinitesimal jackknife, enabling confidence intervals.
Honest inference: Sample splitting ensures valid inference even with adaptive splitting.
Practical Usage
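A hedged usage sketch, assuming the Python econml package (pip install econml) and its CausalForestDML estimator; the API evolves, so verify the call signatures against the current documentation. The grf package in R is the reference implementation.

```python
# Causal forest usage sketch with econml (API assumed from its docs).
import numpy as np
from econml.dml import CausalForestDML

rng = np.random.default_rng(4)
n = 2000
X = rng.normal(size=(n, 5))
D = rng.binomial(1, 0.5, size=n)
Y = (1 + X[:, 0]) * D + X[:, 1] + rng.normal(size=n)   # true CATE = 1 + x0

cf = CausalForestDML(discrete_treatment=True, n_estimators=500, random_state=0)
cf.fit(Y, D, X=X)                    # local centering and honesty are internal
tau_hat = cf.effect(X)               # CATE estimates tau_hat(x)
lo, hi = cf.effect_interval(X, alpha=0.05)   # pointwise confidence intervals
```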
21.5 Targeted Learning (TMLE)
Motivation
Double ML and causal forests are regression-based. Targeted Maximum Likelihood Estimation (TMLE), developed by van der Laan and colleagues, takes a different approach: it "targets" the initial ML estimates toward the specific causal parameter of interest.
The Core Intuition: Why TMLE Works
ML algorithms optimize for prediction accuracy—minimizing squared error across all observations. But we care about a specific causal parameter (the ATE). These goals don't align perfectly.
The problem: A model that predicts outcomes well overall might be biased for the specific comparison we need (treated vs. control outcomes at the same covariate values).
TMLE's solution: Start with an ML prediction, then adjust it specifically to reduce bias for the causal estimand. The adjustment uses the propensity score to identify where the initial model is most likely wrong for causal purposes—specifically, at covariate values where treatment is rare.
Analogy: Imagine you're estimating average height in a population, but your sample over-represents tall people. TMLE doesn't just reweight—it adjusts the underlying height predictions themselves, pushing them in the direction that corrects for the sampling imbalance. The "clever covariate" tells you which direction to push and by how much.
Why "targeted"? The adjustment is targeted at reducing bias for your specific estimand (e.g., ATE), not improving general prediction accuracy. Different estimands get different adjustments.
The TMLE Procedure (Simplified)
Step 1: Initial estimates
Estimate outcome model: $\hat{Q}(D,X) = E[Y \mid D, X]$
Estimate propensity score: $\hat{g}(X) = P(D=1 \mid X)$
Use any ML method (ensemble, neural network, etc.).
Step 2: Clever covariate. Define the "clever covariate": $H(D,X) = \frac{D}{\hat{g}(X)} - \frac{1-D}{1-\hat{g}(X)}$
Step 3: Targeting step. Update the initial estimate by regressing: $Y = \hat{Q}(D,X) + \epsilon \cdot H(D,X) + \text{residual}$
The coefficient $\hat{\epsilon}$ "targets" the estimate toward the causal parameter.
Step 4: Updated predictions: $\hat{Q}^*(D,X) = \hat{Q}(D,X) + \hat{\epsilon} \cdot H(D,X)$
Step 5: Parameter estimate: $\hat{\tau}_{TMLE} = \frac{1}{n}\sum_i \left[ \hat{Q}^*(1, X_i) - \hat{Q}^*(0, X_i) \right]$
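The five steps translate directly into code. A minimal sketch on simulated data, using the simplified linear fluctuation above; production TMLE implementations instead fluctuate on the logit scale to keep predictions bounded, and estimate nuisances with cross-fitting:

```python
# Simplified TMLE on simulated data (linear fluctuation; true ATE = 1).
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor

rng = np.random.default_rng(5)
n = 2000
X = rng.normal(size=(n, 5))
D = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))
Y = D + X[:, 0] + rng.normal(size=n)

# Step 1: initial ML estimates (cross-fitting omitted for brevity)
XD = np.column_stack([X, D])
Q = GradientBoostingRegressor().fit(XD, Y)
g = GradientBoostingClassifier().fit(X, D).predict_proba(X)[:, 1]
g = np.clip(g, 0.01, 0.99)                         # overlap safeguard

Q_obs = Q.predict(XD)
Q1 = Q.predict(np.column_stack([X, np.ones(n)]))
Q0 = Q.predict(np.column_stack([X, np.zeros(n)]))

# Step 2: clever covariate H(D, X)
H = D / g - (1 - D) / (1 - g)

# Step 3: targeting -- regress the residual on H (no intercept) for epsilon
eps = np.sum(H * (Y - Q_obs)) / np.sum(H ** 2)

# Step 4: updated predictions under D=1 and D=0
Q1_star = Q1 + eps * (1 / g)
Q0_star = Q0 + eps * (-1 / (1 - g))

# Step 5: plug-in ATE
print("TMLE ATE estimate:", np.mean(Q1_star - Q0_star))
```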
Properties
Double robustness: Like AIPW, consistent if either outcome model or propensity score is consistent.
Efficiency: Achieves the semiparametric efficiency bound under correct specification.
Bounded estimates: Unlike IPW, TMLE produces bounded estimates even with extreme propensity scores.
Inference: Influence function-based standard errors provide valid confidence intervals.
Super Learner
TMLE is often combined with Super Learner—an ensemble method that combines multiple ML algorithms:
Specify a library of learners (LASSO, random forest, neural net, etc.)
Use cross-validation to find optimal weights for combining them
Final prediction is weighted average of learners
This provides robust nuisance function estimation without choosing a single method.
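A rough stand-in for Super Learner in Python is sklearn's StackingRegressor, which combines a library of learners via cross-validated predictions; unlike the canonical Super Learner, it does not constrain the combination weights to be non-negative and sum to one:

```python
# A Super Learner stand-in using sklearn's stacking ensemble.
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LassoCV, LinearRegression

library = [("lasso", LassoCV(cv=5)),
           ("forest", RandomForestRegressor(n_estimators=200))]
super_learner = StackingRegressor(estimators=library,
                                  final_estimator=LinearRegression(), cv=5)
# super_learner.fit(X, Y) can then serve as the nuisance estimator in TMLE/DML
```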
TMLE vs. DML
| | DML | TMLE |
| --- | --- | --- |
| Approach | Orthogonalization | Targeting |
| Software | DoubleML (R/Python) | tmle, tmle3 (R) |
| Flexibility | Any ML | Super Learner typical |
| Inference | Asymptotic | Influence function |
| Bounded | No (can extrapolate) | Yes |
In practice, both work well; choice often depends on software familiarity.
21.6 Prediction Policy Problems
When Prediction Solves the Problem
Not all policy questions require causal inference. Some are prediction policy problems (Kleinberg et al. 2015): the policy action depends on predicting an outcome, not on knowing causal effects.
Example: Bail decisions
A judge must decide whether to release a defendant before trial. The relevant question is: "Will this defendant flee or commit a crime if released?"
This is prediction: P(flee∣X) where X includes defendant characteristics.
The causal question—"would releasing this person cause them to flee?"—is not needed. We want to predict behavior, not change it through the decision itself.
Characteristics of Prediction Policy Problems
Action doesn't affect outcome: The policy acts on a prediction, not by changing the outcome-generating process.
Outcome is observable for some: We observe outcomes for people who were released/hired/approved, enabling supervised learning.
Selection bias is the main concern: We only observe outcomes for selected individuals, creating a missing data problem.
Examples
Hiring: Predict job performance from application materials. Hiring doesn't change a person's inherent ability.
Medical testing: Predict disease presence to decide who to test further. Testing reveals disease; it doesn't cause it.
Fraud detection: Predict which transactions are fraudulent. Flagging doesn't change whether fraud occurred.
Targeting: Predict who will benefit most from an intervention, given we know the intervention works.
When Prediction Isn't Enough
Example: Advertising
Should we show an ad to this user? The prediction question is: "Will they buy if shown the ad?" But we need the causal question: "Would showing the ad cause them to buy (who wouldn't have bought otherwise)?"
Showing ads to people who would buy anyway wastes money. We need the incremental (causal) effect.
Rule of thumb: If the decision itself changes outcomes, you need causation. If the decision merely acts on an existing state, prediction may suffice.
21.7 When ML Helps and When It Doesn't
Where ML Adds Value
1. High-dimensional confounding Many potential confounders; unclear which matter. ML selects and adjusts without pre-specifying.
2. Complex functional forms Nonlinear relationships, interactions. ML learns these from data.
3. Heterogeneity discovery Unknown effect modifiers. Causal forests find them.
4. Propensity score estimation Overlap matters; ML can estimate propensity scores flexibly.
Where ML Doesn't Help
1. Identification ML doesn't solve selection bias. If unconfoundedness fails, ML estimates are still biased—just with better nuisance function estimation.
Key point: ML improves estimation of identified parameters. It does not identify parameters that aren't identified.
2. Small samples ML methods need data. With small samples, simpler models often perform better.
3. Interpretability Understanding why something works may require simpler, interpretable models.
4. Strong prior knowledge If you know the correct functional form, parametric methods are more efficient than ML.
The Credibility Revolution Meets ML
The credibility revolution (Chapter 1) emphasizes research design over statistical methods. ML doesn't change this:
A well-designed RCT with simple analysis beats a poorly designed observational study with fancy ML
ML complements good design; it doesn't substitute for it
Transparency about assumptions matters more than methodological sophistication
21.8 Causal Discovery: Learning Structure from Data
The Promise and the Problem
Most of this chapter assumes the causal structure is known: we know which variables confound, which mediate, which are colliders. Methods like DML and causal forests estimate effects given this structure. But can we learn the causal structure itself from data?
Causal discovery (also called causal structure learning) attempts exactly this. The field emerged from computer science and philosophy, particularly the work of Spirtes, Glymour, Scheines, and Pearl in the 1990s.
The appeal is obvious: instead of assuming a DAG, learn it. But the fundamental challenge is severe:
Observational Equivalence Problem
Multiple DAGs can generate identical observational distributions. These form Markov equivalence classes. Without additional assumptions or interventions, we can at best identify the equivalence class, not the true DAG.
For example, these three DAGs are Markov equivalent (they imply identical conditional independence relationships):
X→Y→Z (chain)
X←Y→Z (fork)
X←Y←Z (reverse chain)
Observational data alone cannot distinguish them.
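A quick simulation illustrates the point: the chain and the fork below have parameters chosen to produce identical covariance structures, so no observational test can separate them.

```python
# Chain X -> Y -> Z and fork X <- Y -> Z with matched covariance structure.
import numpy as np

rng = np.random.default_rng(6)
n = 100_000

# Chain: X -> Y -> Z
X = rng.normal(size=n)
Y = X + rng.normal(size=n)
Z = Y + rng.normal(size=n)
print("chain corr(X, Z):", np.corrcoef(X, Z)[0, 1])    # about 0.577

# Fork: X <- Y -> Z, variances chosen to match the chain's moments
Y2 = rng.normal(scale=np.sqrt(2), size=n)
X2 = Y2 / 2 + rng.normal(scale=np.sqrt(0.5), size=n)
Z2 = Y2 + rng.normal(size=n)
print("fork  corr(X, Z):", np.corrcoef(X2, Z2)[0, 1])  # also about 0.577
# In both models X and Z are marginally dependent and independent given Y.
```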
Constraint-Based Methods
Constraint-based algorithms use conditional independence tests to learn causal structure:
PC Algorithm (named for its creators Peter Spirtes and Clark Glymour):
Start with a complete undirected graph (all variables connected)
Remove edges between conditionally independent variables
Orient edges using v-structure patterns (colliders are identifiable)
Propagate orientation constraints
The PC algorithm outputs a CPDAG (completed partially directed acyclic graph) representing the Markov equivalence class—some edges directed, some undetermined.
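A hedged usage sketch, assuming the Python causal-learn package (pip install causal-learn); the import path and pc() call follow its documentation, but verify against the current release:

```python
# Running PC on simulated chain data with causal-learn (API assumed).
import numpy as np
from causallearn.search.ConstraintBased.PC import pc

rng = np.random.default_rng(7)
n = 5000
X = rng.normal(size=n)
Y = X + rng.normal(size=n)
Z = Y + rng.normal(size=n)
data = np.column_stack([X, Y, Z])

cg = pc(data, alpha=0.05)   # conditional independence tests -> CPDAG
print(cg.G)                 # the estimated equivalence class, not a unique DAG
```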
FCI (Fast Causal Inference) extends PC to handle latent confounders, outputting a PAG (partial ancestral graph) that represents uncertainty about both edge direction and latent variables.
Strengths:
Principled; exploits conditional independence structure
Handles high-dimensional data with proper sparsity assumptions
Implementations available (pcalg in R, causal-learn in Python)
Limitations:
Requires the faithfulness assumption: all independencies in the data are due to the DAG structure (no "accidental" cancellations)
Sensitive to errors in conditional independence testing
Cannot orient all edges (equivalence class problem)
Assumes causal sufficiency (no latent confounders) for basic PC
Score-Based Methods
Score-based algorithms search over DAG space to optimize a scoring criterion:
Define a score (e.g., BIC, BGe score for Bayesian scoring)
Search over possible DAGs
Return highest-scoring DAG(s)
GES (Greedy Equivalence Search) efficiently searches by adding then removing edges, operating on equivalence classes.
Strengths:
Can incorporate prior knowledge through priors
Naturally handles model comparison
Less sensitive to individual test errors than constraint-based
Limitations:
Computationally intensive (DAG space is super-exponential in variables)
Score equivalence: Markov equivalent DAGs have equal scores
May find local optima
Modern Approaches
Recent work combines ideas and enables scaling:
NOTEARS (Zheng et al. 2018) treats structure learning as continuous optimization:
Parameterize the adjacency matrix as continuous
Add an acyclicity constraint (via trace of matrix exponential)
Optimize with gradient descent
This enables scaling to hundreds of variables and integration with deep learning.
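The acyclicity constraint itself is simple to compute: $h(W) = \mathrm{tr}\left(e^{W \circ W}\right) - d$ equals zero exactly when the weighted adjacency matrix $W$ encodes an acyclic graph. A small check:

```python
# NOTEARS acyclicity penalty: zero iff the graph encoded by W is acyclic.
import numpy as np
from scipy.linalg import expm

def notears_h(W):
    d = W.shape[0]
    return np.trace(expm(W * W)) - d   # elementwise square, matrix exponential

acyclic = np.array([[0.0, 1.0], [0.0, 0.0]])   # edge 1 -> 2 only
cyclic = np.array([[0.0, 1.0], [1.0, 0.0]])    # 1 <-> 2 cycle
print(notears_h(acyclic))   # 0.0
print(notears_h(cyclic))    # positive, so the constraint is violated
```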
Causal discovery with interventional data: When we observe some experimental interventions, the equivalence class shrinks. Active learning approaches design interventions to maximally resolve structural uncertainty.
Connection to Econometrics
Economists have generally been skeptical of purely algorithmic structure learning, preferring theory-guided identification. But connections exist:
David Hendry's general-to-specific modeling shares the spirit of algorithmic structure learning (see Chapter 16). Start with a general model, test down to a parsimonious specification. The Autometrics algorithm automates this with economic theory providing constraints.
DAG + theory: Some researchers use DAGs to represent theoretical structure, then test implied conditional independencies. Rejection suggests model misspecification. This uses causal discovery tools for model diagnostics rather than discovery.
Semi-parametric bounds: When structure is partially known, causal discovery can narrow the set of consistent structures, tightening partial identification bounds.
When Is Causal Discovery Useful?
Appropriate uses:
Exploratory analysis: Generating hypotheses about causal structure
Model diagnostics: Testing whether assumed structure is consistent with data
Very high-dimensional settings: When theory provides insufficient guidance
Complement to theory: Cross-check theoretical models against data patterns
Inappropriate uses:
Substitute for theory: Algorithms cannot replace substantive knowledge
Definitive causal conclusions: Equivalence classes, faithfulness violations, and finite-sample error limit certainty
Without domain expertise: Results require expert interpretation
Practical Guidance
Treat causal discovery as hypothesis-generating, not hypothesis-confirming. Use it to suggest structures for further investigation, not to establish causation. Always assess whether algorithmic output makes substantive sense.
For Further Study
Spirtes, Glymour & Scheines (2000), Causation, Prediction, and Search, 2nd ed. The foundational text.
Peters, Janzing & Schölkopf (2017), Elements of Causal Inference. Modern treatment from ML perspective.
Heinze-Deml, Maathuis & Meinshausen (2018), "Causal Structure Learning," Annual Review of Statistics. Accessible survey.
Practical Guidance
Choosing a Method
| Goal | Method |
| --- | --- |
| ATE with high-dimensional X | DML or TMLE |
| Heterogeneous effects | Causal forests (GRF) |
| Best prediction of outcomes | Super Learner |
| Few covariates, clear model | Standard regression |
| Pure prediction policy problem | Standard ML |
| Identification concerns | Fix design first, then consider ML |
| Structure unknown, exploratory | Causal discovery (hypothesis-generating) |
Common Pitfalls
Pitfall 1: Using ML to "control for" unobserved confounders ML learns from observed data. If confounders are unobserved, ML cannot adjust for them.
How to avoid: Be clear about what ML does and doesn't do. It estimates functions of observed variables; it doesn't observe the unobserved.
Pitfall 2: Treating ML output as causal without design "The random forest says D is important for predicting Y" is not a causal statement.
How to avoid: Use causal ML methods (DML, causal forests) with appropriate identification assumptions, not off-the-shelf predictive ML.
Pitfall 3: Ignoring sample splitting Using the same data to estimate nuisance functions and treatment effects biases inference.
How to avoid: Always use cross-fitting or sample splitting in DML/TMLE.
Pitfall 4: Black-box heterogeneity Causal forest finds heterogeneity, but results may be hard to interpret or validate.
How to avoid: Validate with held-out data; test calibration; examine which variables drive heterogeneity.
Implementation Checklist
Before reporting ML-based causal estimates:
State the identification assumption (e.g., unconfoundedness given X) explicitly
Check overlap by inspecting the distribution of estimated propensity scores
Use cross-fitting for all nuisance function estimates
Check that estimates are stable across ML learners (LASSO, forests, boosting)
Compare against a simple parametric benchmark
For heterogeneous effects, use honest estimation and run calibration tests
Running Example: Returns to Education with ML
The Setting
Estimating returns to education with high-dimensional controls:
Many potential confounders: family background, ability proxies, location, cohort
Possible nonlinear and interaction effects
Unknown functional form for propensity score and outcome model
Traditional Approach
OLS with researcher-chosen controls. Risk: omit important confounders, misspecify functional form.
DML Approach
Use LASSO/random forest to estimate E[Y∣X] (earnings given all covariates except education)
Use LASSO/random forest to estimate E[D∣X] (education given covariates)
Orthogonalize and estimate treatment effect on residuals
Advantages: Flexibly controls for many covariates; doesn't require choosing specification.
Finding: Belloni et al. (2014) apply DML-type methods to returns to education, finding estimates similar to careful OLS but with more robust standard errors.
Causal Forest for Heterogeneity
Estimate causal forest with education as treatment, earnings as outcome
Examine CATE across covariate space
Identify who benefits most from additional education
Finding: Returns may be higher for disadvantaged students (who are marginal), suggesting education has redistributive potential.
Caveats
ML doesn't solve the fundamental identification problem:
If ability is unobserved and affects both education and earnings, ML estimates are biased
IV or other designs still needed for credible identification
ML complements but doesn't replace research design
Integration Note
Connections to Other Methods
| Method | Connection | Chapter |
| --- | --- | --- |
| Selection on Observables | DML/TMLE extend these methods with ML | Ch. 11 |
| Heterogeneity | Causal forests for CATE | Ch. 20 |
| IV | Can combine DML with IV for LATE | Ch. 12 |
| DiD | ML for covariate adjustment in DiD | Ch. 13 |
Triangulation Strategies
ML-based estimates gain credibility when:
Comparison with traditional methods: DML and OLS give similar estimates
Robustness to ML method: Results stable across LASSO, random forest, boosting
Validation on held-out data: Predictions generalize
Calibration tests: CATE estimates actually predict effect variation
Design-based evidence: ML estimates align with RCT or natural experiment findings
Summary
Key takeaways:
Prediction ≠ Causation: Standard ML learns associations, not causal effects. Using ML output as causal requires additional structure.
ML for nuisance functions: ML excels at estimating propensity scores and outcome models—the "nuisance" functions in causal inference.
Double ML uses orthogonalization and sample splitting to get valid causal estimates despite using ML for nuisance functions.
Causal forests extend random forests to estimate heterogeneous treatment effects, with honest estimation enabling valid inference.
TMLE targets initial ML estimates toward the causal parameter, providing doubly robust and efficient estimation.
Prediction policy problems sometimes substitute for causal questions—but only when the decision doesn't change the outcome-generating process.
ML complements, doesn't replace, research design: Good identification is still essential. ML improves estimation of identified parameters; it doesn't identify the unidentified.
Returning to the opening question: Machine learning can help answer causal questions, but not by treating causation as prediction. The contribution of ML is in flexibly estimating the nuisance functions (propensity scores, outcome models) that feed into causal estimators, and in discovering heterogeneous effects. The hard work of identification—ensuring we compare like with like—still requires research design.
Further Reading
Essential
Athey and Imbens (2019), "Machine Learning Methods That Economists Should Know About" - Overview for economists
Chernozhukov et al. (2018), "Double/Debiased Machine Learning" - DML methodology
For Deeper Understanding
Wager and Athey (2018), "Estimation and Inference of Heterogeneous Treatment Effects using Random Forests" - Causal forests
van der Laan and Rose (2011), Targeted Learning - TMLE textbook
Athey, Tibshirani, and Wager (2019), "Generalized Random Forests" - GRF framework
Advanced/Specialized
Kennedy (2022), "Semiparametric Doubly Robust Targeted Double Machine Learning" - Combining approaches
Künzel et al. (2019), "Metalearners for Estimating Heterogeneous Treatment Effects" - Meta-learner comparison
Nie and Wager (2021), "Quasi-Oracle Estimation of Heterogeneous Treatment Effects" - R-learner
Applications
Kleinberg et al. (2015), "Prediction Policy Problems" - When prediction suffices
Belloni et al. (2014), "High-Dimensional Methods and Inference on Structural and Treatment Effects" - LASSO for treatment effects
Davis and Heller (2017), "Using Causal Forests to Predict Treatment Heterogeneity" - Causal forests in practice
Imbens & Xu (2025), "Comparing Experimental and Nonexperimental Methods" JEP. Demonstrates DML, AIPW-GRF, and causal forests on LaLonde data; shows these methods estimate ATT well with overlap but struggle with CATE. Tutorial and replication data at https://yiqingxu.org/tutorials/lalonde/.
Exercises
Conceptual
Explain why regularization (LASSO, Ridge) applied to a regression of Y on (X,D) does not give a valid estimate of the causal effect of D. What goes wrong?
What is the role of sample splitting in Double ML? What would happen without it?
Distinguish between a prediction policy problem and a causal inference problem. Give an example of each in the context of healthcare.
Applied
Using a dataset with treatment, outcome, and many covariates:
Estimate the ATE using OLS with researcher-selected controls
Estimate using DML with LASSO for nuisance functions
Estimate using DML with random forest for nuisance functions
Compare estimates and standard errors across methods
Implement a causal forest on a dataset with known heterogeneity:
Estimate CATEs
Identify the most important variables for heterogeneity
Test calibration: do high-$\hat{\tau}(x)$ individuals actually have larger effects?
Discussion
A researcher argues: "ML is a black box. I'd rather use simple methods I understand than complex methods that might be doing something I don't understand." Another responds: "Simple methods impose functional form assumptions that are surely wrong. ML is more honest about our ignorance." Who is right?
Appendix 21A: The Neyman Orthogonality Condition
Why Orthogonality Matters
The key to DML is the Neyman orthogonality condition: the influence function for the treatment effect should be orthogonal to the nuisance function estimation error.
Formally: Let θ be the treatment effect and η be nuisance parameters (propensity score, outcome model). The estimating equation ψ(W;θ,η) satisfies Neyman orthogonality if:
$$\left. \frac{\partial}{\partial \eta} E\left[\psi(W; \theta_0, \eta)\right] \right|_{\eta = \eta_0} = 0$$
Intuition: Small errors in estimating η don't bias estimation of θ.
The AIPW Moment
The AIPW moment function:
$$\psi(W; \theta, g, \mu_0, \mu_1) = \mu_1(X) - \mu_0(X) + \frac{D\left(Y - \mu_1(X)\right)}{g(X)} - \frac{(1-D)\left(Y - \mu_0(X)\right)}{1 - g(X)} - \theta$$
satisfies Neyman orthogonality with respect to (g,μ0,μ1).
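To see why, consider perturbing $\mu_1$ in a direction $h$ while holding the other nuisances at their true values (the checks for $\mu_0$ and $g$ are analogous):

$$\frac{\partial}{\partial t} E\big[\psi(W; \theta_0, g_0, \mu_0, \mu_1 + t h)\big]\Big|_{t=0} = E\left[h(X)\left(1 - \frac{D}{g_0(X)}\right)\right] = E\left[h(X)\left(1 - \frac{E[D \mid X]}{g_0(X)}\right)\right] = 0,$$

since $g_0(X) = E[D \mid X]$. First-order errors in the nuisance functions thus do not move the moment condition.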
This is why DML based on AIPW achieves $\sqrt{n}$-consistency even with slower-converging ML estimates of the nuisance functions.