Chapter 21: Machine Learning for Causal Inference

Opening Question

Machine learning excels at prediction—can it also help us answer causal questions, and if so, how?


Chapter Overview

Machine learning has transformed prediction. From image recognition to language translation to demand forecasting, ML methods achieve unprecedented accuracy by learning complex patterns from data. But prediction and causation are fundamentally different. Predicting who will default on a loan is not the same as knowing whether denying them credit would change their behavior. Predicting which patients will die is not the same as knowing whether a treatment would save them.

This chapter explores the intersection of machine learning and causal inference. The key insight is that ML methods can help with causal inference—but only in specific ways and under specific conditions. We'll see how ML can estimate nuisance functions (propensity scores, outcome predictions) that are then used for causal estimation, how causal forests discover heterogeneous treatment effects, and when prediction itself solves the policy problem without requiring causal knowledge.

What you will learn:

  • Why prediction and causation require different approaches

  • ML fundamentals relevant to causal inference (regularization, cross-validation, ensembles)

  • Double/debiased machine learning for treatment effect estimation

  • Causal forests for heterogeneous effects (extending Chapter 20)

  • Targeted learning (TMLE) and its advantages

  • When prediction problems substitute for causal questions

Prerequisites: Chapter 11 (Selection on Observables), Chapter 20 (Heterogeneity), basic familiarity with regression and machine learning concepts


21.1 Prediction vs. Causation

The Fundamental Difference

Prediction asks: Given what I observe about this unit, what outcome do I expect? $E[Y | X = x]$

Causation asks: If I intervene to change treatment, what happens to outcomes? $E[Y | do(D = 1)] - E[Y | do(D = 0)]$

These are different questions with different answers.

Example: Hospital ICU admission

Prediction: Patients in the ICU have higher mortality. Seeing someone in the ICU predicts death.

Causation: ICU admission probably reduces mortality—that's why we admit people. The ICU is correlated with death because sick people go there, not because it kills them.

A model that predicts "ICU → death" is correct for prediction but wrong for the policy question "should we expand ICU capacity?"

Why Standard ML Fails for Causation

Standard ML minimizes prediction error: $\min_f E[(Y - f(X, D))^2]$

This learns associations, including those driven by confounding. If $D$ is correlated with $Y$ through confounders, ML learns that correlation—useful for prediction, misleading for causation.

Regularization makes it worse: LASSO and Ridge shrink coefficients toward zero based on predictive value, not causal importance. A treatment effect that's real but noisy gets shrunk; a confounder that's predictive but non-causal gets kept.

Where ML Can Help

ML excels at learning complex functions from data. In causal inference, we often need to estimate:

  1. Propensity scores: $P(D = 1 | X)$—the probability of treatment given covariates

  2. Outcome models: $E[Y | X, D]$—expected outcomes conditional on covariates and treatment

  3. Heterogeneous effects: $\tau(x) = E[Y(1) - Y(0) | X = x]$—how effects vary with covariates

These are prediction problems! ML can estimate them well, and the estimates feed into causal inference methods.


21.2 ML Essentials for Causal Inference

Regularization

High-dimensional covariates create overfitting risk. Regularization penalizes model complexity.

LASSO (L1 penalty): $\min_\beta \sum_i (Y_i - X_i'\beta)^2 + \lambda \sum_j |\beta_j|$

  • Shrinks coefficients; sets some to exactly zero

  • Performs variable selection

  • Useful when there are many covariates but only a few matter

Ridge (L2 penalty): $\min_\beta \sum_i (Y_i - X_i'\beta)^2 + \lambda \sum_j \beta_j^2$

  • Shrinks all coefficients toward zero

  • Keeps all variables; none exactly zero

  • Handles multicollinearity well

Elastic Net (combined): $\min_\beta \sum_i (Y_i - X_i'\beta)^2 + \lambda_1 \sum_j |\beta_j| + \lambda_2 \sum_j \beta_j^2$

Figure 21.1: Regularization shrinks treatment effects toward zero. As the penalty parameter $\lambda$ increases (moving right), both LASSO and Ridge shrink the treatment coefficient away from its true value. The shaded region shows the bias introduced by regularization. This is why naive application of penalized regression to causal inference produces biased estimates.
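
To see the shrinkage concretely, here is a minimal simulation (using scikit-learn; the data-generating process and the true effect of 2.0 are illustrative): even with a randomized treatment, LASSO pulls the treatment coefficient toward zero as the penalty grows.

```python
# Minimal sketch: regularization shrinks a real treatment effect.
# All names and the true effect (2.0) are illustrative.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 500, 50
X = rng.normal(size=(n, p))                  # noise covariates
D = rng.binomial(1, 0.5, size=n)             # randomized treatment
Y = 2.0 * D + X[:, 0] + rng.normal(size=n)   # true effect = 2.0

features = np.column_stack([D, X])
for lam in [0.01, 0.1, 0.5, 1.0]:
    model = Lasso(alpha=lam).fit(features, Y)
    print(f"lambda={lam:4.2f}  treatment coef={model.coef_[0]:.3f}")
# The coefficient on D moves toward zero as lambda increases,
# even though D is randomized and its effect is real.
```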

Cross-Validation

How do we choose tuning parameters ($\lambda$, tree depth, etc.)?

K-fold cross-validation:

  1. Split data into K folds

  2. For each fold k: train on other K-1 folds, evaluate on fold k

  3. Average performance across folds

  4. Choose parameters that minimize average error

This prevents overfitting to the training data.
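
A minimal sketch of penalty selection by K-fold cross-validation, using scikit-learn's GridSearchCV on illustrative data:

```python
# Minimal sketch of K-fold cross-validation for choosing the LASSO penalty.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 30))
Y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=400)

search = GridSearchCV(
    Lasso(max_iter=10_000),
    param_grid={"alpha": np.logspace(-3, 1, 20)},  # candidate penalties
    cv=5,                                          # K = 5 folds
    scoring="neg_mean_squared_error",
)
search.fit(X, Y)
print("chosen alpha:", search.best_params_["alpha"])
```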

Tree-Based Methods

Decision trees: Recursively partition covariate space; predict constant within each partition (leaf).

Random forests: Average predictions across many trees, each trained on bootstrap samples with random feature subsets. Reduces variance, handles interactions and nonlinearities.

Gradient boosting (XGBoost, LightGBM): Sequentially fit trees to residuals, building up an ensemble. Often achieves best predictive performance.
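
A quick sketch comparing the two ensemble approaches on simulated nonlinear data (scikit-learn; all settings illustrative):

```python
# Minimal sketch: random forest vs. gradient boosting on nonlinear data.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
X = rng.uniform(-2, 2, size=(500, 5))
Y = np.sin(X[:, 0]) * X[:, 1] + rng.normal(scale=0.3, size=500)  # nonlinear, interactive

for model in [RandomForestRegressor(n_estimators=200, random_state=0),
              GradientBoostingRegressor(random_state=0)]:
    mse = -cross_val_score(model, X, Y, cv=5,
                           scoring="neg_mean_squared_error").mean()
    print(type(model).__name__, f"CV MSE: {mse:.3f}")
```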

Why These Methods Matter for Causal Inference

Causal inference methods often require estimating nuisance functions:

  • Propensity scores in IPW

  • Outcome models in AIPW

  • Both in double ML

ML methods estimate these flexibly without imposing restrictive functional forms. This reduces bias from model misspecification.


21.3 Double/Debiased Machine Learning

The Problem with Naive ML

Suppose we want to estimate the effect of $D$ on $Y$ controlling for $X$. Naive approach:

  1. Use ML to predict $Y$ from $(X, D)$

  2. Read off the coefficient on $D$

This fails because:

  • Regularization biases the treatment coefficient

  • Variable selection may drop $D$ if it's noisy

  • ML optimizes prediction, not causal identification

The Orthogonalization Insight

Chernozhukov et al. (2018) develop Double/Debiased Machine Learning (DML). The key insight: make the treatment effect estimator orthogonal to errors in nuisance function estimation.

Frisch-Waugh-Lovell intuition: In linear regression, the coefficient on DD equals:

  1. Regress $D$ on $X$, get residuals $\tilde{D}$

  2. Regress $Y$ on $X$, get residuals $\tilde{Y}$

  3. Regress $\tilde{Y}$ on $\tilde{D}$

The coefficient on $\tilde{D}$ is the treatment effect, purged of $X$'s influence.
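
A quick numerical check of the FWL equivalence (simulated data; all names are illustrative):

```python
# Minimal sketch: Frisch-Waugh-Lovell. The coefficient on D from the full
# regression equals the coefficient from the residual-on-residual regression.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
n = 1_000
X = rng.normal(size=(n, 3))
D = X @ np.array([0.5, -0.3, 0.2]) + rng.normal(size=n)  # D depends on X
Y = 1.5 * D + X @ np.array([1.0, 1.0, -0.5]) + rng.normal(size=n)

# Full regression of Y on (D, X)
full = LinearRegression().fit(np.column_stack([D, X]), Y)

# FWL: residualize D and Y on X, then regress residual on residual
D_res = D - LinearRegression().fit(X, D).predict(X)
Y_res = Y - LinearRegression().fit(X, Y).predict(X)
fwl = LinearRegression().fit(D_res.reshape(-1, 1), Y_res)

print(full.coef_[0], fwl.coef_[0])  # identical up to numerical error
```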

DML extends this to ML:

  1. Use ML to predict $D$ from $X$: $\hat{m}(X) = E[D|X]$

  2. Use ML to predict $Y$ from $X$: $\hat{g}(X) = E[Y|X]$

  3. Compute residuals: $\tilde{D} = D - \hat{m}(X)$, $\tilde{Y} = Y - \hat{g}(X)$

  4. Estimate treatment effect from residuals

Sample Splitting

A key innovation: use different samples to estimate nuisance functions and treatment effects.

Why? If we use the same data for both:

  • Overfitting in $\hat{m}(X)$ or $\hat{g}(X)$ biases the treatment effect

  • Even with orthogonalization, bias persists

Cross-fitting procedure:

  1. Split sample into K folds

  2. For each fold k:

    • Train $\hat{m}(X)$ and $\hat{g}(X)$ on the other folds

    • Compute residuals on fold k

  3. Estimate treatment effect using all residuals

This eliminates overfitting bias while using all data efficiently.
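
Putting the residualization steps and cross-fitting together, here is a minimal sketch for the partially linear model, with random forests standing in for the ML learners (scikit-learn; the data-generating process is illustrative):

```python
# Minimal sketch of DML with cross-fitting for the partially linear model:
# residualize D and Y on X with ML (out of fold), then regress residuals.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(4)
n = 2_000
X = rng.normal(size=(n, 10))
D = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(size=n)            # confounded treatment
Y = 1.0 * D + np.cos(X[:, 0]) + X[:, 1] ** 2 + rng.normal(size=n)   # true tau = 1.0

D_res, Y_res = np.zeros(n), np.zeros(n)
for train, test in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    m_hat = RandomForestRegressor(random_state=0).fit(X[train], D[train])
    g_hat = RandomForestRegressor(random_state=0).fit(X[train], Y[train])
    D_res[test] = D[test] - m_hat.predict(X[test])   # out-of-fold residuals
    Y_res[test] = Y[test] - g_hat.predict(X[test])

tau_hat = LinearRegression(fit_intercept=False).fit(
    D_res.reshape(-1, 1), Y_res).coef_[0]
print(f"DML estimate of tau: {tau_hat:.3f}")  # close to the true value of 1.0
```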

The DML Estimator

Definition 21.1 (Double Machine Learning Estimator): Under unconfoundedness given $X$, the DML estimator of the ATE is:

$$\hat{\tau}^{DML} = \frac{1}{n} \sum_i \left[ \hat{g}_1(X_i) - \hat{g}_0(X_i) + \frac{D_i(Y_i - \hat{g}_1(X_i))}{\hat{m}(X_i)} - \frac{(1-D_i)(Y_i - \hat{g}_0(X_i))}{1 - \hat{m}(X_i)} \right]$$

where $\hat{g}_d(X) = E[Y|X, D=d]$ and $\hat{m}(X) = P(D=1|X)$ are estimated by ML with cross-fitting.

Intuition: This is the augmented inverse probability weighted (AIPW) estimator from Chapter 11, with nuisance functions estimated by ML.
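
A minimal sketch of this estimator with cross-fitting (scikit-learn; unconfoundedness holds by construction in the simulated data, and all names are illustrative):

```python
# Minimal sketch of the cross-fitted AIPW/DML estimator from Definition 21.1.
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import KFold

rng = np.random.default_rng(5)
n = 4_000
X = rng.normal(size=(n, 5))
p = 1 / (1 + np.exp(-X[:, 0]))               # true propensity
D = rng.binomial(1, p)
Y = 2.0 * D + X[:, 0] + rng.normal(size=n)   # true ATE = 2.0

psi = np.zeros(n)
for train, test in KFold(5, shuffle=True, random_state=0).split(X):
    m = RandomForestClassifier(random_state=0).fit(X[train], D[train])
    g1 = RandomForestRegressor(random_state=0).fit(
        X[train][D[train] == 1], Y[train][D[train] == 1])
    g0 = RandomForestRegressor(random_state=0).fit(
        X[train][D[train] == 0], Y[train][D[train] == 0])
    m_hat = np.clip(m.predict_proba(X[test])[:, 1], 0.01, 0.99)  # trim for overlap
    g1_hat, g0_hat = g1.predict(X[test]), g0.predict(X[test])
    psi[test] = (g1_hat - g0_hat
                 + D[test] * (Y[test] - g1_hat) / m_hat
                 - (1 - D[test]) * (Y[test] - g0_hat) / (1 - m_hat))

tau_hat = psi.mean()
se = psi.std(ddof=1) / np.sqrt(n)   # influence-function-based standard error
print(f"ATE: {tau_hat:.3f}  (SE {se:.3f})")
```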

Properties

Double robustness: Consistent if either $\hat{m}(X)$ or $\hat{g}(X)$ is consistent (but not necessarily both).

Root-n consistency: Under regularity conditions, $\hat{\tau}^{DML}$ is $\sqrt{n}$-consistent and asymptotically normal, even though nuisance functions converge at slower (ML) rates.

Valid inference: Standard errors and confidence intervals are valid despite ML estimation of nuisances.

When to Use DML

Good candidates:

  • High-dimensional covariates (many potential confounders)

  • Complex relationships between covariates and outcomes

  • Selection on observables is credible

  • Standard parametric models seem too restrictive

Poor candidates:

  • Few covariates (standard regression suffices)

  • Unconfoundedness is not credible (need IV, DiD, etc.)

  • Sample size is small (ML needs data)


21.4 Causal Forests

From Chapter 20 to Here

Chapter 20 introduced causal forests for heterogeneous treatment effect estimation. Here we develop the methodology more fully.

The Causal Forest Algorithm

Goal: Estimate $\tau(x) = E[Y(1) - Y(0) | X = x]$ for any $x$.

Approach: Adapt random forests to target treatment effects rather than outcomes.

Key modifications:

  1. Splitting criterion: Instead of minimizing outcome variance, maximize heterogeneity in treatment effects between child nodes.

  2. Honest estimation: Use separate samples for determining splits and estimating leaf effects.

  3. Local centering: Remove $E[Y|X]$ and $E[D|X]$ before estimation (orthogonalization).

The Generalized Random Forest Framework

Athey, Tibshirani, and Wager (2019) develop Generalized Random Forests (GRF), which nest causal forests as a special case.

General setup: Estimate a parameter $\theta(x)$ defined by a local moment condition:

$$E[\psi(Y, D, \theta(x)) | X = x] = 0$$

For treatment effects: $\psi(Y, D, \theta) = (Y - D\theta - \mu_0)(D - p)$, where $\mu_0 = E[Y(0)|X]$ and $p = P(D=1|X)$.

Forest weighting: The forest produces weights $\alpha_i(x)$ indicating how much observation $i$ contributes to estimating $\theta(x)$:

$$\hat{\theta}(x) = \arg\min_\theta \sum_i \alpha_i(x) \cdot \rho(\psi(Y_i, D_i, \theta))$$

Observations in the same leaf as $x$ get high weight; distant observations get low weight.
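
Given the weights, and after local centering, the moment condition has a closed-form solution: a weighted residual-on-residual regression. A sketch (the weights here are random stand-ins; real forests derive them from leaf co-membership across trees):

```python
# Minimal sketch: solving the local moment condition given forest weights.
# After local centering (Y_tilde = Y - E[Y|X], D_tilde = D - P(D=1|X)),
# the weighted moment sum_i alpha_i(x)*(Y_tilde_i - theta*D_tilde_i)*D_tilde_i = 0
# has the closed form below.
import numpy as np

def local_tau(alpha, Y_tilde, D_tilde):
    """Weighted residual-on-residual estimate of tau(x)."""
    return np.sum(alpha * Y_tilde * D_tilde) / np.sum(alpha * D_tilde ** 2)

rng = np.random.default_rng(6)
n = 200
D_tilde = rng.normal(size=n)                   # centered treatment
Y_tilde = 1.5 * D_tilde + rng.normal(size=n)   # local effect = 1.5
alpha = rng.dirichlet(np.ones(n))              # stand-in forest weights
print(local_tau(alpha, Y_tilde, D_tilde))      # roughly 1.5
```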

Inference

Asymptotic normality: Under regularity conditions:

$$\frac{\hat{\tau}(x) - \tau(x)}{\hat{\sigma}(x)} \xrightarrow{d} N(0, 1)$$

Variance estimation: GRF provides variance estimates via infinitesimal jackknife, enabling confidence intervals.

Honest inference: Sample splitting ensures valid inference even with adaptive splitting.

Practical Usage
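
A minimal usage sketch, assuming the Python econml package's CausalForestDML (the R grf package's causal_forest plays the same role); the data and settings are illustrative:

```python
# Minimal usage sketch, assuming econml's CausalForestDML.
import numpy as np
from econml.dml import CausalForestDML
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

rng = np.random.default_rng(7)
n = 2_000
X = rng.normal(size=(n, 5))
D = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))
tau = 1.0 + X[:, 1]                        # heterogeneous effect
Y = tau * D + X[:, 0] + rng.normal(size=n)

cf = CausalForestDML(
    model_y=RandomForestRegressor(),       # nuisance: E[Y|X]
    model_t=RandomForestClassifier(),      # nuisance: P(D=1|X)
    discrete_treatment=True,
)
cf.fit(Y, D, X=X)                          # honest, locally centered forest
cate = cf.effect(X)                        # tau_hat(x) for each unit
lo, hi = cf.effect_interval(X, alpha=0.05) # pointwise 95% CIs
print(cate[:5])
```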


21.5 Targeted Learning (TMLE)

Motivation

Double ML and causal forests are regression-based. Targeted Maximum Likelihood Estimation (TMLE), developed by van der Laan and colleagues, takes a different approach: it "targets" the initial ML estimates toward the specific causal parameter of interest.

The Core Intuition: Why TMLE Works

ML algorithms optimize for prediction accuracy—minimizing squared error across all observations. But we care about a specific causal parameter (the ATE). These goals don't align perfectly.

The problem: A model that predicts outcomes well overall might be biased for the specific comparison we need (treated vs. control outcomes at the same covariate values).

TMLE's solution: Start with an ML prediction, then adjust it specifically to reduce bias for the causal estimand. The adjustment uses the propensity score to identify where the initial model is most likely wrong for causal purposes—specifically, at covariate values where treatment is rare.

Analogy: Imagine you're estimating average height in a population, but your sample over-represents tall people. TMLE doesn't just reweight—it adjusts the underlying height predictions themselves, pushing them in the direction that corrects for the sampling imbalance. The "clever covariate" tells you which direction to push and by how much.

Why "targeted"? The adjustment is targeted at reducing bias for your specific estimand (e.g., ATE), not improving general prediction accuracy. Different estimands get different adjustments.

The TMLE Procedure (Simplified)

Step 1: Initial estimates

  • Estimate outcome model: $\hat{Q}(D, X) = E[Y | D, X]$

  • Estimate propensity score: $\hat{g}(X) = P(D = 1 | X)$

Use any ML method (ensemble, neural network, etc.).

Step 2: Clever covariate. Define the "clever covariate":

$$H(D, X) = \frac{D}{\hat{g}(X)} - \frac{1-D}{1-\hat{g}(X)}$$

Step 3: Targeting step. Update the initial estimate by regressing:

$$Y = \hat{Q}(D, X) + \epsilon \cdot H(D, X) + \text{residual}$$

The coefficient $\hat{\epsilon}$ "targets" the estimate toward the causal parameter.

Step 4: Updated predictions:

$$\hat{Q}^*(D, X) = \hat{Q}(D, X) + \hat{\epsilon} \cdot H(D, X)$$

Step 5: Parameter estimate:

$$\hat{\tau}^{TMLE} = \frac{1}{n} \sum_i \left[\hat{Q}^*(1, X_i) - \hat{Q}^*(0, X_i)\right]$$
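
A minimal sketch of Steps 1–5 for a continuous outcome with a linear fluctuation (scikit-learn; illustrative data). Production implementations such as the R tmle package add refinements, e.g., a logistic fluctuation for bounded outcomes:

```python
# Minimal sketch of the simplified TMLE steps for the ATE (continuous Y).
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(8)
n = 4_000
X = rng.normal(size=(n, 4))
D = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))
Y = 2.0 * D + X[:, 0] + rng.normal(size=n)       # true ATE = 2.0

# Step 1: initial ML estimates of Q(D, X) and g(X)
Q = GradientBoostingRegressor().fit(np.column_stack([D, X]), Y)
g = GradientBoostingClassifier().fit(X, D)
g_hat = np.clip(g.predict_proba(X)[:, 1], 0.01, 0.99)
Q_obs = Q.predict(np.column_stack([D, X]))
Q1 = Q.predict(np.column_stack([np.ones(n), X]))
Q0 = Q.predict(np.column_stack([np.zeros(n), X]))

# Step 2: clever covariate H(D, X)
H = D / g_hat - (1 - D) / (1 - g_hat)

# Step 3: targeting step -- regress Y on H with Q_obs as offset
eps = LinearRegression(fit_intercept=False).fit(
    H.reshape(-1, 1), Y - Q_obs).coef_[0]

# Steps 4-5: update predictions (H(1,X)=1/g, H(0,X)=-1/(1-g)) and average
Q1_star = Q1 + eps / g_hat
Q0_star = Q0 - eps / (1 - g_hat)
print("TMLE ATE:", np.mean(Q1_star - Q0_star))
```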

Properties

Double robustness: Like AIPW, consistent if either outcome model or propensity score is consistent.

Efficiency: Achieves the semiparametric efficiency bound under correct specification.

Bounded estimates: Unlike IPW, TMLE produces bounded estimates even with extreme propensity scores.

Inference: Influence function-based standard errors provide valid confidence intervals.

Super Learner

TMLE is often combined with Super Learner—an ensemble method that combines multiple ML algorithms:

  1. Specify a library of learners (LASSO, random forest, neural net, etc.)

  2. Use cross-validation to find optimal weights for combining them

  3. Final prediction is weighted average of learners

This provides robust nuisance function estimation without choosing a single method.
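
scikit-learn's StackingRegressor approximates this idea (a sketch; dedicated Super Learner implementations, such as SuperLearner in R, additionally constrain the weights to be non-negative and sum to one):

```python
# Minimal Super Learner-style ensemble via cross-validated stacking.
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LassoCV, LinearRegression

super_learner = StackingRegressor(
    estimators=[
        ("lasso", LassoCV()),                            # library of learners
        ("forest", RandomForestRegressor(n_estimators=200)),
    ],
    final_estimator=LinearRegression(),  # combines out-of-fold predictions
    cv=5,
)
# Usage: super_learner.fit(X, Y), then super_learner.predict(X_new).
```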

TMLE vs. DML

| Aspect | DML | TMLE |
| --- | --- | --- |
| Approach | Orthogonalization | Targeting |
| Software | DoubleML (R/Python) | tmle, tmle3 (R) |
| Flexibility | Any ML | Super Learner typical |
| Inference | Asymptotic | Influence function |
| Bounded estimates | No (can extrapolate) | Yes |

In practice, both work well; choice often depends on software familiarity.


21.6 Prediction Policy Problems

When Prediction Solves the Problem

Not all policy questions require causal inference. Some are prediction policy problems (Kleinberg et al. 2015): the policy action depends on predicting an outcome, not on knowing causal effects.

Example: Bail decisions

A judge must decide whether to release a defendant before trial. The relevant question is: "Will this defendant flee or commit a crime if released?"

This is prediction: $P(\text{flee} | X)$, where $X$ includes defendant characteristics.

The causal question—"would releasing this person cause them to flee?"—is not needed. We want to predict behavior, not change it through the decision itself.

Characteristics of Prediction Policy Problems

  1. Action doesn't affect outcome: The policy acts on a prediction, not by changing the outcome-generating process.

  2. Outcome is observable for some: We observe outcomes for people who were released/hired/approved, enabling supervised learning.

  3. Selection bias is the main concern: We only observe outcomes for selected individuals, creating missing data.

Examples

Hiring: Predict job performance from application materials. Hiring doesn't change a person's inherent ability.

Medical testing: Predict disease presence to decide who to test further. Testing reveals disease; it doesn't cause it.

Fraud detection: Predict which transactions are fraudulent. Flagging doesn't change whether fraud occurred.

Targeting: Predict who will benefit most from an intervention, given we know the intervention works.

When Prediction Isn't Enough

Example: Advertising

Should we show an ad to this user? The prediction question is: "Will they buy if shown the ad?" But we need the causal question: "Would showing the ad cause them to buy (who wouldn't have bought otherwise)?"

Showing ads to people who would buy anyway wastes money. We need the incremental (causal) effect.

Rule of thumb: If the decision itself changes outcomes, you need causation. If the decision merely acts on an existing state, prediction may suffice.


21.7 When ML Helps and When It Doesn't

Where ML Adds Value

1. High-dimensional confounding. Many potential confounders; unclear which matter. ML selects and adjusts without pre-specifying.

2. Complex functional forms. Nonlinear relationships, interactions. ML learns these from data.

3. Heterogeneity discovery. Unknown effect modifiers. Causal forests find them.

4. Propensity score estimation. Overlap matters; ML can estimate propensity scores flexibly.

Where ML Doesn't Help

1. Identification. ML doesn't solve selection bias. If unconfoundedness fails, ML estimates are still biased—just with better nuisance function estimation.

Key point: ML improves estimation of identified parameters. It does not identify parameters that aren't identified.

2. Small samples. ML methods need data. With small samples, simpler models often perform better.

3. Interpretability. Understanding why something works may require simpler, interpretable models.

4. Strong prior knowledge. If you know the correct functional form, parametric methods are more efficient than ML.

The Credibility Revolution Meets ML

The credibility revolution (Chapter 1) emphasizes research design over statistical methods. ML doesn't change this:

  • A well-designed RCT with simple analysis beats a poorly designed observational study with fancy ML

  • ML complements good design; it doesn't substitute for it

  • Transparency about assumptions matters more than methodological sophistication


21.8 Causal Discovery: Learning Structure from Data

The Promise and the Problem

Most of this chapter assumes the causal structure is known: we know which variables confound, which mediate, which are colliders. Methods like DML and causal forests estimate effects given this structure. But can we learn the causal structure itself from data?

Causal discovery (also called causal structure learning) attempts exactly this. The field emerged from computer science and philosophy, particularly the work of Spirtes, Glymour, Scheines, and Pearl in the 1990s.

The appeal is obvious: instead of assuming a DAG, learn it. But the fundamental challenge is severe:

Observational Equivalence Problem

Multiple DAGs can generate identical observational distributions. These form Markov equivalence classes. Without additional assumptions or interventions, we can at best identify the equivalence class, not the true DAG.

For example, these three DAGs are Markov equivalent (they imply identical conditional independence relationships):

  • $X \to Y \to Z$ (chain)

  • $X \leftarrow Y \to Z$ (fork)

  • $X \leftarrow Y \leftarrow Z$ (reverse chain)

Observational data alone cannot distinguish them.
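
A simulation makes the point concrete: a chain and a fork produce the same marginal dependence and the same conditional independence, so no test on observational data separates them (illustrative code):

```python
# Minimal sketch: a chain and a fork imply the same CI pattern (X ⊥ Z | Y).
import numpy as np

rng = np.random.default_rng(9)
n = 50_000

def partial_corr_xz_given_y(X, Y, Z):
    """Correlation of X and Z after linearly removing Y from each."""
    rx = X - np.polyval(np.polyfit(Y, X, 1), Y)
    rz = Z - np.polyval(np.polyfit(Y, Z, 1), Z)
    return np.corrcoef(rx, rz)[0, 1]

# Chain: X -> Y -> Z
X = rng.normal(size=n)
Y = X + rng.normal(size=n)
Z = Y + rng.normal(size=n)
print("chain:", np.corrcoef(X, Z)[0, 1], partial_corr_xz_given_y(X, Y, Z))

# Fork: X <- Y -> Z
Y = rng.normal(size=n)
X = Y + rng.normal(size=n)
Z = Y + rng.normal(size=n)
print("fork: ", np.corrcoef(X, Z)[0, 1], partial_corr_xz_given_y(X, Y, Z))
# Both print a sizable marginal correlation and a near-zero partial correlation.
```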

Constraint-Based Methods

Constraint-based algorithms use conditional independence tests to learn causal structure:

PC Algorithm (named for its creators Peter Spirtes and Clark Glymour):

  1. Start with a complete undirected graph (all variables connected)

  2. Remove edges between conditionally independent variables

  3. Orient edges using v-structure patterns (colliders are identifiable)

  4. Propagate orientation constraints

The PC algorithm outputs a CPDAG (completed partially directed acyclic graph) representing the Markov equivalence class—some edges directed, some undetermined.
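
A minimal usage sketch, assuming the Python causal-learn package's pc function with its defaults (Fisher-z conditional independence test); output formats vary by version:

```python
# Minimal sketch, assuming causal-learn's PC implementation.
import numpy as np
from causallearn.search.ConstraintBased.PC import pc

rng = np.random.default_rng(10)
n = 5_000
X = rng.normal(size=n)
Y = X + rng.normal(size=n)
Z = Y + rng.normal(size=n)          # true structure: X -> Y -> Z
data = np.column_stack([X, Y, Z])

cg = pc(data, alpha=0.05)           # returns a CPDAG (equivalence class)
print(cg.G)                         # edges, some possibly left undirected
```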

FCI (Fast Causal Inference) extends PC to handle latent confounders, outputting a PAG (partial ancestral graph) that represents uncertainty about both edge direction and latent variables.

Strengths:

  • Principled; exploits conditional independence structure

  • Handles high-dimensional data with proper sparsity assumptions

  • Implementations available (pcalg in R, causal-learn in Python)

Limitations:

  • Requires the faithfulness assumption: all independencies in the data are due to the DAG structure (no "accidental" cancellations)

  • Sensitive to errors in conditional independence testing

  • Cannot orient all edges (equivalence class problem)

  • Assumes causal sufficiency (no latent confounders) for basic PC

Score-Based Methods

Score-based algorithms search over DAG space to optimize a scoring criterion:

  1. Define a score (e.g., BIC, BGe score for Bayesian scoring)

  2. Search over possible DAGs

  3. Return highest-scoring DAG(s)

GES (Greedy Equivalence Search) efficiently searches by adding then removing edges, operating on equivalence classes.

Strengths:

  • Can incorporate prior knowledge through priors

  • Naturally handles model comparison

  • Less sensitive to individual test errors than constraint-based

Limitations:

  • Computationally intensive (DAG space is super-exponential in the number of variables)

  • Score equivalence: Markov equivalent DAGs have equal scores

  • May find local optima

Modern Approaches

Recent work combines ideas and enables scaling:

NOTEARS (Zheng et al. 2018) treats structure learning as continuous optimization:

  • Parameterize the adjacency matrix as continuous

  • Add an acyclicity constraint (via trace of matrix exponential)

  • Optimize with gradient descent

This enables scaling to hundreds of variables and integration with deep learning.
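
The acyclicity function is easy to state in code: $h(W) = \text{tr}(e^{W \circ W}) - d$, which is zero exactly when $W$ encodes a DAG (a sketch with numpy/scipy):

```python
# Minimal sketch of NOTEARS's acyclicity function h(W) = tr(exp(W ∘ W)) - d.
import numpy as np
from scipy.linalg import expm

def h(W):
    d = W.shape[0]
    return np.trace(expm(W * W)) - d   # W * W is the elementwise (Hadamard) product

acyclic = np.array([[0.0, 1.0, 0.0],
                    [0.0, 0.0, 1.0],
                    [0.0, 0.0, 0.0]])  # X -> Y -> Z
cyclic = acyclic.copy()
cyclic[2, 0] = 1.0                     # add Z -> X, creating a cycle

print(h(acyclic))  # ≈ 0
print(h(cyclic))   # > 0, penalized during optimization
```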

Causal discovery with interventional data: When we observe some experimental interventions, the equivalence class shrinks. Active learning approaches design interventions to maximally resolve structural uncertainty.

Connection to Econometrics

Economists have generally been skeptical of purely algorithmic structure learning, preferring theory-guided identification. But connections exist:

David Hendry's general-to-specific modeling shares the spirit of algorithmic structure learning (see Chapter 16). Start with a general model, test down to a parsimonious specification. The Autometrics algorithm automates this with economic theory providing constraints.

DAG + theory: Some researchers use DAGs to represent theoretical structure, then test implied conditional independencies. Rejection suggests model misspecification. This uses causal discovery tools for model diagnostics rather than discovery.

Semi-parametric bounds: When structure is partially known, causal discovery can narrow the set of consistent structures, tightening partial identification bounds.

When Is Causal Discovery Useful?

Appropriate uses:

  • Exploratory analysis: Generating hypotheses about causal structure

  • Model diagnostics: Testing whether assumed structure is consistent with data

  • Very high-dimensional settings: When theory provides insufficient guidance

  • Complement to theory: Cross-check theoretical models against data patterns

Inappropriate uses:

  • Substitute for theory: Algorithms cannot replace substantive knowledge

  • Definitive causal conclusions: Equivalence classes, faithfulness violations, and finite-sample error limit certainty

  • Without domain expertise: Results require expert interpretation

Practical Guidance

Treat causal discovery as hypothesis-generating, not hypothesis-confirming. Use it to suggest structures for further investigation, not to establish causation. Always assess whether algorithmic output makes substantive sense.

For Further Study

Spirtes, Glymour & Scheines (2000), Causation, Prediction, and Search, 2nd ed. The foundational text.

Peters, Janzing & Schölkopf (2017), Elements of Causal Inference. Modern treatment from ML perspective.

Heinze-Deml, Maathuis & Meinshausen (2018), "Causal Structure Learning," Annual Review of Statistics. Accessible survey.


Practical Guidance

Choosing a Method

| Situation | Recommended Approach |
| --- | --- |
| ATE with high-dimensional $X$ | DML or TMLE |
| Heterogeneous effects | Causal forests (GRF) |
| Best prediction of outcomes | Super Learner |
| Few covariates, clear model | Standard regression |
| Pure prediction policy problem | Standard ML |
| Identification concerns | Fix design first, then consider ML |
| Structure unknown, exploratory | Causal discovery (hypothesis-generating) |

Common Pitfalls

Pitfall 1: Using ML to "control for" unobserved confounders. ML learns from observed data. If confounders are unobserved, ML cannot adjust for them.

How to avoid: Be clear about what ML does and doesn't do. It estimates functions of observed variables; it doesn't observe the unobserved.

Pitfall 2: Treating ML output as causal without design. "The random forest says $D$ is important for predicting $Y$" is not a causal statement.

How to avoid: Use causal ML methods (DML, causal forests) with appropriate identification assumptions, not off-the-shelf predictive ML.

Pitfall 3: Ignoring sample splitting. Using the same data to estimate nuisance functions and treatment effects biases inference.

How to avoid: Always use cross-fitting or sample splitting in DML/TMLE.

Pitfall 4: Black-box heterogeneity. Causal forests find heterogeneity, but the results may be hard to interpret or validate.

How to avoid: Validate with held-out data; test calibration; examine which variables drive heterogeneity.

Implementation Checklist

  • State and defend the identification assumption (e.g., unconfoundedness given $X$) before choosing an estimator

  • Check overlap: inspect the distribution of estimated propensity scores and trim or flag extreme values

  • Use cross-fitting or sample splitting for every ML-estimated nuisance function

  • Re-estimate with several learners (LASSO, random forest, boosting) and check that results are stable

  • Report influence-function-based or asymptotic standard errors, not naive ones

  • For heterogeneity, validate CATE estimates on held-out data and test calibration

Running Example: Returns to Education with ML

The Setting

Estimating returns to education with high-dimensional controls:

  • Many potential confounders: family background, ability proxies, location, cohort

  • Possible nonlinear and interaction effects

  • Unknown functional form for propensity score and outcome model

Traditional Approach

OLS with researcher-chosen controls. Risk: omit important confounders, misspecify functional form.

DML Approach

  1. Use LASSO/random forest to estimate $E[Y|X]$ (earnings given all covariates except education)

  2. Use LASSO/random forest to estimate $E[D|X]$ (education given covariates)

  3. Orthogonalize and estimate treatment effect on residuals

Advantages: Flexibly controls for many covariates; doesn't require choosing specification.

Finding: Belloni et al. (2014) apply DML-type methods to returns to education, finding estimates similar to careful OLS but with more robust standard errors.

Causal Forest for Heterogeneity

  1. Estimate causal forest with education as treatment, earnings as outcome

  2. Examine CATE across covariate space

  3. Identify who benefits most from additional education

Finding: Returns may be higher for disadvantaged students (who are marginal), suggesting education has redistributive potential.

Caveats

ML doesn't solve the fundamental identification problem:

  • If ability is unobserved and affects both education and earnings, ML estimates are biased

  • IV or other designs still needed for credible identification

  • ML complements but doesn't replace research design


Integration Note

Connections to Other Methods

| Method | Relationship | See Chapter |
| --- | --- | --- |
| Selection on Observables | DML/TMLE extend these methods with ML | Ch. 11 |
| Heterogeneity | Causal forests for CATE | Ch. 20 |
| IV | Can combine DML with IV for LATE | Ch. 12 |
| DiD | ML for covariate adjustment in DiD | Ch. 13 |

Triangulation Strategies

ML-based estimates gain credibility when:

  1. Comparison with traditional methods: DML and OLS give similar estimates

  2. Robustness to ML method: Results stable across LASSO, random forest, boosting

  3. Validation on held-out data: Predictions generalize

  4. Calibration tests: CATE estimates actually predict effect variation

  5. Design-based evidence: ML estimates align with RCT or natural experiment findings


Summary

Key takeaways:

  1. Prediction ≠ Causation: Standard ML learns associations, not causal effects. Using ML output as causal requires additional structure.

  2. ML for nuisance functions: ML excels at estimating propensity scores and outcome models—the "nuisance" functions in causal inference.

  3. Double ML uses orthogonalization and sample splitting to get valid causal estimates despite using ML for nuisance functions.

  4. Causal forests extend random forests to estimate heterogeneous treatment effects, with honest estimation enabling valid inference.

  5. TMLE targets initial ML estimates toward the causal parameter, providing doubly robust and efficient estimation.

  6. Prediction policy problems sometimes substitute for causal questions—but only when the decision doesn't change the outcome-generating process.

  7. ML complements, doesn't replace, research design: Good identification is still essential. ML improves estimation of identified parameters; it doesn't identify the unidentified.

Returning to the opening question: Machine learning can help answer causal questions, but not by treating causation as prediction. The contribution of ML is in flexibly estimating the nuisance functions (propensity scores, outcome models) that feed into causal estimators, and in discovering heterogeneous effects. The hard work of identification—ensuring we compare like with like—still requires research design.


Further Reading

Essential

  • Athey and Imbens (2019), "Machine Learning Methods That Economists Should Know About" - Overview for economists

  • Chernozhukov et al. (2018), "Double/Debiased Machine Learning" - DML methodology

For Deeper Understanding

  • Wager and Athey (2018), "Estimation and Inference of Heterogeneous Treatment Effects using Random Forests" - Causal forests

  • van der Laan and Rose (2011), Targeted Learning - TMLE textbook

  • Athey, Tibshirani, and Wager (2019), "Generalized Random Forests" - GRF framework

Advanced/Specialized

  • Kennedy (2022), "Semiparametric Doubly Robust Targeted Double Machine Learning" - Combining approaches

  • Künzel et al. (2019), "Metalearners for Estimating Heterogeneous Treatment Effects" - Meta-learner comparison

  • Nie and Wager (2021), "Quasi-Oracle Estimation of Heterogeneous Treatment Effects" - R-learner

Applications

  • Kleinberg et al. (2015), "Prediction Policy Problems" - When prediction suffices

  • Belloni et al. (2014), "High-Dimensional Methods and Inference on Structural and Treatment Effects" - LASSO for treatment effects

  • Davis and Heller (2017), "Using Causal Forests to Predict Treatment Heterogeneity" - Causal forests in practice

  • Imbens & Xu (2025), "Comparing Experimental and Nonexperimental Methods" JEP. Demonstrates DML, AIPW-GRF, and causal forests on LaLonde data; shows these methods estimate ATT well with overlap but struggle with CATE. Tutorial and replication data at https://yiqingxu.org/tutorials/lalonde/.


Exercises

Conceptual

  1. Explain why regularization (LASSO, Ridge) applied to a regression of $Y$ on $(X, D)$ does not give a valid estimate of the causal effect of $D$. What goes wrong?

  2. What is the role of sample splitting in Double ML? What would happen without it?

  3. Distinguish between a prediction policy problem and a causal inference problem. Give an example of each in the context of healthcare.

Applied

  1. Using a dataset with treatment, outcome, and many covariates:

    • Estimate the ATE using OLS with researcher-selected controls

    • Estimate using DML with LASSO for nuisance functions

    • Estimate using DML with random forest for nuisance functions

    • Compare estimates and standard errors across methods

  2. Implement a causal forest on a dataset with known heterogeneity:

    • Estimate CATEs

    • Identify the most important variables for heterogeneity

    • Test calibration: do high-$\hat{\tau}(x)$ individuals actually have larger effects?

Discussion

  1. A researcher argues: "ML is a black box. I'd rather use simple methods I understand than complex methods that might be doing something I don't understand." Another responds: "Simple methods impose functional form assumptions that are surely wrong. ML is more honest about our ignorance." Who is right?


Appendix 21A: The Neyman Orthogonality Condition

Why Orthogonality Matters

The key to DML is the Neyman orthogonality condition: the influence function for the treatment effect should be orthogonal to the nuisance function estimation error.

Formally: Let $\theta$ be the treatment effect and $\eta$ be nuisance parameters (propensity score, outcome model). The estimating equation $\psi(W; \theta, \eta)$ satisfies Neyman orthogonality if:

$$\frac{\partial}{\partial \eta} E[\psi(W; \theta_0, \eta)] \bigg|_{\eta = \eta_0} = 0$$

Intuition: Small errors in estimating $\eta$ don't bias estimation of $\theta$.

The AIPW Moment

The AIPW moment function:

$$\psi(W; \theta, g, \mu_0, \mu_1) = \mu_1(X) - \mu_0(X) + \frac{D(Y - \mu_1(X))}{g(X)} - \frac{(1-D)(Y - \mu_0(X))}{1-g(X)} - \theta$$

satisfies Neyman orthogonality with respect to $(g, \mu_0, \mu_1)$.

This is why DML based on AIPW achieves $\sqrt{n}$-consistency even with slower-converging ML estimates of nuisance functions.
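
A numerical illustration (simulated data; the perturbation scheme is illustrative): perturb both nuisance functions by an amount $t$ and compare the plug-in outcome-model estimator with the AIPW estimator. The plug-in bias grows linearly in $t$; the AIPW bias is second-order:

```python
# Minimal sketch: Neyman orthogonality makes AIPW insensitive, to first
# order, to errors in the nuisance functions.
import numpy as np

rng = np.random.default_rng(11)
n = 200_000
X = rng.normal(size=n)
g = 1 / (1 + np.exp(-X))                    # true propensity score
D = rng.binomial(1, g)
Y = 2.0 * D + X + rng.normal(size=n)        # true ATE = 2.0
mu1, mu0 = 2.0 + X, X                       # true outcome regressions

for t in [0.0, 0.1, 0.2, 0.4]:
    m1_t = mu1 + t * X**2                                # perturbed outcome model
    g_t = np.clip(g + 0.1 * t * np.tanh(X), 0.01, 0.99)  # perturbed propensity
    plug_in = np.mean(m1_t - mu0)                        # g-formula plug-in
    aipw = np.mean(m1_t - mu0
                   + D * (Y - m1_t) / g_t
                   - (1 - D) * (Y - mu0) / (1 - g_t))
    print(f"t={t:.1f}  plug-in bias={plug_in - 2.0:+.3f}  AIPW bias={aipw - 2.0:+.3f}")
# The plug-in bias grows linearly in t; the AIPW bias is a product of the two
# nuisance errors, so it grows only quadratically (Neyman orthogonality).
```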
