Chapter 21: Machine Learning for Causal Inference

Opening Question

Machine learning excels at prediction—can it also help us answer causal questions, and if so, how?


Chapter Overview

Machine learning has transformed prediction. From image recognition to language translation to demand forecasting, ML methods achieve unprecedented accuracy by learning complex patterns from data. But prediction and causation are fundamentally different. Predicting who will default on a loan is not the same as knowing whether denying them credit would change their behavior. Predicting which patients will die is not the same as knowing whether a treatment would save them.

This chapter explores the intersection of machine learning and causal inference. The key insight is that ML methods can help with causal inference—but only in specific ways and under specific conditions. We'll see how ML can estimate nuisance functions (propensity scores, outcome predictions) that are then used for causal estimation, how causal forests discover heterogeneous treatment effects, and when prediction itself solves the policy problem without requiring causal knowledge.

What you will learn:

  • Why prediction and causation require different approaches

  • ML fundamentals relevant to causal inference (regularization, cross-validation, ensembles)

  • Double/debiased machine learning for treatment effect estimation

  • Causal forests for heterogeneous effects (extending Chapter 20)

  • Targeted learning (TMLE) and its advantages

  • When prediction problems substitute for causal questions

Prerequisites: Chapter 11 (Selection on Observables), Chapter 20 (Heterogeneity), basic familiarity with regression and machine learning concepts


21.1 Prediction vs. Causation

The Fundamental Difference

Prediction asks: Given what I observe about this unit, what outcome do I expect? $E[Y | X = x]$

Causation asks: If I intervene to change treatment, what happens to outcomes? $E[Y | do(D = 1)] - E[Y | do(D = 0)]$

These are different questions with different answers.

Example: Hospital ICU admission

Prediction: Patients in the ICU have higher mortality. Seeing someone in the ICU predicts death.

Causation: ICU admission probably reduces mortality—that's why we admit people. The ICU is correlated with death because sick people go there, not because it kills them.

A model that predicts "ICU → death" is correct for prediction but wrong for the policy question "should we expand ICU capacity?"

Why Standard ML Fails for Causation

Standard ML minimizes prediction error: $\min_f E[(Y - f(X, D))^2]$

This learns associations, including those driven by confounding. If $D$ is correlated with $Y$ through confounders, ML learns that correlation—useful for prediction, misleading for causation.

Regularization makes it worse: LASSO and Ridge shrink coefficients toward zero based on predictive value, not causal importance. A treatment effect that's real but noisy gets shrunk; a confounder that's predictive but non-causal gets kept.

Where ML Can Help

ML excels at learning complex functions from data. In causal inference, we often need to estimate:

  1. Propensity scores: $P(D = 1 | X)$—the probability of treatment given covariates

  2. Outcome models: $E[Y | X, D]$—expected outcomes conditional on covariates and treatment

  3. Heterogeneous effects: $\tau(x) = E[Y(1) - Y(0) | X = x]$—how effects vary with covariates

These are prediction problems! ML can estimate them well, and the estimates feed into causal inference methods.


21.2 ML Essentials for Causal Inference

Regularization

High-dimensional covariates create overfitting risk. Regularization penalizes model complexity.

LASSO (L1 penalty): $\min_\beta \sum_i (Y_i - X_i'\beta)^2 + \lambda \sum_j |\beta_j|$

  • Shrinks coefficients; sets some to exactly zero

  • Performs variable selection

  • Useful when there are many covariates but only a few matter

Ridge (L2 penalty): $\min_\beta \sum_i (Y_i - X_i'\beta)^2 + \lambda \sum_j \beta_j^2$

  • Shrinks all coefficients toward zero

  • Keeps all variables; none exactly zero

  • Handles multicollinearity well

Elastic Net (combined): $\min_\beta \sum_i (Y_i - X_i'\beta)^2 + \lambda_1 \sum_j |\beta_j| + \lambda_2 \sum_j \beta_j^2$

Figure 21.1: Regularization shrinks treatment effects toward zero. As the penalty parameter $\lambda$ increases (moving right), both LASSO and Ridge shrink the treatment coefficient away from its true value. The shaded region shows the bias introduced by regularization. This is why naive application of penalized regression to causal inference produces biased estimates.
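
To see the shrinkage concretely, here is a minimal simulation (using scikit-learn; the data-generating process and the true effect of 2.0 are illustrative): even with a randomized treatment, LASSO pulls the treatment coefficient toward zero as the penalty grows.

```python
# Minimal sketch: regularization shrinks a real treatment effect.
# All names and the true effect (2.0) are illustrative.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 500, 50
X = rng.normal(size=(n, p))                  # noise covariates
D = rng.binomial(1, 0.5, size=n)             # randomized treatment
Y = 2.0 * D + X[:, 0] + rng.normal(size=n)   # true effect = 2.0

features = np.column_stack([D, X])
for lam in [0.01, 0.1, 0.5, 1.0]:
    model = Lasso(alpha=lam).fit(features, Y)
    print(f"lambda={lam:4.2f}  treatment coef={model.coef_[0]:.3f}")
# The coefficient on D moves toward zero as lambda increases,
# even though D is randomized and its effect is real.
```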

Cross-Validation

How do we choose tuning parameters ($\lambda$, tree depth, etc.)?

K-fold cross-validation:

  1. Split data into K folds

  2. For each fold k: train on other K-1 folds, evaluate on fold k

  3. Average performance across folds

  4. Choose parameters that minimize average error

This prevents overfitting to the training data.
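
A minimal sketch of penalty selection by K-fold cross-validation, using scikit-learn's GridSearchCV on illustrative data:

```python
# Minimal sketch of K-fold cross-validation for choosing the LASSO penalty.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 30))
Y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=400)

search = GridSearchCV(
    Lasso(max_iter=10_000),
    param_grid={"alpha": np.logspace(-3, 1, 20)},  # candidate penalties
    cv=5,                                          # K = 5 folds
    scoring="neg_mean_squared_error",
)
search.fit(X, Y)
print("chosen alpha:", search.best_params_["alpha"])
```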

Tree-Based Methods

Decision trees: Recursively partition covariate space; predict constant within each partition (leaf).

Random forests: Average predictions across many trees, each trained on bootstrap samples with random feature subsets. Reduces variance, handles interactions and nonlinearities.

Gradient boosting (XGBoost, LightGBM): Sequentially fit trees to residuals, building up an ensemble. Often achieves best predictive performance.
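
A quick sketch comparing the two ensemble approaches on simulated nonlinear data (scikit-learn; all settings illustrative):

```python
# Minimal sketch: random forest vs. gradient boosting on nonlinear data.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
X = rng.uniform(-2, 2, size=(500, 5))
Y = np.sin(X[:, 0]) * X[:, 1] + rng.normal(scale=0.3, size=500)  # nonlinear, interactive

for model in [RandomForestRegressor(n_estimators=200, random_state=0),
              GradientBoostingRegressor(random_state=0)]:
    mse = -cross_val_score(model, X, Y, cv=5,
                           scoring="neg_mean_squared_error").mean()
    print(type(model).__name__, f"CV MSE: {mse:.3f}")
```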

Why These Methods Matter for Causal Inference

Causal inference methods often require estimating nuisance functions:

  • Propensity scores in IPW

  • Outcome models in AIPW

  • Both in double ML

ML methods estimate these flexibly without imposing restrictive functional forms. This reduces bias from model misspecification.


21.3 Double/Debiased Machine Learning

The Problem with Naive ML

Suppose we want to estimate the effect of $D$ on $Y$ controlling for $X$. Naive approach:

  1. Use ML to predict $Y$ from $(X, D)$

  2. Read off the coefficient on $D$

This fails because:

  • Regularization biases the treatment coefficient

  • Variable selection may drop $D$ if it's noisy

  • ML optimizes prediction, not causal identification

The Orthogonalization Insight

Chernozhukov et al. (2018) develop Double/Debiased Machine Learning (DML). The key insight: make the treatment effect estimator orthogonal to errors in nuisance function estimation.

Frisch-Waugh-Lovell intuition: In linear regression, the coefficient on DD equals:

  1. Regress $D$ on $X$, get residuals $\tilde{D}$

  2. Regress $Y$ on $X$, get residuals $\tilde{Y}$

  3. Regress $\tilde{Y}$ on $\tilde{D}$

The coefficient on $\tilde{D}$ is the treatment effect, purged of $X$'s influence.
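
A quick numerical check of the FWL equivalence (simulated data; all names are illustrative):

```python
# Minimal sketch: Frisch-Waugh-Lovell. The coefficient on D from the full
# regression equals the coefficient from the residual-on-residual regression.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
n = 1_000
X = rng.normal(size=(n, 3))
D = X @ np.array([0.5, -0.3, 0.2]) + rng.normal(size=n)  # D depends on X
Y = 1.5 * D + X @ np.array([1.0, 1.0, -0.5]) + rng.normal(size=n)

# Full regression of Y on (D, X)
full = LinearRegression().fit(np.column_stack([D, X]), Y)

# FWL: residualize D and Y on X, then regress residual on residual
D_res = D - LinearRegression().fit(X, D).predict(X)
Y_res = Y - LinearRegression().fit(X, Y).predict(X)
fwl = LinearRegression().fit(D_res.reshape(-1, 1), Y_res)

print(full.coef_[0], fwl.coef_[0])  # identical up to numerical error
```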

DML extends this to ML:

  1. Use ML to predict $D$ from $X$: $\hat{m}(X) = E[D|X]$

  2. Use ML to predict $Y$ from $X$: $\hat{g}(X) = E[Y|X]$

  3. Compute residuals: $\tilde{D} = D - \hat{m}(X)$, $\tilde{Y} = Y - \hat{g}(X)$

  4. Estimate treatment effect from residuals

Sample Splitting

A key innovation: use different samples to estimate nuisance functions and treatment effects.

Why? If we use the same data for both:

  • Overfitting in $\hat{m}(X)$ or $\hat{g}(X)$ biases the treatment effect

  • Even with orthogonalization, bias persists

Cross-fitting procedure:

  1. Split sample into K folds

  2. For each fold k:

    • Train $\hat{m}(X)$ and $\hat{g}(X)$ on the other folds

    • Compute residuals on fold k

  3. Estimate treatment effect using all residuals

This eliminates overfitting bias while using all data efficiently.
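
Putting the residualization steps and cross-fitting together, here is a minimal sketch for the partially linear model, with random forests standing in for the ML learners (scikit-learn; the data-generating process is illustrative):

```python
# Minimal sketch of DML with cross-fitting for the partially linear model:
# residualize D and Y on X with ML (out of fold), then regress residuals.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(4)
n = 2_000
X = rng.normal(size=(n, 10))
D = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(size=n)            # confounded treatment
Y = 1.0 * D + np.cos(X[:, 0]) + X[:, 1] ** 2 + rng.normal(size=n)   # true tau = 1.0

D_res, Y_res = np.zeros(n), np.zeros(n)
for train, test in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    m_hat = RandomForestRegressor(random_state=0).fit(X[train], D[train])
    g_hat = RandomForestRegressor(random_state=0).fit(X[train], Y[train])
    D_res[test] = D[test] - m_hat.predict(X[test])   # out-of-fold residuals
    Y_res[test] = Y[test] - g_hat.predict(X[test])

tau_hat = LinearRegression(fit_intercept=False).fit(
    D_res.reshape(-1, 1), Y_res).coef_[0]
print(f"DML estimate of tau: {tau_hat:.3f}")  # close to the true value of 1.0
```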

The DML Estimator

Definition 21.1 (Double Machine Learning Estimator): Under unconfoundedness given $X$, the DML estimator of the ATE is:

$$\hat{\tau}^{DML} = \frac{1}{n} \sum_i \left[ \hat{g}_1(X_i) - \hat{g}_0(X_i) + \frac{D_i(Y_i - \hat{g}_1(X_i))}{\hat{m}(X_i)} - \frac{(1-D_i)(Y_i - \hat{g}_0(X_i))}{1 - \hat{m}(X_i)} \right]$$

where $\hat{g}_d(X) = E[Y|X, D=d]$ and $\hat{m}(X) = P(D=1|X)$ are estimated by ML with cross-fitting.

Intuition: This is the augmented inverse probability weighted (AIPW) estimator from Chapter 11, with nuisance functions estimated by ML.
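
A minimal sketch of this estimator with cross-fitting (scikit-learn; unconfoundedness holds by construction in the simulated data, and all names are illustrative):

```python
# Minimal sketch of the cross-fitted AIPW/DML estimator from Definition 21.1.
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import KFold

rng = np.random.default_rng(5)
n = 4_000
X = rng.normal(size=(n, 5))
p = 1 / (1 + np.exp(-X[:, 0]))               # true propensity
D = rng.binomial(1, p)
Y = 2.0 * D + X[:, 0] + rng.normal(size=n)   # true ATE = 2.0

psi = np.zeros(n)
for train, test in KFold(5, shuffle=True, random_state=0).split(X):
    m = RandomForestClassifier(random_state=0).fit(X[train], D[train])
    g1 = RandomForestRegressor(random_state=0).fit(
        X[train][D[train] == 1], Y[train][D[train] == 1])
    g0 = RandomForestRegressor(random_state=0).fit(
        X[train][D[train] == 0], Y[train][D[train] == 0])
    m_hat = np.clip(m.predict_proba(X[test])[:, 1], 0.01, 0.99)  # trim for overlap
    g1_hat, g0_hat = g1.predict(X[test]), g0.predict(X[test])
    psi[test] = (g1_hat - g0_hat
                 + D[test] * (Y[test] - g1_hat) / m_hat
                 - (1 - D[test]) * (Y[test] - g0_hat) / (1 - m_hat))

tau_hat = psi.mean()
se = psi.std(ddof=1) / np.sqrt(n)   # influence-function-based standard error
print(f"ATE: {tau_hat:.3f}  (SE {se:.3f})")
```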

Properties

Double robustness: Consistent if either $\hat{m}(X)$ or $\hat{g}(X)$ is consistent (but not necessarily both).

Root-n consistency: Under regularity conditions, $\hat{\tau}^{DML}$ is $\sqrt{n}$-consistent and asymptotically normal, even though nuisance functions converge at slower (ML) rates.

Valid inference: Standard errors and confidence intervals are valid despite ML estimation of nuisances.

When to Use DML

Good candidates:

  • High-dimensional covariates (many potential confounders)

  • Complex relationships between covariates and outcomes

  • Selection on observables is credible

  • Standard parametric models seem too restrictive

Poor candidates:

  • Few covariates (standard regression suffices)

  • Unconfoundedness is not credible (need IV, DiD, etc.)

  • Sample size is small (ML needs data)


21.4 Causal Forests

From Chapter 20 to Here

Chapter 20 introduced causal forests for heterogeneous treatment effect estimation. Here we develop the methodology more fully.

The Causal Forest Algorithm

Goal: Estimate $\tau(x) = E[Y(1) - Y(0) | X = x]$ for any $x$.

Approach: Adapt random forests to target treatment effects rather than outcomes.

Key modifications:

  1. Splitting criterion: Instead of minimizing outcome variance, maximize heterogeneity in treatment effects between child nodes.

  2. Honest estimation: Use separate samples for determining splits and estimating leaf effects.

  3. Local centering: Remove $E[Y|X]$ and $E[D|X]$ before estimation (orthogonalization).

The Generalized Random Forest Framework

Athey, Tibshirani, and Wager (2019) develop Generalized Random Forests (GRF), which nest causal forests as a special case.

General setup: Estimate a parameter $\theta(x)$ defined by a local moment condition:

$$E[\psi(Y, D, \theta(x)) | X = x] = 0$$

For treatment effects: $\psi(Y, D, \theta) = (Y - D\theta - \mu_0)(D - p)$, where $\mu_0 = E[Y(0)|X]$ and $p = P(D=1|X)$.

Forest weighting: The forest produces weights $\alpha_i(x)$ indicating how much observation $i$ contributes to estimating $\theta(x)$:

$$\hat{\theta}(x) = \arg\min_\theta \sum_i \alpha_i(x) \cdot \rho(\psi(Y_i, D_i, \theta))$$

Observations in the same leaf as $x$ get high weight; distant observations get low weight.
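
Given the weights, and after local centering, the moment condition has a closed-form solution: a weighted residual-on-residual regression. A sketch (the weights here are random stand-ins; real forests derive them from leaf co-membership across trees):

```python
# Minimal sketch: solving the local moment condition given forest weights.
# After local centering (Y_tilde = Y - E[Y|X], D_tilde = D - P(D=1|X)),
# the weighted moment sum_i alpha_i(x)*(Y_tilde_i - theta*D_tilde_i)*D_tilde_i = 0
# has the closed form below.
import numpy as np

def local_tau(alpha, Y_tilde, D_tilde):
    """Weighted residual-on-residual estimate of tau(x)."""
    return np.sum(alpha * Y_tilde * D_tilde) / np.sum(alpha * D_tilde ** 2)

rng = np.random.default_rng(6)
n = 200
D_tilde = rng.normal(size=n)                   # centered treatment
Y_tilde = 1.5 * D_tilde + rng.normal(size=n)   # local effect = 1.5
alpha = rng.dirichlet(np.ones(n))              # stand-in forest weights
print(local_tau(alpha, Y_tilde, D_tilde))      # roughly 1.5
```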

Inference

Asymptotic normality: Under regularity conditions:

$$\frac{\hat{\tau}(x) - \tau(x)}{\hat{\sigma}(x)} \xrightarrow{d} N(0, 1)$$

Variance estimation: GRF provides variance estimates via infinitesimal jackknife, enabling confidence intervals.

Honest inference: Sample splitting ensures valid inference even with adaptive splitting.

Practical Usage
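
A minimal usage sketch, assuming the Python econml package's CausalForestDML (the R grf package's causal_forest plays the same role); the data and settings are illustrative:

```python
# Minimal usage sketch, assuming econml's CausalForestDML.
import numpy as np
from econml.dml import CausalForestDML
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

rng = np.random.default_rng(7)
n = 2_000
X = rng.normal(size=(n, 5))
D = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))
tau = 1.0 + X[:, 1]                        # heterogeneous effect
Y = tau * D + X[:, 0] + rng.normal(size=n)

cf = CausalForestDML(
    model_y=RandomForestRegressor(),       # nuisance: E[Y|X]
    model_t=RandomForestClassifier(),      # nuisance: P(D=1|X)
    discrete_treatment=True,
)
cf.fit(Y, D, X=X)                          # honest, locally centered forest
cate = cf.effect(X)                        # tau_hat(x) for each unit
lo, hi = cf.effect_interval(X, alpha=0.05) # pointwise 95% CIs
print(cate[:5])
```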


21.5 Targeted Learning (TMLE)

Motivation

Double ML and causal forests are regression-based. Targeted Maximum Likelihood Estimation (TMLE), developed by van der Laan and colleagues, takes a different approach: it "targets" the initial ML estimates toward the specific causal parameter of interest.

The Core Intuition: Why TMLE Works

ML algorithms optimize for prediction accuracy—minimizing squared error across all observations. But we care about a specific causal parameter (the ATE). These goals don't align perfectly.

The problem: A model that predicts outcomes well overall might be biased for the specific comparison we need (treated vs. control outcomes at the same covariate values).

TMLE's solution: Start with an ML prediction, then adjust it specifically to reduce bias for the causal estimand. The adjustment uses the propensity score to identify where the initial model is most likely wrong for causal purposes—specifically, at covariate values where treatment is rare.

Analogy: Imagine you're estimating average height in a population, but your sample over-represents tall people. TMLE doesn't just reweight—it adjusts the underlying height predictions themselves, pushing them in the direction that corrects for the sampling imbalance. The "clever covariate" tells you which direction to push and by how much.

Why "targeted"? The adjustment is targeted at reducing bias for your specific estimand (e.g., ATE), not improving general prediction accuracy. Different estimands get different adjustments.

The TMLE Procedure (Simplified)

Step 1: Initial estimates

  • Estimate outcome model: $\hat{Q}(D, X) = E[Y | D, X]$

  • Estimate propensity score: $\hat{g}(X) = P(D = 1 | X)$

Use any ML method (ensemble, neural network, etc.).

Step 2: Clever covariate. Define the "clever covariate":

$$H(D, X) = \frac{D}{\hat{g}(X)} - \frac{1-D}{1-\hat{g}(X)}$$

Step 3: Targeting step. Update the initial estimate by regressing:

$$Y = \hat{Q}(D, X) + \epsilon \cdot H(D, X) + \text{residual}$$

The coefficient $\hat{\epsilon}$ "targets" the estimate toward the causal parameter.

Step 4: Updated predictions:

$$\hat{Q}^*(D, X) = \hat{Q}(D, X) + \hat{\epsilon} \cdot H(D, X)$$

Step 5: Parameter estimate:

$$\hat{\tau}^{TMLE} = \frac{1}{n} \sum_i \left[\hat{Q}^*(1, X_i) - \hat{Q}^*(0, X_i)\right]$$
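
A minimal sketch of Steps 1–5 for a continuous outcome with a linear fluctuation (scikit-learn; illustrative data). Production implementations such as the R tmle package add refinements, e.g., a logistic fluctuation for bounded outcomes:

```python
# Minimal sketch of the simplified TMLE steps for the ATE (continuous Y).
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(8)
n = 4_000
X = rng.normal(size=(n, 4))
D = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))
Y = 2.0 * D + X[:, 0] + rng.normal(size=n)       # true ATE = 2.0

# Step 1: initial ML estimates of Q(D, X) and g(X)
Q = GradientBoostingRegressor().fit(np.column_stack([D, X]), Y)
g = GradientBoostingClassifier().fit(X, D)
g_hat = np.clip(g.predict_proba(X)[:, 1], 0.01, 0.99)
Q_obs = Q.predict(np.column_stack([D, X]))
Q1 = Q.predict(np.column_stack([np.ones(n), X]))
Q0 = Q.predict(np.column_stack([np.zeros(n), X]))

# Step 2: clever covariate H(D, X)
H = D / g_hat - (1 - D) / (1 - g_hat)

# Step 3: targeting step -- regress Y on H with Q_obs as offset
eps = LinearRegression(fit_intercept=False).fit(
    H.reshape(-1, 1), Y - Q_obs).coef_[0]

# Steps 4-5: update predictions (H(1,X)=1/g, H(0,X)=-1/(1-g)) and average
Q1_star = Q1 + eps / g_hat
Q0_star = Q0 - eps / (1 - g_hat)
print("TMLE ATE:", np.mean(Q1_star - Q0_star))
```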

Properties

Double robustness: Like AIPW, consistent if either outcome model or propensity score is consistent.

Efficiency: Achieves the semiparametric efficiency bound under correct specification.

Bounded estimates: Unlike IPW, TMLE produces bounded estimates even with extreme propensity scores.

Inference: Influence function-based standard errors provide valid confidence intervals.

Super Learner

TMLE is often combined with Super Learner—an ensemble method that combines multiple ML algorithms:

  1. Specify a library of learners (LASSO, random forest, neural net, etc.)

  2. Use cross-validation to find optimal weights for combining them

  3. Final prediction is weighted average of learners

This provides robust nuisance function estimation without choosing a single method.
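
scikit-learn's StackingRegressor approximates this idea (a sketch; dedicated Super Learner implementations, such as SuperLearner in R, additionally constrain the weights to be non-negative and sum to one):

```python
# Minimal Super Learner-style ensemble via cross-validated stacking.
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LassoCV, LinearRegression

super_learner = StackingRegressor(
    estimators=[
        ("lasso", LassoCV()),                            # library of learners
        ("forest", RandomForestRegressor(n_estimators=200)),
    ],
    final_estimator=LinearRegression(),  # combines out-of-fold predictions
    cv=5,
)
# Usage: super_learner.fit(X, Y), then super_learner.predict(X_new).
```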

TMLE vs. DML

| Aspect | DML | TMLE |
| --- | --- | --- |
| Approach | Orthogonalization | Targeting |
| Software | DoubleML (R/Python) | tmle, tmle3 (R) |
| Flexibility | Any ML | Super Learner typical |
| Inference | Asymptotic | Influence function |
| Bounded estimates | No (can extrapolate) | Yes |

In practice, both work well; choice often depends on software familiarity.


21.6 Prediction Policy Problems

When Prediction Solves the Problem

Not all policy questions require causal inference. Some are prediction policy problems (Kleinberg et al. 2015): the policy action depends on predicting an outcome, not on knowing causal effects.

Example: Bail decisions

A judge must decide whether to release a defendant before trial. The relevant question is: "Will this defendant flee or commit a crime if released?"

This is prediction: $P(\text{flee} | X)$, where $X$ includes defendant characteristics.

The causal question—"would releasing this person cause them to flee?"—is not needed. We want to predict behavior, not change it through the decision itself.

Characteristics of Prediction Policy Problems

  1. Action doesn't affect outcome: The policy acts on a prediction, not by changing the outcome-generating process.

  2. Outcome is observable for some: We observe outcomes for people who were released/hired/approved, enabling supervised learning.

  3. Selection bias is the main concern: We only observe outcomes for selected individuals, creating missing data.

Examples

Hiring: Predict job performance from application materials. Hiring doesn't change a person's inherent ability.

Medical testing: Predict disease presence to decide who to test further. Testing reveals disease; it doesn't cause it.

Fraud detection: Predict which transactions are fraudulent. Flagging doesn't change whether fraud occurred.

Targeting: Predict who will benefit most from an intervention, given we know the intervention works.

When Prediction Isn't Enough

Example: Advertising

Should we show an ad to this user? The prediction question is: "Will they buy if shown the ad?" But we need the causal question: "Would showing the ad cause them to buy (who wouldn't have bought otherwise)?"

Showing ads to people who would buy anyway wastes money. We need the incremental (causal) effect.

Rule of thumb: If the decision itself changes outcomes, you need causation. If the decision merely acts on an existing state, prediction may suffice.


21.7 When ML Helps and When It Doesn't

Where ML Adds Value

1. High-dimensional confounding. Many potential confounders; unclear which matter. ML selects and adjusts without pre-specifying.

2. Complex functional forms. Nonlinear relationships, interactions. ML learns these from data.

3. Heterogeneity discovery. Unknown effect modifiers. Causal forests find them.

4. Propensity score estimation. Overlap matters; ML can estimate propensity scores flexibly.

Where ML Doesn't Help

1. Identification. ML doesn't solve selection bias. If unconfoundedness fails, ML estimates are still biased—just with better nuisance function estimation.

Key point: ML improves estimation of identified parameters. It does not identify parameters that aren't identified.

2. Small samples. ML methods need data. With small samples, simpler models often perform better.

3. Interpretability. Understanding why something works may require simpler, interpretable models.

4. Strong prior knowledge. If you know the correct functional form, parametric methods are more efficient than ML.

The Credibility Revolution Meets ML

The credibility revolution (Chapter 1) emphasizes research design over statistical methods. ML doesn't change this:

  • A well-designed RCT with simple analysis beats a poorly designed observational study with fancy ML

  • ML complements good design; it doesn't substitute for it

  • Transparency about assumptions matters more than methodological sophistication


21.8 Causal Discovery: Learning Structure from Data

The Promise and the Problem

Most of this chapter assumes the causal structure is known: we know which variables confound, which mediate, which are colliders. Methods like DML and causal forests estimate effects given this structure. But can we learn the causal structure itself from data?

Causal discovery (also called causal structure learning) attempts exactly this. The field emerged from computer science and philosophy, particularly the work of Spirtes, Glymour, Scheines, and Pearl in the 1990s.

The appeal is obvious: instead of assuming a DAG, learn it. But the fundamental challenge is severe:

Observational Equivalence Problem

Multiple DAGs can generate identical observational distributions. These form Markov equivalence classes. Without additional assumptions or interventions, we can at best identify the equivalence class, not the true DAG.

For example, these three DAGs are Markov equivalent (they imply identical conditional independence relationships):

  • $X \to Y \to Z$ (chain)

  • $X \leftarrow Y \to Z$ (fork)

  • $X \leftarrow Y \leftarrow Z$ (reverse chain)

Observational data alone cannot distinguish them.
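
A simulation makes the point concrete: a chain and a fork produce the same marginal dependence and the same conditional independence, so no test on observational data separates them (illustrative code):

```python
# Minimal sketch: a chain and a fork imply the same CI pattern (X ⊥ Z | Y).
import numpy as np

rng = np.random.default_rng(9)
n = 50_000

def partial_corr_xz_given_y(X, Y, Z):
    """Correlation of X and Z after linearly removing Y from each."""
    rx = X - np.polyval(np.polyfit(Y, X, 1), Y)
    rz = Z - np.polyval(np.polyfit(Y, Z, 1), Z)
    return np.corrcoef(rx, rz)[0, 1]

# Chain: X -> Y -> Z
X = rng.normal(size=n)
Y = X + rng.normal(size=n)
Z = Y + rng.normal(size=n)
print("chain:", np.corrcoef(X, Z)[0, 1], partial_corr_xz_given_y(X, Y, Z))

# Fork: X <- Y -> Z
Y = rng.normal(size=n)
X = Y + rng.normal(size=n)
Z = Y + rng.normal(size=n)
print("fork: ", np.corrcoef(X, Z)[0, 1], partial_corr_xz_given_y(X, Y, Z))
# Both print a sizable marginal correlation and a near-zero partial correlation.
```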

Constraint-Based Methods

Constraint-based algorithms use conditional independence tests to learn causal structure:

PC Algorithm (named for its creators Peter Spirtes and Clark Glymour):

  1. Start with a complete undirected graph (all variables connected)

  2. Remove edges between conditionally independent variables

  3. Orient edges using v-structure patterns (colliders are identifiable)

  4. Propagate orientation constraints

The PC algorithm outputs a CPDAG (completed partially directed acyclic graph) representing the Markov equivalence class—some edges directed, some undetermined.
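
A minimal usage sketch, assuming the Python causal-learn package's pc function with its defaults (Fisher-z conditional independence test); output formats vary by version:

```python
# Minimal sketch, assuming causal-learn's PC implementation.
import numpy as np
from causallearn.search.ConstraintBased.PC import pc

rng = np.random.default_rng(10)
n = 5_000
X = rng.normal(size=n)
Y = X + rng.normal(size=n)
Z = Y + rng.normal(size=n)          # true structure: X -> Y -> Z
data = np.column_stack([X, Y, Z])

cg = pc(data, alpha=0.05)           # returns a CPDAG (equivalence class)
print(cg.G)                         # edges, some possibly left undirected
```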

FCI (Fast Causal Inference) extends PC to handle latent confounders, outputting a PAG (partial ancestral graph) that represents uncertainty about both edge direction and latent variables.

Strengths:

  • Principled; exploits conditional independence structure

  • Handles high-dimensional data with proper sparsity assumptions

  • Implementations available (pcalg in R, causal-learn in Python)

Limitations:

  • Requires the faithfulness assumption: all independencies in the data are due to the DAG structure (no "accidental" cancellations)

  • Sensitive to errors in conditional independence testing

  • Cannot orient all edges (equivalence class problem)

  • Assumes causal sufficiency (no latent confounders) for basic PC

Score-Based Methods

Score-based algorithms search over DAG space to optimize a scoring criterion:

  1. Define a score (e.g., BIC, BGe score for Bayesian scoring)

  2. Search over possible DAGs

  3. Return highest-scoring DAG(s)

GES (Greedy Equivalence Search) efficiently searches by adding then removing edges, operating on equivalence classes.

Strengths:

  • Can incorporate prior knowledge through priors

  • Naturally handles model comparison

  • Less sensitive to individual test errors than constraint-based

Limitations:

  • Computationally intensive (DAG space is super-exponential in the number of variables)

  • Score equivalence: Markov equivalent DAGs have equal scores

  • May find local optima

Modern Approaches

Recent work combines ideas and enables scaling:

NOTEARS (Zheng et al. 2018) treats structure learning as continuous optimization:

  • Parameterize the adjacency matrix as continuous

  • Add an acyclicity constraint (via trace of matrix exponential)

  • Optimize with gradient descent

This enables scaling to hundreds of variables and integration with deep learning.
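
The acyclicity function is easy to state in code: $h(W) = \text{tr}(e^{W \circ W}) - d$, which is zero exactly when $W$ encodes a DAG (a sketch with numpy/scipy):

```python
# Minimal sketch of NOTEARS's acyclicity function h(W) = tr(exp(W ∘ W)) - d.
import numpy as np
from scipy.linalg import expm

def h(W):
    d = W.shape[0]
    return np.trace(expm(W * W)) - d   # W * W is the elementwise (Hadamard) product

acyclic = np.array([[0.0, 1.0, 0.0],
                    [0.0, 0.0, 1.0],
                    [0.0, 0.0, 0.0]])  # X -> Y -> Z
cyclic = acyclic.copy()
cyclic[2, 0] = 1.0                     # add Z -> X, creating a cycle

print(h(acyclic))  # ≈ 0
print(h(cyclic))   # > 0, penalized during optimization
```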

Causal discovery with interventional data: When we observe some experimental interventions, the equivalence class shrinks. Active learning approaches design interventions to maximally resolve structural uncertainty.

Connection to Econometrics

Economists have generally been skeptical of purely algorithmic structure learning, preferring theory-guided identification. But connections exist:

David Hendry's general-to-specific modeling shares the spirit of algorithmic structure learning (see Chapter 16). Start with a general model, test down to a parsimonious specification. The Autometrics algorithm automates this with economic theory providing constraints.

DAG + theory: Some researchers use DAGs to represent theoretical structure, then test implied conditional independencies. Rejection suggests model misspecification. This uses causal discovery tools for model diagnostics rather than discovery.

Semi-parametric bounds: When structure is partially known, causal discovery can narrow the set of consistent structures, tightening partial identification bounds.

When Is Causal Discovery Useful?

Appropriate uses:

  • Exploratory analysis: Generating hypotheses about causal structure

  • Model diagnostics: Testing whether assumed structure is consistent with data

  • Very high-dimensional settings: When theory provides insufficient guidance

  • Complement to theory: Cross-check theoretical models against data patterns

Inappropriate uses:

  • Substitute for theory: Algorithms cannot replace substantive knowledge

  • Definitive causal conclusions: Equivalence classes, faithfulness violations, and finite-sample error limit certainty

  • Without domain expertise: Results require expert interpretation

Practical Guidance

Treat causal discovery as hypothesis-generating, not hypothesis-confirming. Use it to suggest structures for further investigation, not to establish causation. Always assess whether algorithmic output makes substantive sense.

For Further Study

Spirtes, Glymour & Scheines (2000), Causation, Prediction, and Search, 2nd ed. The foundational text.

Peters, Janzing & Schölkopf (2017), Elements of Causal Inference. Modern treatment from ML perspective.

Heinze-Deml, Maathuis & Meinshausen (2018), "Causal Structure Learning," Annual Review of Statistics. Accessible survey.


Practical Guidance

Choosing a Method

| Situation | Recommended Approach |
| --- | --- |
| ATE with high-dimensional $X$ | DML or TMLE |
| Heterogeneous effects | Causal forests (GRF) |
| Best prediction of outcomes | Super Learner |
| Few covariates, clear model | Standard regression |
| Pure prediction policy problem | Standard ML |
| Identification concerns | Fix design first, then consider ML |
| Structure unknown, exploratory | Causal discovery (hypothesis-generating) |

Common Pitfalls

Pitfall 1: Using ML to "control for" unobserved confounders. ML learns from observed data. If confounders are unobserved, ML cannot adjust for them.

How to avoid: Be clear about what ML does and doesn't do. It estimates functions of observed variables; it doesn't observe the unobserved.

Pitfall 2: Treating ML output as causal without design. "The random forest says $D$ is important for predicting $Y$" is not a causal statement.

How to avoid: Use causal ML methods (DML, causal forests) with appropriate identification assumptions, not off-the-shelf predictive ML.

Pitfall 3: Ignoring sample splitting. Using the same data to estimate nuisance functions and treatment effects biases inference.

How to avoid: Always use cross-fitting or sample splitting in DML/TMLE.

Pitfall 4: Black-box heterogeneity. Causal forests find heterogeneity, but the results may be hard to interpret or validate.

How to avoid: Validate with held-out data; test calibration; examine which variables drive heterogeneity.

Implementation Checklist

  • State and defend the identification assumption (e.g., unconfoundedness given $X$) before choosing an estimator

  • Check overlap: inspect the distribution of estimated propensity scores and trim or flag extreme values

  • Use cross-fitting or sample splitting for every ML-estimated nuisance function

  • Re-estimate with several learners (LASSO, random forest, boosting) and check that results are stable

  • Report influence-function-based or asymptotic standard errors, not naive ones

  • For heterogeneity, validate CATE estimates on held-out data and test calibration

Running Example: Returns to Education with ML

The Setting

Estimating returns to education with high-dimensional controls:

  • Many potential confounders: family background, ability proxies, location, cohort

  • Possible nonlinear and interaction effects

  • Unknown functional form for propensity score and outcome model

Traditional Approach

OLS with researcher-chosen controls. Risk: omit important confounders, misspecify functional form.

DML Approach

  1. Use LASSO/random forest to estimate $E[Y|X]$ (earnings given all covariates except education)

  2. Use LASSO/random forest to estimate $E[D|X]$ (education given covariates)

  3. Orthogonalize and estimate treatment effect on residuals

Advantages: Flexibly controls for many covariates; doesn't require choosing specification.

Finding: Belloni et al. (2014) apply DML-type methods to returns to education, finding estimates similar to careful OLS but with more robust standard errors.

Causal Forest for Heterogeneity

  1. Estimate causal forest with education as treatment, earnings as outcome

  2. Examine CATE across covariate space

  3. Identify who benefits most from additional education

Finding: Returns may be higher for disadvantaged students (who are marginal), suggesting education has redistributive potential.

Caveats

ML doesn't solve the fundamental identification problem:

  • If ability is unobserved and affects both education and earnings, ML estimates are biased

  • IV or other designs still needed for credible identification

  • ML complements but doesn't replace research design


Integration Note

Connections to Other Methods

| Method | Relationship | See Chapter |
| --- | --- | --- |
| Selection on Observables | DML/TMLE extend these methods with ML | Ch. 11 |
| Heterogeneity | Causal forests for CATE | Ch. 20 |
| IV | Can combine DML with IV for LATE | Ch. 12 |
| DiD | ML for covariate adjustment in DiD | Ch. 13 |

Triangulation Strategies

ML-based estimates gain credibility when:

  1. Comparison with traditional methods: DML and OLS give similar estimates

  2. Robustness to ML method: Results stable across LASSO, random forest, boosting

  3. Validation on held-out data: Predictions generalize

  4. Calibration tests: CATE estimates actually predict effect variation

  5. Design-based evidence: ML estimates align with RCT or natural experiment findings


Summary

Key takeaways:

  1. Prediction ≠ Causation: Standard ML learns associations, not causal effects. Using ML output as causal requires additional structure.

  2. ML for nuisance functions: ML excels at estimating propensity scores and outcome models—the "nuisance" functions in causal inference.

  3. Double ML uses orthogonalization and sample splitting to get valid causal estimates despite using ML for nuisance functions.

  4. Causal forests extend random forests to estimate heterogeneous treatment effects, with honest estimation enabling valid inference.

  5. TMLE targets initial ML estimates toward the causal parameter, providing doubly robust and efficient estimation.

  6. Prediction policy problems sometimes substitute for causal questions—but only when the decision doesn't change the outcome-generating process.

  7. ML complements, doesn't replace, research design: Good identification is still essential. ML improves estimation of identified parameters; it doesn't identify the unidentified.

Returning to the opening question: Machine learning can help answer causal questions, but not by treating causation as prediction. The contribution of ML is in flexibly estimating the nuisance functions (propensity scores, outcome models) that feed into causal estimators, and in discovering heterogeneous effects. The hard work of identification—ensuring we compare like with like—still requires research design.


Further Reading

Essential

  • Athey and Imbens (2019), "Machine Learning Methods That Economists Should Know About" - Overview for economists

  • Chernozhukov et al. (2018), "Double/Debiased Machine Learning" - DML methodology

For Deeper Understanding

  • Wager and Athey (2018), "Estimation and Inference of Heterogeneous Treatment Effects using Random Forests" - Causal forests

  • van der Laan and Rose (2011), Targeted Learning - TMLE textbook

  • Athey, Tibshirani, and Wager (2019), "Generalized Random Forests" - GRF framework

Advanced/Specialized

  • Kennedy (2022), "Semiparametric Doubly Robust Targeted Double Machine Learning" - Combining approaches

  • Künzel et al. (2019), "Metalearners for Estimating Heterogeneous Treatment Effects" - Meta-learner comparison

  • Nie and Wager (2021), "Quasi-Oracle Estimation of Heterogeneous Treatment Effects" - R-learner

Applications

  • Kleinberg et al. (2015), "Prediction Policy Problems" - When prediction suffices

  • Belloni et al. (2014), "High-Dimensional Methods and Inference on Structural and Treatment Effects" - LASSO for treatment effects

  • Davis and Heller (2017), "Using Causal Forests to Predict Treatment Heterogeneity" - Causal forests in practice

  • Imbens & Xu (2025), "Comparing Experimental and Nonexperimental Methods" JEP. Demonstrates DML, AIPW-GRF, and causal forests on LaLonde data; shows these methods estimate ATT well with overlap but struggle with CATE. Tutorial and replication data at https://yiqingxu.org/tutorials/lalonde/.


Exercises

Conceptual

  1. Explain why regularization (LASSO, Ridge) applied to a regression of $Y$ on $(X, D)$ does not give a valid estimate of the causal effect of $D$. What goes wrong?

  2. What is the role of sample splitting in Double ML? What would happen without it?

  3. Distinguish between a prediction policy problem and a causal inference problem. Give an example of each in the context of healthcare.

Applied

  1. Using a dataset with treatment, outcome, and many covariates:

    • Estimate the ATE using OLS with researcher-selected controls

    • Estimate using DML with LASSO for nuisance functions

    • Estimate using DML with random forest for nuisance functions

    • Compare estimates and standard errors across methods

  2. Implement a causal forest on a dataset with known heterogeneity:

    • Estimate CATEs

    • Identify the most important variables for heterogeneity

    • Test calibration: do high-$\hat{\tau}(x)$ individuals actually have larger effects?

Discussion

  1. A researcher argues: "ML is a black box. I'd rather use simple methods I understand than complex methods that might be doing something I don't understand." Another responds: "Simple methods impose functional form assumptions that are surely wrong. ML is more honest about our ignorance." Who is right?


Appendix 21A: The Neyman Orthogonality Condition

Why Orthogonality Matters

The key to DML is the Neyman orthogonality condition: the influence function for the treatment effect should be orthogonal to the nuisance function estimation error.

Formally: Let $\theta$ be the treatment effect and $\eta$ be nuisance parameters (propensity score, outcome model). The estimating equation $\psi(W; \theta, \eta)$ satisfies Neyman orthogonality if:

$$\frac{\partial}{\partial \eta} E[\psi(W; \theta_0, \eta)] \bigg|_{\eta = \eta_0} = 0$$

Intuition: Small errors in estimating $\eta$ don't bias estimation of $\theta$.

The AIPW Moment

The AIPW moment function:

$$\psi(W; \theta, g, \mu_0, \mu_1) = \mu_1(X) - \mu_0(X) + \frac{D(Y - \mu_1(X))}{g(X)} - \frac{(1-D)(Y - \mu_0(X))}{1-g(X)} - \theta$$

satisfies Neyman orthogonality with respect to $(g, \mu_0, \mu_1)$.

This is why DML based on AIPW achieves $\sqrt{n}$-consistency even with slower-converging ML estimates of nuisance functions.
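
A numerical illustration (simulated data; the perturbation scheme is illustrative): perturb both nuisance functions by an amount $t$ and compare the plug-in outcome-model estimator with the AIPW estimator. The plug-in bias grows linearly in $t$; the AIPW bias is second-order:

```python
# Minimal sketch: Neyman orthogonality makes AIPW insensitive, to first
# order, to errors in the nuisance functions.
import numpy as np

rng = np.random.default_rng(11)
n = 200_000
X = rng.normal(size=n)
g = 1 / (1 + np.exp(-X))                    # true propensity score
D = rng.binomial(1, g)
Y = 2.0 * D + X + rng.normal(size=n)        # true ATE = 2.0
mu1, mu0 = 2.0 + X, X                       # true outcome regressions

for t in [0.0, 0.1, 0.2, 0.4]:
    m1_t = mu1 + t * X**2                                # perturbed outcome model
    g_t = np.clip(g + 0.1 * t * np.tanh(X), 0.01, 0.99)  # perturbed propensity
    plug_in = np.mean(m1_t - mu0)                        # g-formula plug-in
    aipw = np.mean(m1_t - mu0
                   + D * (Y - m1_t) / g_t
                   - (1 - D) * (Y - mu0) / (1 - g_t))
    print(f"t={t:.1f}  plug-in bias={plug_in - 2.0:+.3f}  AIPW bias={aipw - 2.0:+.3f}")
# The plug-in bias grows linearly in t; the AIPW bias is a product of the two
# nuisance errors, so it grows only quadratically (Neyman orthogonality).
```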
