Chapter 11: Selection on Observables
Opening Question
When can we credibly estimate causal effects from observational data by controlling for the right variables—and how do we know if we have?
Chapter Overview
Randomized experiments are not always feasible. When they are not, researchers often turn to observational data—data where treatment was not randomly assigned but chosen by individuals, institutions, or circumstance. Can we still learn about causal effects?
The answer depends on a critical assumption: that we observe all the variables that jointly affect treatment and outcomes. If we do—if "selection is on observables"—then adjusting for these confounders recovers causal effects. This chapter develops methods for such adjustment: regression, matching, propensity score weighting, and doubly robust estimation. We also confront the uncomfortable truth that the key assumption is untestable, and develop sensitivity analyses to assess how much unobserved confounding would be needed to overturn our conclusions.
Selection on observables methods are ubiquitous in applied research. They are also frequently misused. The goal of this chapter is not just to teach the methods, but to develop judgment about when they are credible and how to defend—or critique—their application.
What you will learn:
When regression identifies causal effects (and when it does not)
How matching and propensity score methods work
The logic of doubly robust estimation
How to conduct and interpret sensitivity analyses
When selection on observables is a credible assumption
Prerequisites: Chapter 9 (The Causal Framework), Chapter 3 (Statistical Foundations)
11.1 Regression for Causal Inference
When Does OLS Identify Causal Effects?
Regression is the workhorse of empirical research. But when does a regression coefficient have a causal interpretation?
Consider the regression: Yi=α+τDi+Xi′β+εi
The coefficient τ estimates the causal effect of D on Y if:
Assumption 11.1: Conditional Unconfoundedness
(Y(0),Y(1))⊥D∣X
Conditional on observed covariates X, treatment assignment is independent of potential outcomes.
This is the same as unconfoundedness from Chapter 9. In regression terms, it means that after controlling for X, the remaining variation in D is as good as random.
When might this hold?
When X includes all variables that affect both treatment selection and outcomes
When institutional knowledge suggests that selection depends only on observables
When rich administrative data captures the selection process
When does it fail?
When unobserved factors (ability, motivation, preferences) affect both D and Y
When selection depends on private information not in the data
Almost always, to some degree
The Bad Controls Problem
A common mistake is to "control for everything available." This can introduce bias rather than remove it.
Bad controls are variables that are affected by treatment or that open backdoor paths:
Post-treatment variables: Controlling for a consequence of treatment blocks part of the causal effect.
Mediators: If D→M→Y, controlling for M removes the indirect effect.
Colliders: Controlling for a common effect of two variables creates spurious association between them.
Example: Bad Control
Estimating the effect of education on wages, you control for occupation. But occupation is partly caused by education. Controlling for it asks: "What is the effect of education, holding occupation fixed?" This removes the channel through which education raises wages (better jobs) and may reverse the sign of the effect.
The DAG test: Draw the causal graph. A variable is a bad control if:
It is a descendant of treatment (post-treatment)
It is a collider on a path between treatment and outcome
Conditioning on it opens a previously blocked path
Functional Form Sensitivity
Regression imposes functional form assumptions. With continuous covariates or treatment:
E[Y∣D,X]=α+τD+X′β
assumes linearity and additivity. If the true relationship is nonlinear or includes interactions, the estimate of τ depends on the distribution of X and need not equal the ATE.
Solutions:
Include flexible functions of X (polynomials, splines)
Interact treatment with covariates
Use nonparametric methods (matching, weighting)
Practical Guidance: Regression for Causal Inference
Regression is appropriate when:
Unconfoundedness is plausible given X
The functional form is approximately correct
You avoid bad controls
Be cautious when:
Key confounders are likely unobserved
The treatment effect may vary with X
Covariate distributions differ substantially between treated and control
Box: G-Computation—The Epidemiological Perspective
Epidemiologists call the outcome-modeling approach g-computation or standardization—part of the broader "g-methods" tradition developed by James Robins and colleagues (see Section 9.7 for context on the epidemiological contribution to causal inference). The "g" stands for "generalized," referring to Robins' (1986) generalization to time-varying treatments.
The g-computation algorithm:
Fit an outcome model: E^[Y∣D,X]
For each unit, predict outcomes under treatment: Y^i(1)=E^[Y∣D=1,Xi]
For each unit, predict outcomes under control: Y^i(0)=E^[Y∣D=0,Xi]
Average: ATE^ = (1/n) ∑i [Y^i(1) − Y^i(0)]
This is equivalent to the regression approach when the outcome model is correctly specified and treatment effects are homogeneous. But g-computation naturally accommodates:
Nonlinear outcome models (logistic, Poisson)
Treatment effect heterogeneity (via interactions)
Marginal effects that differ from conditional effects
The key insight: instead of interpreting a regression coefficient, compute predicted outcomes under each treatment scenario and compare them. This "plug-in" approach is conceptually cleaner when effects are heterogeneous or outcomes are nonlinear.
See Hernán & Robins (2020), Chapter 13, for full treatment including time-varying extensions.
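The following sketch shows the four steps in R, using simulated data with illustrative variable names (y, d, x1, x2); it is a minimal illustration rather than a full implementation.

```r
# A minimal g-computation sketch (simulated data; variable names are illustrative)
set.seed(42)
n  <- 2000
x1 <- rnorm(n); x2 <- rbinom(n, 1, 0.4)
d  <- rbinom(n, 1, plogis(-0.5 + 0.8 * x1 + 0.6 * x2))    # confounded treatment
y  <- 1 + 2 * d + 1.5 * x1 - x2 + 0.5 * d * x1 + rnorm(n) # heterogeneous effect
dat <- data.frame(y, d, x1, x2)

# Step 1: fit an outcome model (interactions allow effect heterogeneity)
fit <- lm(y ~ d * (x1 + x2), data = dat)

# Steps 2-3: predict potential outcomes for every unit under d = 1 and d = 0
y1_hat <- predict(fit, newdata = transform(dat, d = 1))
y0_hat <- predict(fit, newdata = transform(dat, d = 0))

# Step 4: average the unit-level differences to get the ATE
ate_gcomp <- mean(y1_hat - y0_hat)
ate_gcomp
```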
Box: Marginal Effects—What Are We Actually Estimating?
The term "marginal effect" means different things to different disciplines, creating confusion that has persisted for decades. When a researcher reports a "marginal effect," they might mean any of the following:
Conditional effect: ∂E[Y∣X]/∂X evaluated at specific values of X
Marginal effect at the mean (MEM): the effect evaluated at X = X̄; differs from the AME when the model is nonlinear
Average marginal effect (AME): (1/n) ∑i ∂E[Y∣Xi]/∂X, the derivative averaged over the sample
In a linear model (Y=α+βX+ε), these all equal β. But with nonlinear models—logit, probit, Poisson, or any model with interactions—they diverge, sometimes substantially.
Why this matters for causal inference: The causal estimands we care about—ATE, ATT—are defined as averages over populations. The ATE is E[Y(1)−Y(0)], which corresponds to the AME, not the MEM. When you report a logit coefficient or an effect "at the mean," you're not reporting the ATE.
Consider a logistic regression for employment (Y) on a training program (D):
The coefficient β^ is a log odds ratio—not directly interpretable as a probability change
The MEM evaluates ∂P(Y=1)/∂D at mean covariate values—but no one has exactly average characteristics
The AME averages the marginal effect across all individuals—this estimates the ATE under unconfoundedness
The g-computation algorithm in the previous box computes the AME: predict outcomes for everyone under treatment, predict under control, and average the difference. This is what we want for causal inference.
Practical implications:
For linear models with no interactions: report regression coefficients
For nonlinear models: report AMEs, not coefficients or MEMs
For models with interactions: compute effects at substantively meaningful covariate values, or report AMEs
Always clarify which quantity you're reporting
Terminology warning: The word "marginal" is overloaded in statistics. In the AME/MEM context above, "marginal" means a derivative (∂y/∂x). But in multilevel/mixed models, "marginal" means something entirely different: population-average effects (integrating over the distribution of random effects) versus conditional effects (for a typical cluster with random effects set to zero). These are distinct concepts that happen to share a name. When reading papers using mixed models, check which meaning applies—the marginal (population-average) effect in a GLMM is not the same as the AME from this box, though both involve averaging.
The marginaleffects package (Arel-Bundock, Greifer & Heiss 2024) provides a unified framework for computing predictions, comparisons, and slopes across more than 100 model types in R and Python; see Chapter 18 for implementation. For a thorough treatment of the terminology and the pitfalls of each approach, see Arel-Bundock, Greifer & Heiss (2024).
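As a minimal illustration, the sketch below fits a logit on simulated data (illustrative variable names) and computes the AME, assuming the package's avg_comparisons() interface is available.

```r
# Sketch: average marginal effect of a binary treatment in a logit model
# (assumes marginaleffects::avg_comparisons(); data and names are illustrative)
library(marginaleffects)

set.seed(1)
n        <- 1500
age      <- rnorm(n, 35, 8)
educ     <- rnorm(n, 12, 2)
trained  <- rbinom(n, 1, plogis(-3 + 0.03 * age + 0.15 * educ))
employed <- rbinom(n, 1, plogis(-2 + 0.6 * trained + 0.02 * age + 0.12 * educ))
dat <- data.frame(employed, trained = factor(trained), age, educ)

logit_fit <- glm(employed ~ trained + age + educ, family = binomial, data = dat)

# The coefficient on `trained` is a log odds ratio; the AME below averages
# P_hat(Y=1 | trained=1, X_i) - P_hat(Y=1 | trained=0, X_i) over all units
avg_comparisons(logit_fit, variables = "trained")
```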
11.2 Propensity Score Methods
The Propensity Score
The propensity score is the probability of treatment given covariates:
e(X)=P(D=1∣X)
Rosenbaum and Rubin (1983) showed a remarkable result, building on ideas that developed simultaneously in statistics and epidemiology:
Theorem 11.1: Propensity Score Theorem
If (Y(0),Y(1))⊥D∣X, then (Y(0),Y(1))⊥D∣e(X)
Conditioning on the propensity score is sufficient for unconfoundedness.
Intuition: The propensity score summarizes all the information in X relevant for treatment assignment. Units with the same propensity score are equally likely to be treated, regardless of their specific covariate values. Comparing treated and control units with the same propensity score is like comparing within a mini-experiment.
Dimension reduction: Instead of matching on many covariates X, we can match on a single scalar e(X).
Estimating the Propensity Score
The propensity score is typically estimated using logistic regression:
log(e(X) / (1 − e(X))) = X′γ
Then e^(Xi) = logit⁻¹(Xi′γ^).
Alternatives:
Probit regression
Machine learning methods (random forests, boosting, LASSO)
Covariate balancing propensity scores (CBPS)
The goal is not to predict treatment well in a forecasting sense, but to achieve covariate balance between treated and control groups after adjustment.
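A minimal sketch of propensity score estimation by logistic regression, using simulated data with illustrative variable names:

```r
# Sketch: estimating propensity scores by logistic regression
set.seed(7)
n  <- 2000
x1 <- rnorm(n); x2 <- rbinom(n, 1, 0.5)
d  <- rbinom(n, 1, plogis(-1 + x1 + 0.7 * x2))
dat <- data.frame(d, x1, x2)

ps_fit <- glm(d ~ x1 + x2, family = binomial, data = dat)
dat$ps <- predict(ps_fit, type = "response")   # e_hat(X_i)

summary(dat$ps)   # look for values near 0 or 1 (potential overlap problems)
```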
Matching
Matching pairs treated units with similar control units:
For each treated unit, find control units with similar X or e(X)
Estimate the treatment effect by comparing matched pairs
Types of matching:
Exact: match on identical X. No approximation, but fails with many covariates.
Nearest neighbor: match to the closest unit(s). Simple, but may use poor matches.
Caliper: match only within distance c. Avoids bad matches, but may discard units.
Kernel: weight all controls by distance. Uses all data, but requires a bandwidth choice.
Coarsened exact: exact match on coarsened X. Transparent, but sensitive to the coarsening.
With or without replacement:
With replacement: Each control can match multiple treated units. More flexible but inference is more complex.
Without replacement: Each control matches at most one treated unit. Order of matching matters.
ATT Matching Estimator
τ^ATT = (1/n1) ∑{i: Di=1} (Yi − Y^i(0))
where Y^i(0) is the (average) outcome of matched control units.
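A minimal sketch of 1:1 nearest-neighbor propensity score matching for the ATT, assuming the MatchIt package's matchit() and match.data() interface (simulated data, illustrative names):

```r
# Sketch: 1:1 nearest-neighbor propensity score matching for the ATT
library(MatchIt)

set.seed(7)
n  <- 2000
x1 <- rnorm(n); x2 <- rbinom(n, 1, 0.5)
d  <- rbinom(n, 1, plogis(-1 + x1 + 0.7 * x2))
y  <- 1 + 2 * d + x1 - 0.5 * x2 + rnorm(n)
dat <- data.frame(y, d, x1, x2)

m_out <- matchit(d ~ x1 + x2, data = dat, method = "nearest", distance = "glm")
summary(m_out)               # covariate balance before and after matching
m_dat <- match.data(m_out)   # the matched sample

# ATT: difference in means within the matched sample (1:1 without replacement)
with(m_dat, mean(y[d == 1]) - mean(y[d == 0]))
```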
Box: The Evolution from Matching to Doubly Robust Methods
Matching was the dominant approach in the 2000s, following influential papers by Rosenbaum and Rubin and by Dehejia and Wahba. However, modern best practice has shifted toward doubly robust (AIPW) and weighting-based methods for several reasons:
Limitations of matching:
Matching discards data: Unmatched controls are thrown away, reducing precision
Arbitrary matching choices: Caliper width, with/without replacement, matching order—all affect results
Variance estimation is complex: Standard errors for matched estimators require special adjustments (Abadie & Imbens, 2006)
No double robustness: Misspecify the matching metric and estimates are biased
Advantages of AIPW/weighting:
Uses all data: Every observation contributes (with appropriate weights)
Double robustness: Consistent if either the propensity or outcome model is correct
Clean inference: Standard variance estimation works; easy to bootstrap
Natural integration with ML: Cross-fitting + AIPW enables flexible estimation (DML)
Current recommendation:
AIPW/TMLE: default choice for most applications
IPW: when only the propensity model is credible
Regression: when only the outcome model is credible
Matching: for transparency, sample construction, or when stakeholders expect it
Matching remains valuable for sample construction (creating a matched sample for subsequent analysis) and transparency (showing exactly which units are compared). But for primary causal estimation, doubly robust methods are now preferred.
Inverse Probability Weighting (IPW)
IPW reweights observations to create a pseudo-population where treatment is independent of covariates:
IPW Estimator for ATE
τ^IPW = (1/n) ∑i [ DiYi / e^(Xi) − (1−Di)Yi / (1 − e^(Xi)) ]
Intuition: Treated units with low propensity scores are "surprising"—they were unlikely to be treated but were. They carry more information about what control units would have experienced under treatment, so they receive higher weight.
Weights:
For ATE: wi = Di / e^(Xi) + (1−Di) / (1 − e^(Xi))
For ATT: wi = Di + (1−Di) e^(Xi) / (1 − e^(Xi))
Problems with IPW:
Extreme propensity scores create extreme weights
Estimates can be highly variable
Sensitive to propensity score misspecification
Solutions:
Trim observations with propensity scores near 0 or 1
Truncate weights at some maximum
Use stabilized weights: P(D)/e^(X) instead of 1/e^(X)
Use doubly robust methods (Section 11.3)
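Putting these pieces together, a minimal sketch of an IPW estimate of the ATE with trimming, on simulated data with illustrative names:

```r
# Sketch: IPW estimate of the ATE with trimming of extreme propensity scores
set.seed(7)
n  <- 2000
x1 <- rnorm(n); x2 <- rbinom(n, 1, 0.5)
d  <- rbinom(n, 1, plogis(-1 + x1 + 0.7 * x2))
y  <- 1 + 2 * d + x1 - 0.5 * x2 + rnorm(n)
ps <- fitted(glm(d ~ x1 + x2, family = binomial))

keep <- ps > 0.1 & ps < 0.9                  # trim extreme propensity scores
w    <- d / ps + (1 - d) / (1 - ps)          # ATE weights

# Horvitz-Thompson style IPW estimate on the trimmed sample
mean((d * y / ps)[keep]) - mean(((1 - d) * y / (1 - ps))[keep])

# A Hajek (normalized-weights) version, usually more stable:
weighted.mean(y[d == 1 & keep], w[d == 1 & keep]) -
  weighted.mean(y[d == 0 & keep], w[d == 0 & keep])
```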
Subclassification (Stratification)
Subclassification groups units into strata by propensity score, then estimates effects within strata:
Estimate propensity scores
Divide into K strata (often 5-10 quantiles)
Estimate treatment effect within each stratum
Average across strata, weighting by stratum size
This is less sensitive to extreme weights than IPW but imposes discretization.
Covariate Balance Diagnostics
The key diagnostic: Check whether adjustment achieves covariate balance between treated and control.
Before adjustment, treated and control groups typically differ on X. After matching or weighting, they should not.
Standardized differences: dj = (X̄j1 − X̄j0) / √((s²j1 + s²j0) / 2)
Rules of thumb:
∣d∣<0.1: Good balance
∣d∣<0.25: Acceptable
∣d∣>0.25: Poor balance; investigate
Variance ratios: Check that variances are similar after matching.
Visual inspection: Plot covariate distributions before and after matching.
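A minimal sketch of computing standardized differences by hand (packages such as cobalt automate these diagnostics); variable names are illustrative:

```r
# Sketch: standardized mean differences before and after weighting
# x is a covariate, d the treatment indicator, w optional adjustment weights
std_diff <- function(x, d, w = rep(1, length(x))) {
  m1 <- weighted.mean(x[d == 1], w[d == 1])
  m0 <- weighted.mean(x[d == 0], w[d == 0])
  s1 <- var(x[d == 1]); s0 <- var(x[d == 0])  # unadjusted variances, per convention
  (m1 - m0) / sqrt((s1 + s0) / 2)
}

# Example usage with the simulated data and IPW weights from the sketches above:
# std_diff(x1, d)       # before adjustment
# std_diff(x1, d, w)    # after weighting -- aim for |d| < 0.1
```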
Practical Box: Balance Assessment Checklist

Box: Assessing Overlap with Propensity Score Distributions
Balance diagnostics check whether covariate means are similar after adjustment. But a more fundamental question is whether treated and control groups share common support—whether there are comparable units in both groups across the range of propensity scores.
The most informative diagnostic is a histogram (or density plot) of propensity scores by treatment status:
What to look for:
Good overlap: Distributions largely coincide; most treated units have propensity scores where control units also exist (and vice versa)
Poor overlap: Distributions are separated; treated units cluster at high propensity scores with few controls nearby, or vice versa
Log-odds scale: For clearer visualization, especially when propensity scores cluster near 0 or 1, plot the log-odds: log(e^/(1−e^)). A log-odds of −3 corresponds to roughly p=0.05; a log-odds of 3 corresponds to roughly p=0.95.
When overlap is poor:
Estimates rely on extrapolation, not comparable units
Different estimators can yield wildly different results
Trimming (removing units with extreme propensity scores) often improves robustness more than choosing a fancier estimator
Imbens & Xu (2025) demonstrate this vividly: with the LaLonde data, severe overlap problems cause estimates to vary from −$8,000 to +$2,000 depending on method. After trimming to ensure overlap, all modern estimators converge to similar values. The lesson: ensuring overlap is often more important than estimator choice.
Trimming strategies:
Drop treated units with propensity scores below the minimum in the control group (Dehejia & Wahba 1999)
Drop units with propensity scores outside [0.1, 0.9] (Crump et al. 2009)
Use matching to construct a sample with good overlap by design
The loss of precision from trimming is typically modest; the gain in robustness can be substantial.
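A minimal sketch of an overlap plot in base R, reusing the kind of simulated data and fitted propensity scores from the earlier sketches:

```r
# Sketch: inspecting overlap by plotting propensity scores by treatment group
set.seed(7)
n  <- 2000
x1 <- rnorm(n); x2 <- rbinom(n, 1, 0.5)
d  <- rbinom(n, 1, plogis(-1 + x1 + 0.7 * x2))
ps <- fitted(glm(d ~ x1 + x2, family = binomial))

hist(ps[d == 1], breaks = 30, col = rgb(1, 0, 0, 0.4),
     xlim = c(0, 1), main = "Propensity score overlap", xlab = "e_hat(X)")
hist(ps[d == 0], breaks = 30, col = rgb(0, 0, 1, 0.4), add = TRUE)
legend("topright", legend = c("Treated", "Control"),
       fill = c(rgb(1, 0, 0, 0.4), rgb(0, 0, 1, 0.4)))

# The log-odds scale is often clearer when scores pile up near 0 or 1
log_odds <- qlogis(ps)
```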

11.3 Doubly Robust Estimation
The Problem with Single Methods
Both outcome regression and propensity score methods have limitations:
Outcome regression: Sensitive to misspecification of E[Y∣D,X]
IPW: Sensitive to misspecification of e(X), extreme weights
What if we could combine them to be robust to misspecification of either (but not both)? This insight, developed by James Robins, Andrea Rotnitzky, and colleagues in biostatistics during the 1990s, led to doubly robust methods—now standard in both epidemiology and economics.
Augmented Inverse Probability Weighting (AIPW)
The doubly robust or AIPW estimator combines outcome modeling and propensity scores:
AIPW Estimator
τ^AIPW = (1/n) ∑i [ μ^1(Xi) − μ^0(Xi) + Di(Yi − μ^1(Xi)) / e^(Xi) − (1−Di)(Yi − μ^0(Xi)) / (1 − e^(Xi)) ]
where μ^d(X)=E[Y∣D=d,X] are outcome models.
Intuition: Start with the outcome model prediction. Then correct for any remaining imbalance using IPW applied to the residuals.
Double Robustness Property
Theorem 11.2: Double Robustness
The AIPW estimator is consistent if either:
The outcome model μ^d(X) is correctly specified, OR
The propensity score e^(X) is correctly specified
Only one needs to be correct, not both.
This provides insurance against model misspecification. If you're unsure about functional form, doubly robust estimators hedge your bets.
Implementation
In R, AIPW can be computed by hand from fitted propensity score and outcome models (see the sketch below), or with packages such as AIPW or DoubleML. In Stata, the teffects aipw command provides a parametric implementation.
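A minimal hand-rolled AIPW sketch on simulated data with illustrative names and parametric nuisance models; dedicated packages add refinements such as cross-fitting and more careful variance estimation.

```r
# Sketch: hand-rolled AIPW estimate of the ATE
set.seed(7)
n  <- 2000
x1 <- rnorm(n); x2 <- rbinom(n, 1, 0.5)
d  <- rbinom(n, 1, plogis(-1 + x1 + 0.7 * x2))
y  <- 1 + 2 * d + x1 - 0.5 * x2 + rnorm(n)
dat <- data.frame(y, d, x1, x2)

# Nuisance models: propensity score and outcome regressions by treatment arm
ps  <- fitted(glm(d ~ x1 + x2, family = binomial, data = dat))
mu1 <- predict(lm(y ~ x1 + x2, data = subset(dat, d == 1)), newdata = dat)
mu0 <- predict(lm(y ~ x1 + x2, data = subset(dat, d == 0)), newdata = dat)

# AIPW: outcome-model prediction plus an IPW correction of its residuals
psi <- (mu1 - mu0) +
  d * (y - mu1) / ps -
  (1 - d) * (y - mu0) / (1 - ps)

mean(psi)            # point estimate
sd(psi) / sqrt(n)    # approximate standard error from the influence-function form
```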
Targeted Learning and TMLE
Targeted Maximum Likelihood Estimation (TMLE) is a sophisticated doubly robust approach that:
Fits an initial outcome model
Updates it using propensity score information
Targets the specific estimand of interest
TMLE has better finite-sample properties than basic AIPW and integrates well with machine learning (Super Learner).
Cross-Fitting: Essential for Machine Learning
Box: The Cross-Fitting Requirement (Double/Debiased Machine Learning)
When using machine learning methods (random forests, LASSO, boosting) to estimate propensity scores or outcome models, a critical issue arises: overfitting bias.
The problem: If you estimate e^(X) or μ^(X) and compute treatment effects on the same data, ML's flexibility leads to overfitting. The nuisance function estimates are too tailored to the specific sample, and the resulting treatment effect estimate is biased—even with doubly robust methods.
The solution: Cross-fitting (Chernozhukov et al. 2018):
Split the sample into K folds (typically 5-10)
For each fold k:
Estimate nuisance functions (e^, μ^) on all data except fold k
Predict nuisance values for fold k using these out-of-fold estimates
Compute treatment effects using the cross-fitted predictions
Average across folds
This is the core of Double/Debiased Machine Learning (DML).
Why it works: By estimating nuisance functions on different data than where they're applied, we avoid overfitting. Small errors in nuisance estimation don't create first-order bias in the treatment effect (this is called "Neyman orthogonality").
Practical implication: Standard implementations like Stata's teffects use parametric models and don't require cross-fitting. But if you use flexible ML methods:
In R: Use DoubleML, grf, or AIPW with Super Learner (these handle cross-fitting automatically)
In Python: Use econml (EconML) or DoubleML
Never use sklearn to fit a random forest propensity score and then apply it to the same data for weighting
See Chapter 21 for detailed coverage of DML and its theoretical foundations.
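A minimal sketch of K-fold cross-fitted AIPW, written out by hand to make the logic explicit; in practice the packages above handle this automatically, and the parametric nuisance models here would be replaced by flexible ML learners.

```r
# Sketch: K-fold cross-fitted AIPW (simulated data; illustrative names)
set.seed(7)
n  <- 2000
x1 <- rnorm(n); x2 <- rbinom(n, 1, 0.5)
d  <- rbinom(n, 1, plogis(-1 + x1 + 0.7 * x2))
y  <- 1 + 2 * d + x1 - 0.5 * x2 + rnorm(n)
dat <- data.frame(y, d, x1, x2)

K    <- 5
fold <- sample(rep(1:K, length.out = n))
psi  <- numeric(n)

for (k in 1:K) {
  train <- dat[fold != k, ]
  test  <- dat[fold == k, ]
  # Nuisance functions are fit on the other folds (swap in ML learners here)
  ps  <- predict(glm(d ~ x1 + x2, family = binomial, data = train),
                 newdata = test, type = "response")
  mu1 <- predict(lm(y ~ x1 + x2, data = subset(train, d == 1)), newdata = test)
  mu0 <- predict(lm(y ~ x1 + x2, data = subset(train, d == 0)), newdata = test)
  psi[fold == k] <- (mu1 - mu0) +
    test$d * (test$y - mu1) / ps -
    (1 - test$d) * (test$y - mu0) / (1 - ps)
}

mean(psi)            # cross-fitted ATE estimate
sd(psi) / sqrt(n)    # approximate standard error
```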
11.4 Continuous and Multivalued Treatments
Beyond Binary Treatment
Many treatments are not binary:
Years of education (continuous)
Drug dosage (continuous)
Type of program (multivalued)
The propensity score framework extends to these cases.
Generalized Propensity Score
For continuous treatment D, the generalized propensity score is:
r(d,X)=fD∣X(d∣X)
The conditional density of treatment given covariates.
Under weak unconfoundedness: Y(d)⊥D∣r(d,X)
Dose-Response Estimation
The dose-response function maps treatment intensity to expected outcome:
μ(d)=E[Y(d)]
Estimation approaches:
Stratify on GPS: Group observations by GPS value, estimate E[Y∣D=d] within groups
Inverse probability weighting: Weight by inverse of GPS (with stabilization)
Outcome modeling: Specify E[Y∣D,X] and average over covariate distribution
Implementation Note: IPW for Continuous Treatments
IPW for continuous treatments differs fundamentally from the binary case. With binary treatment, weights are based on propensity scores (probabilities). With continuous treatment, weights are based on probability density functions.
The stabilized weight for continuous treatment is: wi = fD(Di) / fD∣X(Di∣Xi)
where the numerator is the marginal density of treatment and the denominator is the conditional density given confounders. In practice, both are often assumed normal:
Numerator: Di∼N(Dˉ,σD2)
Denominator: Di∣Xi∼N(D^i,σε2) from a regression of D on X
Unlike binary IPW where extreme propensity scores (near 0 or 1) create problems, continuous IPW can produce extreme weights when an observation's treatment value is unlikely given their covariates. Stabilization and trimming remain important.
In R, the WeightIt package handles continuous treatments with method = "ps" and an appropriate family specification. See Chapter 18 for implementation and Hirano & Imbens (2004) for the theoretical foundations.
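A minimal sketch of the normal-density stabilized weights described above, on simulated data with illustrative names; the final weighted regression of the outcome on the dose is one simple way to summarize the marginal dose-response.

```r
# Sketch: stabilized weights for a continuous treatment using normal densities
set.seed(7)
n <- 2000
x <- rnorm(n)
d <- 0.5 * x + rnorm(n)            # continuous "dose", confounded by x
y <- 1 + 0.8 * d + x + rnorm(n)

# Denominator: conditional density f(D | X) from a regression of D on X
den_fit <- lm(d ~ x)
den <- dnorm(d, mean = fitted(den_fit), sd = sigma(den_fit))

# Numerator: marginal density f(D)
num <- dnorm(d, mean = mean(d), sd = sd(d))

w <- num / den                     # stabilized weights
w <- pmin(w, quantile(w, 0.99))    # trim extreme weights

# Weighted regression of the outcome on the dose summarizes the dose-response
coef(lm(y ~ d, weights = w))
```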
Multivalued Treatments
With J treatment levels, estimate separate propensity scores:
ej(X)=P(D=j∣X)
Then use multinomial IPW or matching within propensity strata.
11.5 Sensitivity Analysis
The Core Problem
Unconfoundedness cannot be tested. No matter how many covariates we include, there may be unobserved factors that bias our estimates.
Sensitivity analysis asks: How much unobserved confounding would be needed to change our conclusions?
Rosenbaum Bounds
Rosenbaum's approach asks: If unobserved confounding exists, how large would the bias need to be to explain away the treatment effect?
Define Γ as the maximum ratio of treatment odds for two units with the same observed covariates:
1/Γ ≤ [P(D=1∣X,U) / P(D=0∣X,U)] / [P(D=1∣X,U′) / P(D=0∣X,U′)] ≤ Γ
Under no unmeasured confounding, Γ=1. Larger Γ represents more potential confounding.
For each value of Γ, compute bounds on the p-value. Find the minimum Γ at which significance is lost. This is the study's sensitivity.
Example: Interpreting Rosenbaum Bounds
Suppose an effect is significant at Γ=1 (no confounding) and remains significant up to Γ=2. This means an unobserved confounder would need to double the odds of treatment to explain away the finding—a substantial amount of confounding.
Oster's Method: Coefficient Stability
Oster (2019) extends Altonji, Elder & Taber (2005) to assess how coefficient estimates change as controls are added:
Estimate treatment effect without controls: τ~
Estimate with observed controls: τ^
Calculate how much τ would change if unobservables were as important as observables
The relative degree of selection δ measures how much selection on unobservables would need to exceed selection on observables to produce a zero effect:
δ ≈ [(τ^ − 0)(R² − R~²)] / [(τ~ − τ^)(Rmax − R²)]
where R~² and R² are the R² values from the uncontrolled and controlled regressions, and Rmax is a hypothesized maximum R² from a regression that also included the unobservables.
If ∣δ∣>1, unobservables would need to be more important than observables to eliminate the effect—providing some reassurance.
E-Values: The Intuitive Sensitivity Measure
The E-value (VanderWeele & Ding 2017) has become the preferred sensitivity measure because of its intuitive interpretation. It asks: What is the minimum strength of association between an unmeasured confounder and both treatment and outcome needed to explain away the observed effect?
E-value = RR + √(RR × (RR − 1))
where RR is the observed risk ratio (or an approximation for other effect measures).
How to Interpret E-Values
An E-value of 3 means: An unmeasured confounder would need to be associated with both treatment and outcome by a risk ratio of at least 3 to fully explain away the observed effect.
1.0: No confounding needed (null effect)
1.5: Modest confounding could explain the result
2.0: Moderate confounding required
3.0+: Strong confounding required
5.0+: Very robust to confounding
Why E-values are intuitive: Unlike Rosenbaum's Γ (which measures odds ratios for treatment assignment), the E-value is directly comparable to known risk ratios. If your strongest observed confounder has a risk ratio of 2 with both treatment and outcome, and your E-value is 4, an unobserved confounder would need to be twice as strong to explain away your result.
Computing E-values in practice:
E-value for the confidence interval: Report E-values for both the point estimate and the confidence interval bound closest to the null. If even the CI bound has a high E-value, the finding is robust.
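A minimal sketch of the E-value calculation from a risk ratio (the EValue R package provides a fuller implementation):

```r
# Sketch: E-value from an observed risk ratio
# For protective effects (RR < 1), invert first, per VanderWeele & Ding (2017)
e_value <- function(rr) {
  rr <- ifelse(rr < 1, 1 / rr, rr)
  rr + sqrt(rr * (rr - 1))
}

e_value(2.0)   # E-value for a point estimate of RR = 2.0 (about 3.41)
e_value(1.3)   # E-value for the CI bound closest to the null
```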
Benchmarking
Compare sensitivity parameters to observed confounders:
How strongly are observed confounders related to treatment and outcome?
Would an unobserved confounder need to be stronger than any observed confounder?
Are there plausible candidates for such strong confounders?
Practical Box: Sensitivity Analysis Reporting
Report:
Point estimate and confidence interval under unconfoundedness
Rosenbaum Γ at which significance is lost
Oster's δ (selection ratio)
E-value for the point estimate and confidence interval
Comparison to strength of observed confounders
Discussion of plausible unobserved confounders
11.6 When Is Conditional Ignorability Credible?
The Fundamental Question
No statistical test can verify unconfoundedness. Its credibility rests on substantive arguments about the selection process.
Key question: Given what we know about how treatment is assigned, is it plausible that all relevant factors are observed?
Institutional Knowledge
The strongest case for unconfoundedness comes from understanding the selection mechanism:
How are treatments assigned?
What information is available to decision-makers?
What factors plausibly influence selection?
If the selection process is well-understood and observed in the data, unconfoundedness is more credible.
Box: Target Trial Emulation
Epidemiologists have developed a useful framework for designing observational studies: target trial emulation (Hernán & Robins 2016).
The idea: Before analyzing observational data, specify the randomized trial you would ideally run. What would be the eligibility criteria? Treatment assignment? Follow-up period? Primary outcome? Then ask: How well can your observational study emulate this trial?
This discipline forces clarity about:
Time zero: When does follow-up begin? (Avoids immortal time bias)
Eligibility: Who is in the study population?
Treatment definition: What exactly is being compared?
Assignment mechanism: What determines who gets treated?
When the observational study cannot emulate the target trial—because of confounding, selection, or measurement issues—at least the limitations are explicit.
This framework is equally valuable in economics. Before running regressions, specify the experiment you wish you could run. Then assess how well your observational design approximates it.
Example: Hospital Choice
Studying the effect of hospital quality on outcomes, selection depends on:
Patient preference and information
Distance and transportation
Referral patterns
Emergency vs. planned admission
If we observe these factors, we might argue for conditional ignorability. But patients may choose based on unobserved health factors or private information—undermining the assumption.
Data Quality
Better data makes unconfoundedness more plausible:
Administrative data: Often captures the actual selection process
Rich longitudinal data: Pre-treatment outcomes may proxy for unobserved factors
Multiple measures: Different proxies for the same construct reduce measurement error
The Selection Story
A credible analysis tells a clear story about selection:
Who gets treated and why? Describe the selection process.
What do we observe? List the covariates and why they matter.
What might we miss? Acknowledge potential unobserved confounders.
How sensitive are results? Show sensitivity analysis.
Validation Through Placebo Tests
While unconfoundedness cannot be directly tested, placebo tests offer indirect evidence about its plausibility.
The logic: Estimate a treatment effect on an outcome that should not be affected by treatment. If you find an effect, something is wrong—likely unobserved confounding.
Common placebo outcomes:
Lagged outcomes: Use a pre-treatment measure of the outcome. The treatment cannot have caused changes in the past.
Predetermined covariates: Variables fixed before treatment should show no "effect."
Conceptually unrelated outcomes: Outcomes with no plausible causal link to treatment.
Example: LaLonde Placebo Test
LaLonde (1986) and subsequent analyses used 1975 earnings as a placebo outcome for a job training program that occurred afterward. The true treatment effect on 1975 earnings is zero by construction.
Using experimental data: placebo estimates are close to zero (as expected).
Using nonexperimental data: placebo estimates are large and negative, even with modern methods—suggesting that selection into treatment is correlated with earnings trajectories in ways the observed covariates do not capture.
Interpreting placebo tests:
Null placebo effect: Consistent with (but does not prove) unconfoundedness. The covariates may capture selection.
Non-null placebo effect: Strong evidence against unconfoundedness. Either important confounders are missing, or the confounders affect the outcome differently in placebo vs. actual periods.
Multiple pretreatment periods strengthen placebo tests: When several lagged outcomes are available, testing for "effects" on each provides more convincing evidence. Imbens, Rubin & Sacerdote (2001), studying lottery winners, used six years of pre-winning earnings as placebos—their null effects across all years strongly supported unconfoundedness.
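A minimal simulated sketch of the placebo logic: when an unobserved confounder drives selection, the pre-treatment outcome shows a spurious "effect" even after adjusting for the observed covariate.

```r
# Sketch: placebo test on a pre-treatment outcome (simulated data; the same
# logic applies to, e.g., 1975 earnings in the LaLonde samples)
set.seed(7)
n     <- 2000
x     <- rnorm(n)
u     <- rnorm(n)                          # unobserved confounder
d     <- rbinom(n, 1, plogis(-0.5 + x + u))
y_pre <- 1 + x + u + rnorm(n)              # measured before treatment
y     <- 1 + 2 * d + x + u + rnorm(n)      # post-treatment outcome

# Placebo regression: treatment cannot have caused y_pre, so a nonzero
# "effect" here signals confounding that x does not capture
coef(summary(lm(y_pre ~ d + x)))["d", ]
```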
Practical Guidance: Placebo Test Checklist
Warning Signs
Be skeptical of unconfoundedness when:
Treatment depends on private information (patient symptoms, student motivation)
Selection is based on anticipated outcomes (Roy model selection)
Strong economic incentives drive selection
Similar studies with better identification yield different results
Sensitivity analysis shows results are fragile
Box: The LaLonde Benchmark—What 40 Years Taught Us
In 1986, Robert LaLonde asked a simple question: Can nonexperimental methods replicate the results of a randomized experiment? Using data from the National Supported Work demonstration—a job training program with a randomized control group—he compared experimental estimates to those from matching trainees to observational comparison groups.
LaLonde's original conclusion was stark: Nonexperimental methods failed. Estimates varied wildly across specifications and often had the wrong sign. His paper became a foundational critique of observational methods and helped spark the credibility revolution.
What we've learned since (Imbens & Xu 2025):
Overlap matters enormously. LaLonde's comparison groups differed dramatically from trainees—average 1975 earnings of $14,000–$19,000 versus $3,000 for trainees. With such poor overlap, estimates relied on extrapolation.
With overlap ensured, estimator choice matters less. Modern methods (matching, IPW, AIPW, causal forests) yield similar estimates once samples are trimmed to ensure comparable units exist in both groups.
But similar estimates don't guarantee causal validity. The LDW-CPS sample produces ATT estimates close to the experimental benchmark using various modern methods. Success? Not quite: placebo tests using 1975 earnings (before treatment) fail badly, suggesting unconfoundedness does not hold.
Recovering average effects is easier than heterogeneous effects. Even when ATT estimates match experimental benchmarks, conditional average treatment effects (CATTs) diverge substantially between experimental and nonexperimental analyses.
The nuanced answer to LaLonde's question: Sometimes nonexperimental methods can replicate experimental benchmarks—and we now have better tools to assess when. The key is not the estimation method but whether (a) overlap exists and (b) unconfoundedness is plausible. Placebo tests and sensitivity analyses help evaluate the latter.
The sobering implication: Methods that robustly estimate the statistical estimand (covariate-adjusted treatment-control difference) may still fail to recover the causal estimand if unobserved confounding exists. No amount of methodological sophistication substitutes for a credible design.
Comparison to Other Methods
Selection on observables is one tool among many. Consider:
A natural experiment exists: IV, DiD, or RD
Panel data available: fixed effects to control time-invariant confounders
Selection on unobservables likely: IV or bounds
Key confounder is unobserved but can be proxied: proxy variable methods
Running Example Connection: China's Growth
Selection on observables (SOO) is fundamentally a micro-level strategy. It assumes we can identify and measure all relevant confounders—plausible when studying individual choices where selection depends on observable characteristics. But for macro questions like what caused China's post-1978 growth, the approach breaks down entirely. We cannot list all the confounders affecting both policy choices (market liberalization, trade openness, institutional reform) and outcomes (GDP growth). Even if we could, we would have only one China to observe. Selection on observables requires comparing treated and control units with similar characteristics—but there is no control China. This is why macro questions require the triangulation of methods discussed in Chapters 1 and 23.
Practical Guidance
Method Selection
Unconfoundedness plausible, good overlap: any method; AIPW preferred
Concern about functional form: matching or AIPW with flexible models
Extreme propensity scores: trim or match; avoid pure IPW
Multiple confounders, continuous treatment: GPS methods
Concern about unobserved confounding: SOO with extensive sensitivity analysis
Strong concern about unobserved confounding: consider a different identification strategy
Common Pitfalls
Pitfall 1: Controlling for Post-Treatment Variables
Including variables affected by treatment biases estimates—often severely.
How to avoid: Map out the causal structure. Only control for pre-treatment confounders.
Pitfall 2: Ignoring the Overlap Assumption
If propensity scores are near 0 or 1, there is no comparable counterfactual. Estimates rely on extrapolation. The LaLonde data illustrate this dramatically: without ensuring overlap, ATT estimates range from −$8,000 to +$2,000 depending on the estimator used.
How to avoid: Plot propensity score distributions by treatment status. Trim or match on propensity scores to ensure common support. Report overlap diagnostics. With good overlap, estimates become much more stable across methods—often more stable than any other specification choice.
Pitfall 3: Claiming Causation Without Defending Unconfoundedness
Adding controls does not automatically justify causal interpretation.
How to avoid: Explicitly state and defend the unconfoundedness assumption. Conduct sensitivity analysis.
Pitfall 4: Over-Relying on Propensity Score Prediction
A good propensity score model predicts treatment. But prediction quality is not the goal—balance is.
How to avoid: Focus on covariate balance, not prediction accuracy. Use balance diagnostics.
Pitfall 5: The Table 2 Fallacy
When you estimate Y=α+τD+βX+ε and report both τ^ and β^, it's tempting to interpret β^ causally too. This is almost always wrong.
The regression is designed to identify the effect of D on Y by controlling for X. But the coefficient on X is not identified for a causal interpretation—you haven't controlled for the confounders of the X→Y relationship.
Example: Regressing wages on education and parental income, parental income "controls for" family background when estimating returns to education. But the coefficient on parental income does NOT estimate the causal effect of parental income on wages—that would require controlling for its confounders (parental education, geography, genetics, etc.).
The pattern: Papers often present regression results with one causally-interpreted coefficient (the treatment) and many descriptive coefficients (the controls). Readers—and sometimes authors—interpret all coefficients causally. This is the "Table 2 Fallacy" (Westreich and Greenland 2013).
How to avoid: Only interpret causally the coefficient you've designed the study to identify. Label control variable coefficients as "adjustment factors" or "associations," not effects.
Implementation Checklist
Design:
Estimation:
Inference and Robustness:
Reporting:
Qualitative Bridge
What Quantitative Balance Cannot Capture
Matching and weighting achieve balance on observed covariates. But:
Do these covariates capture what matters for selection?
Are the measures valid and reliable?
What unobserved factors might remain?
Qualitative research can help answer these questions.
Using Qualitative Knowledge for Selection Stories
Interviews with decision-makers: How do they assign treatment? What factors do they consider?
Case studies: Detailed examination of selection in specific instances.
Expert knowledge: Domain specialists often know the selection process better than data reveal.
Example: Program Evaluation
Evaluating a job training program using observational data:
Quantitative: Match participants to non-participants on demographics, prior employment, education.
Qualitative: Interview case workers who refer clients. Learn that they refer the "most motivated" clients—an unobserved confounder.
This qualitative insight suggests the matching estimate is upward biased and motivates searching for better identification.
Integration Note
Connections to Other Chapters
Ch. 9 (Causal Framework): formalizes the unconfoundedness assumption
Ch. 10 (Experiments): experiments eliminate the selection bias that SOO must assume away
Ch. 12 (IV): IV handles selection on unobservables when SOO fails
Ch. 20 (Heterogeneity): propensity scores for heterogeneous effects (CATE)
Ch. 21 (ML for Causal): machine learning for flexible propensity/outcome models
When to Use Selection on Observables
SOO is appropriate when:
No better identification strategy is available
Rich data capture the selection process
Institutional knowledge supports unconfoundedness
Sensitivity analysis shows robustness
SOO is inappropriate when:
Strong unobserved confounders are likely
A cleaner identification strategy exists
Results are sensitive to plausible confounding
The selection process is poorly understood
Summary
Key takeaways:
Selection on observables assumes all confounders are observed: This allows conditioning on covariates to identify causal effects, but the assumption is untestable.
Multiple methods implement the assumption: Regression, matching, propensity score weighting, and doubly robust estimation all rely on the same identifying assumption but differ in how they adjust for confounders.
Doubly robust methods provide insurance: AIPW is consistent if either the outcome model or propensity score is correct—a valuable hedge against misspecification.
Overlap is as important as balance: Before worrying about covariate balance, verify that treated and control groups have common support in their propensity score distributions. With poor overlap, estimates depend on extrapolation and can vary wildly across methods. Trimming to ensure overlap often matters more than estimator choice.
Placebo tests probe unconfoundedness: Since unconfoundedness cannot be tested directly, estimate "effects" on outcomes that should be unaffected (pre-treatment measures, conceptually unrelated variables). Failed placebo tests are strong evidence against unconfoundedness; passed tests increase credibility.
Sensitivity analysis is essential: We must assess how much unobserved confounding would be needed to overturn findings. Rosenbaum bounds, E-values, and Oster's method provide complementary approaches.
Credibility requires substantive defense: Statistical methods cannot establish unconfoundedness. It must be defended with institutional knowledge and careful reasoning about selection.
Returning to the opening question: We can credibly estimate causal effects by controlling for confounders when we have good reason to believe all relevant confounders are observed and controlled for. "Good reason" comes not from statistical tests but from understanding the selection process, having rich data, and demonstrating robustness to potential unobserved confounding. When these conditions are met, selection on observables methods are powerful tools. When they are not, we should seek better identification strategies.
Further Reading
Essential
Rosenbaum & Rubin (1983). "The Central Role of the Propensity Score in Observational Studies for Causal Effects." Biometrika. The foundational paper.
Imbens (2004). "Nonparametric Estimation of Average Treatment Effects Under Exogeneity: A Review." RESTAT.
For Deeper Understanding
Imbens & Rubin (2015). Causal Inference for Statistics, Social, and Biomedical Sciences, Chapters 12-18. Comprehensive textbook treatment.
Rosenbaum (2002). Observational Studies, 2nd ed. Deep dive into matching and sensitivity analysis.
The Epidemiological Perspective
Hernán & Robins (2020). Causal Inference: What If. Freely available online. The definitive treatment of g-methods.
Hernán & Robins (2016). "Using Big Data to Emulate a Target Trial." AJE. Introduces target trial emulation.
Robins (1986). "A New Approach to Causal Inference in Mortality Studies." Mathematical Modelling. The original g-computation paper.
Doubly Robust Methods
Bang & Robins (2005). "Doubly Robust Estimation in Missing Data and Causal Inference Models." Biometrics.
Kennedy (2016). "Semiparametric Theory and Empirical Processes in Causal Inference." Survey of modern methods.
Model Interpretation and Marginal Effects
Arel-Bundock, Greifer & Heiss (2024). "How to Interpret Statistical Models Using marginaleffects in R and Python." Journal of Statistical Software 111(9), 1–32. Clarifies the confusing terminology around marginal effects and provides a unified computational framework.
Long (1997). Regression Models for Categorical and Limited Dependent Variables. Classic treatment of interpreting nonlinear models.
Sensitivity Analysis
Rosenbaum (2002). Observational Studies, Chapters 4-5. Rosenbaum bounds.
Oster (2019). "Unobservable Selection and Coefficient Stability." JBES.
VanderWeele & Ding (2017). "Sensitivity Analysis in Observational Research." Annals of Internal Medicine. E-values.
Continuous and Multivalued Treatments
Hirano & Imbens (2004). "The Propensity Score with Continuous Treatments." In Applied Bayesian Modeling and Causal Inference from Incomplete-Data Perspectives. The foundational paper on generalized propensity scores.
Imbens (2000). "The Role of the Propensity Score in Estimating Dose-Response Functions." Biometrika. Early treatment of dose-response with GPS.
The LaLonde Literature
Imbens & Xu (2025). "Comparing Experimental and Nonexperimental Methods: What Lessons Have We Learned Four Decades after LaLonde (1986)?" Journal of Economic Perspectives 39(4): 173–202. Essential retrospective with replication data and tutorial at https://yiqingxu.org/tutorials/lalonde/.
LaLonde (1986). "Evaluating the Econometric Evaluations of Training Programs with Experimental Data." AER. The original benchmark study.
Dehejia & Wahba (1999). "Causal Effects in Nonexperimental Studies." JASA. Classic matching application using LaLonde data.
Smith & Todd (2005). "Does Matching Overcome LaLonde's Critique of Nonexperimental Estimators?" Journal of Econometrics. Important follow-up showing sensitivity to comparison group choice.
Other Applications
Heckman, Ichimura & Todd (1997). "Matching as an Econometric Evaluation Estimator." RES.
Imbens, Rubin & Sacerdote (2001). "Estimating the Effect of Unearned Income on Labor Earnings." AER. Lottery study with strong placebo test support for unconfoundedness.
Exercises
Conceptual
Explain why controlling for a collider can introduce bias even when the collider is correlated with both treatment and outcome. Give an example.
A researcher estimates the effect of college quality on earnings by matching students on SAT scores and family income. A colleague suggests also matching on post-college occupation.
Is occupation a good control?
Draw a DAG illustrating the issue.
How would including occupation affect the estimate?
Applied
Using observational data on a job training program (e.g., Dehejia-Wahba data):
Estimate the ATT using (a) OLS, (b) nearest-neighbor matching, (c) IPW, and (d) AIPW
Assess covariate balance before and after matching
How much do the estimates differ? Why?
Conduct a sensitivity analysis for your preferred estimate from Exercise 3:
Calculate Rosenbaum's Γ at which significance is lost
Calculate the E-value
Discuss: How much unobserved confounding would be needed to overturn the finding?
Discussion
"Selection on observables is always inferior to quasi-experimental methods because the key assumption cannot be tested." Evaluate this claim. Under what circumstances might SOO be preferable to (or at least as credible as) IV or DiD?
Data Exercise
LaLonde replication (data available at https://yiqingxu.org/tutorials/lalonde/):
Using the LDW-CPS data, estimate the ATT using regression, nearest-neighbor matching, IPW, and AIPW
Plot propensity score distributions for treated and control groups. Is there adequate overlap?
Trim the sample to improve overlap (e.g., remove units with propensity scores outside [0.1, 0.9]). How do estimates change? Do they converge across methods?
Conduct a placebo test using 1975 earnings as the outcome (excluding 1975 earnings from conditioning variables). What do you find?
Discuss: Do your results support causal interpretation of the ATT estimate? What does this exercise teach about the relationship between statistical and causal estimands?
Technical Appendix
A.1 Propensity Score Weighting Derivation
Under unconfoundedness and overlap:
E[DY / e(X)] = E[ E[DY / e(X) ∣ X] ] = E[ E[DY ∣ X] / e(X) ]
Since E[DY ∣ X] = E[Y ∣ D=1, X] ⋅ P(D=1 ∣ X) = E[Y ∣ D=1, X] ⋅ e(X):
E[DY / e(X)] = E[E[Y ∣ D=1, X]] = E[E[Y(1) ∣ X]] = E[Y(1)]
Similarly, E[(1−D)Y / (1 − e(X))] = E[Y(0)].
A.2 Double Robustness Proof (Sketch)
The AIPW estimator can be written as:
τ^AIPW = (1/n) ∑i ϕ^(Zi)
where ϕ^ is the estimated influence function. Under regularity conditions:
If the propensity score is correct:
The IPW term consistently estimates the bias in the outcome model
The combination is consistent
If the outcome model is correct:
The outcome model term is consistent
The IPW correction term has expectation zero
The combination is consistent
The key is that the cross-term has expectation zero when either model is correct.
A.3 Overlap and Positivity
The overlap or positivity assumption requires:
0<e(X)<1 for all X in the support
Without overlap:
Some treated units have no comparable controls (or vice versa)
IPW weights become infinite
Estimates rely on extrapolation, not data
Practical violations: Near-violations where e(X)≈0 or ≈1 also cause problems through extreme weights and high variance.