Chapter 11: Selection on Observables

Opening Question

When can we credibly estimate causal effects from observational data by controlling for the right variables—and how do we know if we have?


Chapter Overview

Randomized experiments are not always feasible. When they are not, researchers often turn to observational data—data where treatment was not randomly assigned but chosen by individuals, institutions, or circumstance. Can we still learn about causal effects?

The answer depends on a critical assumption: that we observe all the variables that jointly affect treatment and outcomes. If we do—if "selection is on observables"—then adjusting for these confounders recovers causal effects. This chapter develops methods for such adjustment: regression, matching, propensity score weighting, and doubly robust estimation. We also confront the uncomfortable truth that the key assumption is untestable, and develop sensitivity analyses to assess how much unobserved confounding would be needed to overturn our conclusions.

Selection on observables methods are ubiquitous in applied research. They are also frequently misused. The goal of this chapter is not just to teach the methods, but to develop judgment about when they are credible and how to defend—or critique—their application.

What you will learn:

  • When regression identifies causal effects (and when it does not)

  • How matching and propensity score methods work

  • The logic of doubly robust estimation

  • How to conduct and interpret sensitivity analyses

  • When selection on observables is a credible assumption

Prerequisites: Chapter 9 (The Causal Framework), Chapter 3 (Statistical Foundations)


11.1 Regression for Causal Inference

When Does OLS Identify Causal Effects?

Regression is the workhorse of empirical research. But when does a regression coefficient have a causal interpretation?

Consider the regression:

$$Y_i = \alpha + \tau D_i + X_i'\beta + \varepsilon_i$$

The coefficient $\tau$ estimates the causal effect of $D$ on $Y$ if:

Assumption 11.1: Conditional Unconfoundedness

$$(Y(0), Y(1)) \perp D \mid X$$

Conditional on observed covariates $X$, treatment assignment is independent of potential outcomes.

This is the same as unconfoundedness from Chapter 9. In regression terms, it means that after controlling for $X$, the remaining variation in $D$ is as good as random.

When might this hold?

  • When $X$ includes all variables that affect both treatment selection and outcomes

  • When institutional knowledge suggests that selection depends only on observables

  • When rich administrative data captures the selection process

When does it fail?

  • When unobserved factors (ability, motivation, preferences) affect both $D$ and $Y$

  • When selection depends on private information not in the data

  • Almost always, to some degree

The Bad Controls Problem

A common mistake is to "control for everything available." This can introduce bias rather than remove it.

Bad controls are variables that are affected by treatment or that open backdoor paths:

  1. Post-treatment variables: Controlling for a consequence of treatment blocks part of the causal effect.

  2. Mediators: If $D \to M \to Y$, controlling for $M$ removes the indirect effect.

  3. Colliders: Controlling for a common effect of two variables creates spurious association between them.

Example: Bad Control

Estimating the effect of education on wages, you control for occupation. But occupation is partly caused by education. Controlling for it asks: "What is the effect of education, holding occupation fixed?" This removes the channel through which education raises wages (better jobs) and may reverse the sign of the effect.

The DAG test: Draw the causal graph. A variable is a bad control if:

  • It is a descendant of treatment (post-treatment)

  • It is a collider on a path between treatment and outcome

  • Conditioning on it opens a previously blocked path

Functional Form Sensitivity

Regression imposes functional form assumptions. With continuous covariates or treatment:

$$E[Y \mid D, X] = \alpha + \tau D + X'\beta$$

assumes linearity and additivity. If the true relationship is nonlinear or includes interactions, the estimate of $\tau$ depends on the distribution of $X$ and may not equal the ATE.

Solutions:

  • Include flexible functions of $X$ (polynomials, splines)

  • Interact treatment with covariates

  • Use nonparametric methods (matching, weighting)

Practical Guidance: Regression for Causal Inference

Regression is appropriate when:

  • Unconfoundedness is plausible given $X$

  • The functional form is approximately correct

  • You avoid bad controls

Be cautious when:

  • Key confounders are likely unobserved

  • The treatment effect may vary with $X$

  • Covariate distributions differ substantially between treated and control

Box: G-Computation—The Epidemiological Perspective

Epidemiologists call the outcome-modeling approach g-computation or standardization—part of the broader "g-methods" tradition developed by James Robins and colleagues (see Section 9.7 for context on the epidemiological contribution to causal inference). The "g" stands for "generalized," referring to Robins' (1986) generalization to time-varying treatments.

The g-computation algorithm:

  1. Fit an outcome model: $\hat{E}[Y \mid D, X]$

  2. For each unit, predict outcomes under treatment: $\hat{Y}_i(1) = \hat{E}[Y \mid D=1, X_i]$

  3. For each unit, predict outcomes under control: $\hat{Y}_i(0) = \hat{E}[Y \mid D=0, X_i]$

  4. Average: $\widehat{ATE} = \frac{1}{n}\sum_i [\hat{Y}_i(1) - \hat{Y}_i(0)]$
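A minimal R sketch of the algorithm, assuming a hypothetical data frame `dat` with outcome `y`, binary treatment `d`, and covariates `x1` and `x2`; for a binary outcome, the `lm()` call would be replaced by `glm(..., family = binomial)` with predictions on the response scale:

```r
# G-computation sketch (hypothetical data frame `dat` with outcome y,
# binary treatment d, and covariates x1, x2)
fit <- lm(y ~ d * (x1 + x2), data = dat)   # outcome model with treatment interactions

# Steps 2-3: predict each unit's outcome under treatment and under control
y1_hat <- predict(fit, newdata = transform(dat, d = 1))
y0_hat <- predict(fit, newdata = transform(dat, d = 0))

# Step 4: average the unit-level differences to estimate the ATE
mean(y1_hat - y0_hat)
```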

This is equivalent to the regression approach when the outcome model is correctly specified and treatment effects are homogeneous. But g-computation naturally accommodates:

  • Nonlinear outcome models (logistic, Poisson)

  • Treatment effect heterogeneity (via interactions)

  • Marginal effects that differ from conditional effects

The key insight: instead of interpreting a regression coefficient, compute predicted outcomes under each treatment scenario and compare them. This "plug-in" approach is conceptually cleaner when effects are heterogeneous or outcomes are nonlinear.

See Hernán & Robins (2020), Chapter 13, for full treatment including time-varying extensions.

Box: Marginal Effects—What Are We Actually Estimating?

The term "marginal effect" means different things to different disciplines, creating confusion that has persisted for decades. When a researcher reports a "marginal effect," they might mean any of the following:

Term | Definition | When they differ
Conditional effect | $\partial E[Y \mid X]/\partial X$ at specific $X$ values | Varies with the evaluation point under nonlinearity
Marginal effect at the mean (MEM) | Effect evaluated at $X = \bar{X}$ | Differs from AME with nonlinearity
Average marginal effect (AME) | $\frac{1}{n}\sum_i \partial E[Y \mid X_i]/\partial X$ | Averages over the sample distribution of $X$

In a linear model ($Y = \alpha + \beta X + \varepsilon$), these all equal $\beta$. But with nonlinear models—logit, probit, Poisson, or any model with interactions—they diverge, sometimes substantially.

Why this matters for causal inference: The causal estimands we care about—ATE, ATT—are defined as averages over populations. The ATE is $E[Y(1) - Y(0)]$, which corresponds to the AME, not the MEM. When you report a logit coefficient or an effect "at the mean," you're not reporting the ATE.

Consider a logistic regression for employment ($Y$) on a training program ($D$):

  • The coefficient $\hat{\beta}$ is a log odds ratio—not directly interpretable as a probability change

  • The MEM evaluates $\partial P(Y=1)/\partial D$ at mean covariate values—but no one has exactly average characteristics

  • The AME averages the marginal effect across all individuals—this estimates the ATE under unconfoundedness

The g-computation algorithm in the previous box computes the AME: predict outcomes for everyone under treatment, predict under control, and average the difference. This is what we want for causal inference.
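The contrast is easy to see in code. The sketch below fits a logistic model on hypothetical data (`dat` with `y`, `d`, `x1`, `x2`) and computes both quantities by hand; the marginaleffects package's `avg_comparisons()` automates the AME calculation for most model types:

```r
# Hypothetical logistic model for employment y on training d and covariates x1, x2
fit <- glm(y ~ d + x1 + x2, data = dat, family = binomial)

# AME: average the predicted-probability contrast over every unit in the sample
p1 <- predict(fit, newdata = transform(dat, d = 1), type = "response")
p0 <- predict(fit, newdata = transform(dat, d = 0), type = "response")
ame <- mean(p1 - p0)

# MEM: the same contrast evaluated only at mean covariate values
at_means <- data.frame(x1 = mean(dat$x1), x2 = mean(dat$x2))
mem <- as.numeric(
  predict(fit, newdata = transform(at_means, d = 1), type = "response") -
  predict(fit, newdata = transform(at_means, d = 0), type = "response"))

c(AME = ame, MEM = mem)   # diverge when the model is nonlinear in d or X
```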

Practical implications:

  • For linear models with no interactions: report regression coefficients

  • For nonlinear models: report AMEs, not coefficients or MEMs

  • For models with interactions: compute effects at substantively meaningful covariate values, or report AMEs

  • Always clarify which quantity you're reporting

Terminology warning: The word "marginal" is overloaded in statistics. In the AME/MEM context above, "marginal" means a derivative ($\partial y / \partial x$). But in multilevel/mixed models, "marginal" means something entirely different: population-average effects (integrating over the distribution of random effects) versus conditional effects (for a typical cluster with random effects set to zero). These are distinct concepts that happen to share a name. When reading papers using mixed models, check which meaning applies—the marginal (population-average) effect in a GLMM is not the same as the AME from this box, though both involve averaging.

The marginaleffects package (Arel-Bundock, Greifer & Heiss 2024) provides a unified framework for computing predictions, comparisons, and slopes across 100+ model types in R and Python; see Chapter 18 for implementation. The same paper gives a thorough treatment of the terminology and the pitfalls of each approach.


11.2 Propensity Score Methods

The Propensity Score

The propensity score is the probability of treatment given covariates:

$$e(X) = P(D = 1 \mid X)$$

Rosenbaum and Rubin (1983) showed a remarkable result, building on ideas that developed simultaneously in statistics and epidemiology:

Theorem 11.1: Propensity Score Theorem

If $(Y(0), Y(1)) \perp D \mid X$, then $(Y(0), Y(1)) \perp D \mid e(X)$

Conditioning on the propensity score is sufficient for unconfoundedness.

Intuition: The propensity score summarizes all the information in $X$ relevant for treatment assignment. Units with the same propensity score are equally likely to be treated, regardless of their specific covariate values. Comparing treated and control units with the same propensity score is like comparing within a mini-experiment.

Dimension reduction: Instead of matching on many covariates $X$, we can match on a single scalar $e(X)$.

Estimating the Propensity Score

The propensity score is typically estimated using logistic regression:

$$\log\left(\frac{e(X)}{1-e(X)}\right) = X'\gamma$$

Then $\hat{e}(X_i) = \text{logit}^{-1}(X_i'\hat{\gamma})$.
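A minimal sketch of this step in R, assuming a hypothetical data frame `dat` with binary treatment `d` and pre-treatment covariates `x1`-`x3`; the fitted scores are reused in the matching, weighting, and diagnostic sketches later in this chapter:

```r
# Estimate propensity scores by logistic regression (hypothetical data frame `dat`
# with binary treatment d and pre-treatment covariates x1, x2, x3)
ps_fit <- glm(d ~ x1 + x2 + x3, data = dat, family = binomial)
dat$pscore <- predict(ps_fit, type = "response")   # e-hat(X_i) on the probability scale

summary(dat$pscore)   # quick check for values near 0 or 1 (overlap problems)
```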

Alternatives:

  • Probit regression

  • Machine learning methods (random forests, boosting, LASSO)

  • Covariate balancing propensity scores (CBPS)

The goal is not to predict treatment well in a forecasting sense, but to achieve covariate balance between treated and control groups after adjustment.

Matching

Matching pairs treated units with similar control units:

  1. For each treated unit, find control units with similar $X$ or $e(X)$

  2. Estimate the treatment effect by comparing matched pairs

Types of matching:

Method | Description | Pros | Cons
Exact | Match on identical $X$ | No approximation | Fails with many covariates
Nearest neighbor | Match to closest unit(s) | Simple | May use poor matches
Caliper | Match only within distance $c$ | Avoids bad matches | May discard units
Kernel | Weight all controls by distance | Uses all data | Requires bandwidth choice
Coarsened exact | Exact match on coarsened $X$ | Transparent | Sensitive to coarsening

With or without replacement:

  • With replacement: Each control can match multiple treated units. More flexible but inference is more complex.

  • Without replacement: Each control matches at most one treated unit. Order of matching matters.

ATT Matching Estimator

$$\hat{\tau}_{ATT} = \frac{1}{n_1} \sum_{D_i=1} \left(Y_i - \hat{Y}_i^{(0)}\right)$$

where $\hat{Y}_i^{(0)}$ is the (average) outcome of matched control units.
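A sketch of 1:1 nearest-neighbor propensity score matching with the MatchIt package (variable names hypothetical); the simple regression standard error here ignores the matching step, so the adjustments discussed in the box below are preferable in practice:

```r
# 1:1 nearest-neighbor matching on the logistic propensity score (MatchIt package)
library(MatchIt)

m_out <- matchit(d ~ x1 + x2 + x3, data = dat,
                 method = "nearest", distance = "glm", replace = FALSE)

md <- match.data(m_out)                    # matched sample with matching weights
att_fit <- lm(y ~ d, data = md, weights = weights)
coef(att_fit)["d"]                         # ATT estimate from the matched sample
```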

Box: The Evolution from Matching to Doubly Robust Methods

Matching was the dominant approach in the 2000s, following influential papers by Rosenbaum, Rubin, and Dehejia and Wahba. However, modern best practice has shifted toward doubly robust (AIPW) and weighting-based methods for several reasons:

Limitations of matching:

  • Matching discards data: Unmatched controls are thrown away, reducing precision

  • Arbitrary matching choices: Caliper width, with/without replacement, matching order—all affect results

  • Variance estimation is complex: Standard errors for matched estimators require special adjustments (Abadie & Imbens, 2006)

  • No double robustness: Misspecify the matching metric and estimates are biased

Advantages of AIPW/weighting:

  • Uses all data: Every observation contributes (with appropriate weights)

  • Double robustness: Consistent if either the propensity or outcome model is correct

  • Clean inference: Standard variance estimation works; easy to bootstrap

  • Natural integration with ML: Cross-fitting + AIPW enables flexible estimation (DML)

Current recommendation:

Method | When to use
AIPW/TMLE | Default choice for most applications
IPW | When only the propensity model is credible
Regression | When only the outcome model is credible
Matching | For transparency, sample construction, or when stakeholders expect it

Matching remains valuable for sample construction (creating a matched sample for subsequent analysis) and transparency (showing exactly which units are compared). But for primary causal estimation, doubly robust methods are now preferred.

Inverse Probability Weighting (IPW)

IPW reweights observations to create a pseudo-population where treatment is independent of covariates:

IPW Estimator for ATE

$$\hat{\tau}_{IPW} = \frac{1}{n}\sum_{i=1}^n \left(\frac{D_i Y_i}{\hat{e}(X_i)} - \frac{(1-D_i) Y_i}{1-\hat{e}(X_i)}\right)$$

Intuition: Treated units with low propensity scores are "surprising"—they were unlikely to be treated but were. They carry more information about what control units would have experienced under treatment, so they receive higher weight.

Weights:

  • For ATE: $w_i = \frac{D_i}{\hat{e}(X_i)} + \frac{1-D_i}{1-\hat{e}(X_i)}$

  • For ATT: $w_i = D_i + (1-D_i)\frac{\hat{e}(X_i)}{1-\hat{e}(X_i)}$
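A hand-rolled sketch of the ATE version, reusing the hypothetical `pscore` column estimated earlier:

```r
# IPW weights for the ATE, following the formula above (pscore estimated earlier)
w_ate <- with(dat, d / pscore + (1 - d) / (1 - pscore))

# Unnormalized (Horvitz-Thompson style) IPW estimator of the ATE
tau_ipw <- with(dat, mean(d * y / pscore) - mean((1 - d) * y / (1 - pscore)))

# Normalized (Hajek) version via weighted regression; robust or bootstrap
# standard errors are preferable in practice
coef(lm(y ~ d, data = dat, weights = w_ate))["d"]
```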

Problems with IPW:

  • Extreme propensity scores create extreme weights

  • Estimates can be highly variable

  • Sensitive to propensity score misspecification

Solutions:

  • Trim observations with propensity scores near 0 or 1

  • Truncate weights at some maximum

  • Use stabilized weights: $\frac{P(D)}{\hat{e}(X)}$ instead of $\frac{1}{\hat{e}(X)}$

  • Use doubly robust methods (Section 11.3)

Subclassification (Stratification)

Subclassification groups units into strata by propensity score, then estimates effects within strata:

  1. Estimate propensity scores

  2. Divide into $K$ strata (often 5-10 quantiles)

  3. Estimate treatment effect within each stratum

  4. Average across strata, weighting by stratum size

This is less sensitive to extreme weights than IPW but imposes discretization.
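A sketch of the algorithm with five propensity score strata (hypothetical data; assumes every stratum contains both treated and control units):

```r
# Subclassify on the estimated propensity score (5 quantile-based strata);
# assumes each stratum contains both treated and control units
dat$stratum <- cut(dat$pscore,
                   breaks = quantile(dat$pscore, probs = seq(0, 1, by = 0.2)),
                   include.lowest = TRUE)

strata_est <- sapply(split(dat, dat$stratum), function(s) {
  c(diff = mean(s$y[s$d == 1]) - mean(s$y[s$d == 0]),   # within-stratum effect
    n    = nrow(s))                                      # stratum size
})

# Average the within-stratum differences, weighting by stratum size
weighted.mean(strata_est["diff", ], w = strata_est["n", ])
```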

Covariate Balance Diagnostics

The key diagnostic: Check whether adjustment achieves covariate balance between treated and control.

Before adjustment, treated and control groups typically differ on XX. After matching or weighting, they should not.

Standardized differences:

$$d_j = \frac{\bar{X}_{j1} - \bar{X}_{j0}}{\sqrt{(s_{j1}^2 + s_{j0}^2)/2}}$$

Rules of thumb:

  • $|d| < 0.1$: Good balance

  • $|d| < 0.25$: Acceptable

  • $|d| > 0.25$: Poor balance; investigate

Variance ratios: Check that variances are similar after matching.

Visual inspection: Plot covariate distributions before and after matching.
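A sketch of the standardized-difference calculation for a single covariate, before and after IPW adjustment (hypothetical data; `w_ate` are the weights from the IPW sketch above). The cobalt package's `bal.tab()` and `love.plot()` automate this across covariates and produce plots like Figure 11.1:

```r
# Standardized difference for one covariate, before and after weighting
std_diff <- function(x, d, w = rep(1, length(x))) {
  m1 <- weighted.mean(x[d == 1], w[d == 1])
  m0 <- weighted.mean(x[d == 0], w[d == 0])
  # denominator uses the unadjusted group variances, as in the formula above
  (m1 - m0) / sqrt((var(x[d == 1]) + var(x[d == 0])) / 2)
}

with(dat, std_diff(x1, d))          # before adjustment
with(dat, std_diff(x1, d, w_ate))   # after IPW weighting: should move toward 0
```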

Practical Box: Balance Assessment Checklist

Figure 11.1: Love Plot—Covariate Balance Before and After Matching. Each row shows a covariate's standardized difference between treated and control groups. Red circles indicate imbalance before matching; green squares show balance achieved after matching. Values within the shaded region (|d| < 0.1) indicate good balance.

Box: Assessing Overlap with Propensity Score Distributions

Balance diagnostics check whether covariate means are similar after adjustment. But a more fundamental question is whether treated and control groups share common support—whether there are comparable units in both groups across the range of propensity scores.

The most informative diagnostic is a histogram (or density plot) of propensity scores by treatment status:

What to look for:

  • Good overlap: Distributions largely coincide; most treated units have propensity scores where control units also exist (and vice versa)

  • Poor overlap: Distributions are separated; treated units cluster at high propensity scores with few controls nearby, or vice versa

Log-odds scale: For clearer visualization, especially when propensity scores cluster near 0 or 1, plot the log-odds: $\log(\hat{e}/(1-\hat{e}))$. A log-odds of −3 corresponds to roughly $p = 0.05$; a log-odds of 3 corresponds to roughly $p = 0.95$.
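A base-R sketch of these diagnostics, assuming the hypothetical `pscore` column from earlier:

```r
# Overlaid propensity score histograms by treatment status
hist(dat$pscore[dat$d == 1], breaks = 30, freq = FALSE, xlim = c(0, 1),
     col = rgb(1, 0, 0, 0.4), main = "Propensity score overlap",
     xlab = "Estimated propensity score")
hist(dat$pscore[dat$d == 0], breaks = 30, freq = FALSE,
     col = rgb(0, 0, 1, 0.4), add = TRUE)
legend("topright", fill = c(rgb(1, 0, 0, 0.4), rgb(0, 0, 1, 0.4)),
       legend = c("Treated", "Control"))

# The same comparison on the log-odds scale spreads out values near 0 and 1
log_odds <- with(dat, log(pscore / (1 - pscore)))
boxplot(log_odds ~ dat$d, names = c("Control", "Treated"),
        ylab = "Log-odds of treatment")
```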

When overlap is poor:

  • Estimates rely on extrapolation, not comparable units

  • Different estimators can yield wildly different results

  • Trimming (removing units with extreme propensity scores) often improves robustness more than choosing a fancier estimator

Imbens & Xu (2025) demonstrate this vividly: with the LaLonde data, severe overlap problems cause estimates to vary from −$8,000 to +$2,000 depending on method. After trimming to ensure overlap, all modern estimators converge to similar values. The lesson: ensuring overlap is often more important than estimator choice.

Trimming strategies:

  • Drop treated units with propensity scores below the minimum in the control group (Dehejia & Wahba 1999)

  • Drop units with propensity scores outside [0.1, 0.9] (Crump et al. 2009)

  • Use matching to construct a sample with good overlap by design

The loss of precision from trimming is typically modest; the gain in robustness can be substantial.

Figure 11.2: Propensity Score Overlap. Left panel shows good overlap—treated and control distributions largely coincide, enabling reliable comparisons. Right panel shows poor overlap—groups are largely separated, forcing estimates to rely on extrapolation rather than comparable units.

11.3 Doubly Robust Estimation

The Problem with Single Methods

Both outcome regression and propensity score methods have limitations:

  • Outcome regression: Sensitive to misspecification of $E[Y \mid D, X]$

  • IPW: Sensitive to misspecification of $e(X)$, extreme weights

What if we could combine them to be robust to misspecification of either (but not both)? This insight, developed by James Robins, Andrea Rotnitzky, and colleagues in biostatistics during the 1990s, led to doubly robust methods—now standard in both epidemiology and economics.

Augmented Inverse Probability Weighting (AIPW)

The doubly robust or AIPW estimator combines outcome modeling and propensity scores:

AIPW Estimator

$$\hat{\tau}_{AIPW} = \frac{1}{n}\sum_{i=1}^n \left[\hat{\mu}_1(X_i) - \hat{\mu}_0(X_i) + \frac{D_i(Y_i - \hat{\mu}_1(X_i))}{\hat{e}(X_i)} - \frac{(1-D_i)(Y_i - \hat{\mu}_0(X_i))}{1-\hat{e}(X_i)}\right]$$

where the $\hat{\mu}_d(X)$ are outcome-model estimates of $E[Y \mid D = d, X]$.

Intuition: Start with the outcome model prediction. Then correct for any remaining imbalance using IPW applied to the residuals.

Double Robustness Property

Theorem 11.2: Double Robustness

The AIPW estimator is consistent if either:

  • The outcome model $\hat{\mu}_d(X)$ is correctly specified, OR

  • The propensity score $\hat{e}(X)$ is correctly specified

Only one needs to be correct, not both.

This provides insurance against model misspecification. If you're unsure about functional form, doubly robust estimators hedge your bets.

Implementation

In R:
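A minimal hand-rolled sketch of the AIPW estimator with parametric nuisance models (hypothetical variable names); in practice, packages such as AIPW or DoubleML handle estimation and inference, including the cross-fitting discussed below:

```r
# Nuisance models (parametric, for illustration)
ps_fit <- glm(d ~ x1 + x2, data = dat, family = binomial)
e_hat  <- predict(ps_fit, type = "response")

mu1_fit <- lm(y ~ x1 + x2, data = subset(dat, d == 1))
mu0_fit <- lm(y ~ x1 + x2, data = subset(dat, d == 0))
mu1_hat <- predict(mu1_fit, newdata = dat)
mu0_hat <- predict(mu0_fit, newdata = dat)

# Unit-level AIPW scores, following the formula above
psi <- with(dat,
  mu1_hat - mu0_hat +
    d * (y - mu1_hat) / e_hat -
    (1 - d) * (y - mu0_hat) / (1 - e_hat))

ate_aipw <- mean(psi)
se_aipw  <- sd(psi) / sqrt(nrow(dat))   # influence-function-based standard error
c(estimate = ate_aipw, se = se_aipw)
```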

In Stata, the teffects aipw command provides a parametric implementation of the same estimator.

Targeted Learning and TMLE

Targeted Maximum Likelihood Estimation (TMLE) is a sophisticated doubly robust approach that:

  1. Fits an initial outcome model

  2. Updates it using propensity score information

  3. Targets the specific estimand of interest

TMLE has better finite-sample properties than basic AIPW and integrates well with machine learning (Super Learner).

Cross-Fitting: Essential for Machine Learning

Box: The Cross-Fitting Requirement (Double/Debiased Machine Learning)

When using machine learning methods (random forests, LASSO, boosting) to estimate propensity scores or outcome models, a critical issue arises: overfitting bias.

The problem: If you estimate $\hat{e}(X)$ or $\hat{\mu}(X)$ and compute treatment effects on the same data, ML's flexibility leads to overfitting. The nuisance function estimates are too tailored to the specific sample, and the resulting treatment effect estimate is biased—even with doubly robust methods.

The solution: Cross-fitting (Chernozhukov et al. 2018):

  1. Split the sample into $K$ folds (typically 5-10)

  2. For each fold $k$:

    • Estimate nuisance functions ($\hat{e}$, $\hat{\mu}$) on all data except fold $k$

    • Predict nuisance values for fold $k$ using these out-of-fold estimates

  3. Compute treatment effects using the cross-fitted predictions

  4. Average across folds

This is the core of Double/Debiased Machine Learning (DML).

Why it works: By estimating nuisance functions on different data than where they're applied, we avoid overfitting. Small errors in nuisance estimation don't create first-order bias in the treatment effect (this is called "Neyman orthogonality").
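A sketch of the cross-fitting mechanics on hypothetical data. For brevity the nuisance models here are simple parametric fits; in a real DML application they would be flexible ML learners, which is precisely when cross-fitting matters:

```r
K <- 5
folds <- sample(rep(1:K, length.out = nrow(dat)))     # random fold assignment
e_hat <- mu1_hat <- mu0_hat <- numeric(nrow(dat))

for (k in 1:K) {
  train    <- dat[folds != k, ]
  test_idx <- which(folds == k)

  # nuisance models fit on everything EXCEPT fold k
  # (simple parametric fits here; ML learners would be used in practice)
  ps_k  <- glm(d ~ x1 + x2, data = train, family = binomial)
  mu1_k <- lm(y ~ x1 + x2, data = subset(train, d == 1))
  mu0_k <- lm(y ~ x1 + x2, data = subset(train, d == 0))

  # out-of-fold predictions for fold k only
  e_hat[test_idx]   <- predict(ps_k,  newdata = dat[test_idx, ], type = "response")
  mu1_hat[test_idx] <- predict(mu1_k, newdata = dat[test_idx, ])
  mu0_hat[test_idx] <- predict(mu0_k, newdata = dat[test_idx, ])
}

# Plug the cross-fitted nuisances into the AIPW formula from Section 11.3
psi <- with(dat, mu1_hat - mu0_hat +
              d * (y - mu1_hat) / e_hat - (1 - d) * (y - mu0_hat) / (1 - e_hat))
c(estimate = mean(psi), se = sd(psi) / sqrt(nrow(dat)))
```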

Practical implication: Standard implementations like Stata's teffects use parametric models and don't require cross-fitting. But if you use flexible ML methods:

  • In R: Use DoubleML, grf, or AIPW with Super Learner (handles cross-fitting automatically)

  • In Python: Use econml (EconML) or DoubleML

  • Never use sklearn to fit a random forest propensity score and then apply it to the same data for weighting

See Chapter 21 for detailed coverage of DML and its theoretical foundations.


11.4 Continuous and Multivalued Treatments

Beyond Binary Treatment

Many treatments are not binary:

  • Years of education (continuous)

  • Drug dosage (continuous)

  • Type of program (multivalued)

The propensity score framework extends to these cases.

Generalized Propensity Score

For continuous treatment $D$, the generalized propensity score is:

$$r(d, X) = f_{D \mid X}(d \mid X)$$

The conditional density of treatment given covariates.

Under weak unconfoundedness: $Y(d) \perp D \mid r(d, X)$

Dose-Response Estimation

The dose-response function maps treatment intensity to expected outcome:

$$\mu(d) = E[Y(d)]$$

Estimation approaches:

  1. Stratify on GPS: Group observations by GPS value, estimate $E[Y \mid D = d]$ within groups

  2. Inverse probability weighting: Weight by inverse of GPS (with stabilization)

  3. Outcome modeling: Specify $E[Y \mid D, X]$ and average over the covariate distribution

Implementation Note: IPW for Continuous Treatments

IPW for continuous treatments differs fundamentally from the binary case. With binary treatment, weights are based on propensity scores (probabilities). With continuous treatment, weights are based on probability density functions.

The stabilized weight for continuous treatment is:

$$w_i = \frac{f_D(D_i)}{f_{D \mid X}(D_i \mid X_i)}$$

where the numerator is the marginal density of treatment and the denominator is the conditional density given confounders. In practice, both are often assumed normal:

  • Numerator: $D_i \sim N(\bar{D}, \sigma_D^2)$

  • Denominator: $D_i \mid X_i \sim N(\hat{D}_i, \sigma_{\varepsilon}^2)$ from a regression of $D$ on $X$

Unlike binary IPW where extreme propensity scores (near 0 or 1) create problems, continuous IPW can produce extreme weights when an observation's treatment value is unlikely given their covariates. Stabilization and trimming remain important.
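A sketch of these normal-density stabilized weights on hypothetical data with a continuous treatment `d`:

```r
# Stabilized weights under normal density models for a continuous treatment d
num_fit <- lm(d ~ 1, data = dat)            # marginal model (numerator)
den_fit <- lm(d ~ x1 + x2, data = dat)      # conditional model (denominator)

num <- dnorm(dat$d, mean = fitted(num_fit), sd = sigma(num_fit))
den <- dnorm(dat$d, mean = fitted(den_fit), sd = sigma(den_fit))

w_stab <- num / den
summary(w_stab)                             # inspect for extreme weights; trim if needed

# Weighted regression of the outcome on treatment intensity (a simple dose-response)
coef(lm(y ~ d, data = dat, weights = w_stab))
```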

In R, the WeightIt package handles continuous treatments with method = "ps" and appropriate family specification. See Chapter 18 for implementation and Hirano & Imbens (2004) for the theoretical foundations.

Multivalued Treatments

With $J$ treatment levels, estimate separate propensity scores:

$$e_j(X) = P(D = j \mid X)$$

Then use multinomial IPW or matching within propensity strata.


11.5 Sensitivity Analysis

The Core Problem

Unconfoundedness cannot be tested. No matter how many covariates we include, there may be unobserved factors that bias our estimates.

Sensitivity analysis asks: How much unobserved confounding would be needed to change our conclusions?

Rosenbaum Bounds

Rosenbaum's approach asks: If unobserved confounding exists, how large would the bias need to be to explain away the treatment effect?

Define $\Gamma$ as the maximum ratio of treatment odds for two units with the same observed covariates:

$$\frac{1}{\Gamma} \leq \frac{P(D=1 \mid X, U)/P(D=0 \mid X, U)}{P(D=1 \mid X, U')/P(D=0 \mid X, U')} \leq \Gamma$$

Under no unmeasured confounding, $\Gamma = 1$. Larger $\Gamma$ represents more potential confounding.

For each value of $\Gamma$, compute bounds on the p-value. Find the minimum $\Gamma$ at which significance is lost. This is the study's sensitivity.

Example: Interpreting Rosenbaum Bounds

Suppose an effect is significant at $\Gamma = 1$ (no confounding) and remains significant up to $\Gamma = 2$. This means an unobserved confounder would need to double the odds of treatment to explain away the finding—a substantial amount of confounding.

Oster's Method: Coefficient Stability

Oster (2019) extends Altonji, Elder & Taber (2005) to assess how coefficient estimates change as controls are added:

  1. Estimate the treatment effect without controls: $\tilde{\tau}$

  2. Estimate with observed controls: $\hat{\tau}$

  3. Calculate how much $\hat{\tau}$ would change if unobservables were as important as observables

The relative degree of selection $\delta$ measures how much selection on unobservables would need to exceed selection on observables to produce a zero effect. A common approximation is:

$$\delta \approx \frac{\hat{\tau}}{\tilde{\tau} - \hat{\tau}} \times \frac{\hat{R}^2 - \tilde{R}^2}{R_{max} - \hat{R}^2}$$

where $R_{max}$ is a hypothesized maximum $R^2$, $\hat{R}^2$ is the $R^2$ from the controlled regression, and $\tilde{R}^2$ is the $R^2$ from the uncontrolled regression.

If $|\delta| > 1$, unobservables would need to be more important than observables to eliminate the effect—providing some reassurance.

E-Values: The Intuitive Sensitivity Measure

The E-value (VanderWeele & Ding 2017) has become the preferred sensitivity measure because of its intuitive interpretation. It asks: What is the minimum strength of association between an unmeasured confounder and both treatment and outcome needed to explain away the observed effect?

$$E\text{-value} = RR + \sqrt{RR \times (RR - 1)}$$

where $RR$ is the observed risk ratio (or an approximation for other effect measures).

How to Interpret E-Values

An E-value of 3 means: An unmeasured confounder would need to be associated with both treatment and outcome by a risk ratio of at least 3 to fully explain away the observed effect.

E-value | Interpretation
1.0 | No confounding needed (null effect)
1.5 | Modest confounding could explain result
2.0 | Moderate confounding required
3.0+ | Strong confounding required
5.0+ | Very robust to confounding

Why E-values are intuitive: Unlike Rosenbaum's $\Gamma$ (which measures odds ratios for treatment assignment), the E-value is directly comparable to known risk ratios. If your strongest observed confounder has a risk ratio of 2 with both treatment and outcome, and your E-value is 4, an unobserved confounder would need to be twice as strong to explain away your result.

Computing E-values in practice:
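A minimal sketch applying the formula above; the risk ratios are hypothetical, and the EValue package provides the same calculation along with conversions from other effect measures:

```r
# E-value from an observed risk ratio (formula above)
evalue <- function(rr) {
  rr <- ifelse(rr < 1, 1 / rr, rr)   # for protective effects, invert first
  rr + sqrt(rr * (rr - 1))
}

evalue(1.8)   # hypothetical point estimate, RR = 1.8
evalue(1.3)   # hypothetical CI bound closest to the null, RR = 1.3
```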

E-value for the confidence interval: Report E-values for both the point estimate and the confidence interval bound closest to the null. If even the CI bound has a high E-value, the finding is robust.

Benchmarking

Compare sensitivity parameters to observed confounders:

  • How strongly are observed confounders related to treatment and outcome?

  • Would an unobserved confounder need to be stronger than any observed confounder?

  • Are there plausible candidates for such strong confounders?

Practical Box: Sensitivity Analysis Reporting

Report:

  • Point estimate and confidence interval under unconfoundedness

  • Rosenbaum $\Gamma$ at which significance is lost

  • Oster's $\delta$ (selection ratio)

  • E-value for the point estimate and confidence interval

  • Comparison to strength of observed confounders

  • Discussion of plausible unobserved confounders


11.6 When Is Conditional Ignorability Credible?

The Fundamental Question

No statistical test can verify unconfoundedness. Its credibility rests on substantive arguments about the selection process.

Key question: Given what we know about how treatment is assigned, is it plausible that all relevant factors are observed?

Institutional Knowledge

The strongest case for unconfoundedness comes from understanding the selection mechanism:

  • How are treatments assigned?

  • What information is available to decision-makers?

  • What factors plausibly influence selection?

If the selection process is well-understood and observed in the data, unconfoundedness is more credible.

Box: Target Trial Emulation

Epidemiologists have developed a useful framework for designing observational studies: target trial emulation (Hernán & Robins 2016).

The idea: Before analyzing observational data, specify the randomized trial you would ideally run. What would be the eligibility criteria? Treatment assignment? Follow-up period? Primary outcome? Then ask: How well can your observational study emulate this trial?

This discipline forces clarity about:

  • Time zero: When does follow-up begin? (Avoids immortal time bias)

  • Eligibility: Who is in the study population?

  • Treatment definition: What exactly is being compared?

  • Assignment mechanism: What determines who gets treated?

When the observational study cannot emulate the target trial—because of confounding, selection, or measurement issues—at least the limitations are explicit.

This framework is equally valuable in economics. Before running regressions, specify the experiment you wish you could run. Then assess how well your observational design approximates it.

Example: Hospital Choice

Studying the effect of hospital quality on outcomes, selection depends on:

  • Patient preference and information

  • Distance and transportation

  • Referral patterns

  • Emergency vs. planned admission

If we observe these factors, we might argue for conditional ignorability. But patients may choose based on unobserved health factors or private information—undermining the assumption.

Data Quality

Better data makes unconfoundedness more plausible:

  • Administrative data: Often captures the actual selection process

  • Rich longitudinal data: Pre-treatment outcomes may proxy for unobserved factors

  • Multiple measures: Different proxies for the same construct reduce measurement error

The Selection Story

A credible analysis tells a clear story about selection:

  1. Who gets treated and why? Describe the selection process.

  2. What do we observe? List the covariates and why they matter.

  3. What might we miss? Acknowledge potential unobserved confounders.

  4. How sensitive are results? Show sensitivity analysis.

Validation Through Placebo Tests

While unconfoundedness cannot be directly tested, placebo tests offer indirect evidence about its plausibility.

The logic: Estimate a treatment effect on an outcome that should not be affected by treatment. If you find an effect, something is wrong—likely unobserved confounding.

Common placebo outcomes:

  • Lagged outcomes: Use a pre-treatment measure of the outcome. The treatment cannot have caused changes in the past.

  • Predetermined covariates: Variables fixed before treatment should show no "effect."

  • Conceptually unrelated outcomes: Outcomes with no plausible causal link to treatment.

Example: LaLonde Placebo Test

LaLonde (1986) and subsequent analyses used 1975 earnings as a placebo outcome for a job training program that occurred afterward. The true treatment effect on 1975 earnings is zero by construction.

Using experimental data: placebo estimates are close to zero (as expected).

Using nonexperimental data: placebo estimates are large and negative, even with modern methods—suggesting that selection into treatment is correlated with earnings trajectories in ways the observed covariates do not capture.

Interpreting placebo tests:

  • Null placebo effect: Consistent with (but does not prove) unconfoundedness. The covariates may capture selection.

  • Non-null placebo effect: Strong evidence against unconfoundedness. Either important confounders are missing, or the confounders affect the outcome differently in placebo vs. actual periods.

Multiple pretreatment periods strengthen placebo tests: When several lagged outcomes are available, testing for "effects" on each provides more convincing evidence. Imbens, Rubin & Sacerdote (2001), studying lottery winners, used six years of pre-winning earnings as placebos—their null effects across all years strongly supported unconfoundedness.

Practical Guidance: Placebo Test Checklist

  • Identify outcomes treatment cannot plausibly affect: lagged outcomes, predetermined covariates, or conceptually unrelated outcomes

  • Exclude the placebo outcome from the conditioning set before re-estimating

  • Apply the same estimator and specification used in the main analysis

  • Interpret: a non-null placebo "effect" is strong evidence against unconfoundedness; a null result is consistent with it but does not prove it

Warning Signs

Be skeptical of unconfoundedness when:

  • Treatment depends on private information (patient symptoms, student motivation)

  • Selection is based on anticipated outcomes (Roy model selection)

  • Strong economic incentives drive selection

  • Similar studies with better identification yield different results

  • Sensitivity analysis shows results are fragile

Box: The LaLonde Benchmark—What 40 Years Taught Us

In 1986, Robert LaLonde asked a simple question: Can nonexperimental methods replicate the results of a randomized experiment? Using data from the National Supported Work demonstration—a job training program with a randomized control group—he compared experimental estimates to those from matching trainees to observational comparison groups.

LaLonde's original conclusion was stark: Nonexperimental methods failed. Estimates varied wildly across specifications and often had the wrong sign. His paper became a foundational critique of observational methods and helped spark the credibility revolution.

What we've learned since (Imbens & Xu 2025):

  1. Overlap matters enormously. LaLonde's comparison groups differed dramatically from trainees—average 1975 earnings of $14,000–$19,000 versus $3,000 for trainees. With such poor overlap, estimates relied on extrapolation.

  2. With overlap ensured, estimator choice matters less. Modern methods (matching, IPW, AIPW, causal forests) yield similar estimates once samples are trimmed to ensure comparable units exist in both groups.

  3. But similar estimates don't guarantee causal validity. The LDW-CPS sample produces ATT estimates close to the experimental benchmark using various modern methods. Success? Not quite: placebo tests using 1975 earnings (before treatment) fail badly, suggesting unconfoundedness does not hold.

  4. Recovering average effects is easier than heterogeneous effects. Even when ATT estimates match experimental benchmarks, conditional average treatment effects (CATTs) diverge substantially between experimental and nonexperimental analyses.

The nuanced answer to LaLonde's question: Sometimes nonexperimental methods can replicate experimental benchmarks—and we now have better tools to assess when. The key is not the estimation method but whether (a) overlap exists and (b) unconfoundedness is plausible. Placebo tests and sensitivity analyses help evaluate the latter.

The sobering implication: Methods that robustly estimate the statistical estimand (covariate-adjusted treatment-control difference) may still fail to recover the causal estimand if unobserved confounding exists. No amount of methodological sophistication substitutes for a credible design.

Comparison to Other Methods

Selection on observables is one tool among many. Consider:

If | Consider instead
A natural experiment exists | IV, DiD, or RD
Panel data available | Fixed effects to control time-invariant confounders
Selection on unobservables likely | IV or bounds
Key confounder is unobserved but can be proxied | Proxy variable methods

Running Example Connection: China's Growth

Selection on observables (SOO) is fundamentally a micro-level strategy. It assumes we can identify and measure all relevant confounders—plausible when studying individual choices where selection depends on observable characteristics. But for macro questions like what caused China's post-1978 growth, the approach breaks down entirely. We cannot list all the confounders affecting both policy choices (market liberalization, trade openness, institutional reform) and outcomes (GDP growth). Even if we could, we would have only one China to observe. Selection on observables requires comparing treated and control units with similar characteristics—but there is no control China. This is why macro questions require the triangulation of methods discussed in Chapters 1 and 23.


Practical Guidance

Method Selection

Situation | Recommended Method
Unconfoundedness plausible, good overlap | Any method; AIPW preferred
Concern about functional form | Matching or AIPW with flexible models
Extreme propensity scores | Trim or match; avoid pure IPW
Multiple confounders, continuous treatment | GPS methods
Concern about unobserved confounding | SOO with extensive sensitivity analysis
Strong concern about unobserved confounding | Consider a different identification strategy

Common Pitfalls

Pitfall 1: Controlling for Post-Treatment Variables

Including variables affected by treatment biases estimates—often severely.

How to avoid: Map out the causal structure. Only control for pre-treatment confounders.

Pitfall 2: Ignoring the Overlap Assumption

If propensity scores are near 0 or 1, there is no comparable counterfactual. Estimates rely on extrapolation. The LaLonde data illustrate this dramatically: without ensuring overlap, ATT estimates range from −$8,000 to +$2,000 depending on the estimator used.

How to avoid: Plot propensity score distributions by treatment status. Trim or match on propensity scores to ensure common support. Report overlap diagnostics. With good overlap, estimates become much more stable across methods—often more stable than any other specification choice.

Pitfall 3: Claiming Causation Without Defending Unconfoundedness

Adding controls does not automatically justify causal interpretation.

How to avoid: Explicitly state and defend the unconfoundedness assumption. Conduct sensitivity analysis.

Pitfall 4: Over-Relying on Propensity Score Prediction

A good propensity score model predicts treatment. But prediction quality is not the goal—balance is.

How to avoid: Focus on covariate balance, not prediction accuracy. Use balance diagnostics.

Pitfall 5: The Table 2 Fallacy

When you estimate $Y = \alpha + \tau D + \beta X + \varepsilon$ and report both $\hat{\tau}$ and $\hat{\beta}$, it's tempting to interpret $\hat{\beta}$ causally too. This is almost always wrong.

The regression is designed to identify the effect of $D$ on $Y$ by controlling for $X$. But the coefficient on $X$ does not have a causal interpretation—the design has not controlled for the confounders of the $X \to Y$ relationship.

Example: Regressing wages on education and parental income, parental income "controls for" family background when estimating returns to education. But the coefficient on parental income does NOT estimate the causal effect of parental income on wages—that would require controlling for its confounders (parental education, geography, genetics, etc.).

The pattern: Papers often present regression results with one causally-interpreted coefficient (the treatment) and many descriptive coefficients (the controls). Readers—and sometimes authors—interpret all coefficients causally. This is the "Table 2 Fallacy" (Westreich and Greenland 2013).

How to avoid: Only interpret causally the coefficient you've designed the study to identify. Label control variable coefficients as "adjustment factors" or "associations," not effects.

Implementation Checklist

Design:

  • Specify the selection story and the target trial you are trying to emulate

  • Draw the causal graph; include only pre-treatment confounders and avoid bad controls

Estimation:

  • Estimate propensity scores, check overlap, and trim if necessary

  • Prefer a doubly robust estimator (AIPW/TMLE); cross-fit if nuisance models use ML

Inference and Robustness:

  • Check covariate balance after adjustment (standardized differences, variance ratios)

  • Run placebo tests and sensitivity analyses (Rosenbaum bounds, E-values, Oster's $\delta$)

Reporting:

  • Report overlap and balance diagnostics, estimates from multiple methods, and the full sensitivity results


Qualitative Bridge

What Quantitative Balance Cannot Capture

Matching and weighting achieve balance on observed covariates. But:

  • Do these covariates capture what matters for selection?

  • Are the measures valid and reliable?

  • What unobserved factors might remain?

Qualitative research can help answer these questions.

Using Qualitative Knowledge for Selection Stories

Interviews with decision-makers: How do they assign treatment? What factors do they consider?

Case studies: Detailed examination of selection in specific instances.

Expert knowledge: Domain specialists often know the selection process better than data reveal.

Example: Program Evaluation

Evaluating a job training program using observational data:

Quantitative: Match participants to non-participants on demographics, prior employment, education.

Qualitative: Interview case workers who refer clients. Learn that they refer the "most motivated" clients—an unobserved confounder.

This qualitative insight suggests the matching estimate is upward biased and motivates searching for better identification.


Integration Note

Connections to Other Chapters

Chapter | Connection
Ch. 9 (Causal Framework) | Formalizes the unconfoundedness assumption
Ch. 10 (Experiments) | Experiments eliminate the selection bias that SOO must assume away
Ch. 12 (IV) | IV handles selection on unobservables when SOO fails
Ch. 20 (Heterogeneity) | Propensity scores for heterogeneous effects (CATE)
Ch. 21 (ML for Causal) | Machine learning for flexible propensity/outcome models

When to Use Selection on Observables

SOO is appropriate when:

  • No better identification strategy is available

  • Rich data capture the selection process

  • Institutional knowledge supports unconfoundedness

  • Sensitivity analysis shows robustness

SOO is inappropriate when:

  • Strong unobserved confounders are likely

  • A cleaner identification strategy exists

  • Results are sensitive to plausible confounding

  • The selection process is poorly understood


Summary

Key takeaways:

  1. Selection on observables assumes all confounders are observed: This allows conditioning on covariates to identify causal effects, but the assumption is untestable.

  2. Multiple methods implement the assumption: Regression, matching, propensity score weighting, and doubly robust estimation all rely on the same identifying assumption but differ in how they adjust for confounders.

  3. Doubly robust methods provide insurance: AIPW is consistent if either the outcome model or propensity score is correct—a valuable hedge against misspecification.

  4. Overlap is as important as balance: Before worrying about covariate balance, verify that treated and control groups have common support in their propensity score distributions. With poor overlap, estimates depend on extrapolation and can vary wildly across methods. Trimming to ensure overlap often matters more than estimator choice.

  5. Placebo tests probe unconfoundedness: Since unconfoundedness cannot be tested directly, estimate "effects" on outcomes that should be unaffected (pre-treatment measures, conceptually unrelated variables). Failed placebo tests are strong evidence against unconfoundedness; passed tests increase credibility.

  6. Sensitivity analysis is essential: We must assess how much unobserved confounding would be needed to overturn findings. Rosenbaum bounds, E-values, and Oster's method provide complementary approaches.

  7. Credibility requires substantive defense: Statistical methods cannot establish unconfoundedness. It must be defended with institutional knowledge and careful reasoning about selection.

Returning to the opening question: We can credibly estimate causal effects by controlling for confounders when we have good reason to believe all relevant confounders are observed and controlled for. "Good reason" comes not from statistical tests but from understanding the selection process, having rich data, and demonstrating robustness to potential unobserved confounding. When these conditions are met, selection on observables methods are powerful tools. When they are not, we should seek better identification strategies.


Further Reading

Essential

  • Rosenbaum & Rubin (1983). "The Central Role of the Propensity Score in Observational Studies for Causal Effects." Biometrika. The foundational paper.

  • Imbens (2004). "Nonparametric Estimation of Average Treatment Effects Under Exogeneity: A Review." RESTAT.

For Deeper Understanding

  • Imbens & Rubin (2015). Causal Inference for Statistics, Social, and Biomedical Sciences, Chapters 12-18. Comprehensive textbook treatment.

  • Rosenbaum (2002). Observational Studies, 2nd ed. Deep dive into matching and sensitivity analysis.

The Epidemiological Perspective

  • Hernán & Robins (2020). Causal Inference: What If. Freely available online. The definitive treatment of g-methods.

  • Hernán & Robins (2016). "Using Big Data to Emulate a Target Trial." AJE. Introduces target trial emulation.

  • Robins (1986). "A New Approach to Causal Inference in Mortality Studies." Mathematical Modelling. The original g-computation paper.

Doubly Robust Methods

  • Bang & Robins (2005). "Doubly Robust Estimation in Missing Data and Causal Inference Models." Biometrics.

  • Kennedy (2016). "Semiparametric Theory and Empirical Processes in Causal Inference." Survey of modern methods.

Model Interpretation and Marginal Effects

  • Arel-Bundock, Greifer & Heiss (2024). "How to Interpret Statistical Models Using marginaleffects in R and Python." Journal of Statistical Software 111(9), 1–32. Clarifies the confusing terminology around marginal effects and provides a unified computational framework.

  • Long (1997). Regression Models for Categorical and Limited Dependent Variables. Classic treatment of interpreting nonlinear models.

Sensitivity Analysis

  • Rosenbaum (2002). Observational Studies, Chapters 4-5. Rosenbaum bounds.

  • Oster (2019). "Unobservable Selection and Coefficient Stability." JBES.

  • VanderWeele & Ding (2017). "Sensitivity Analysis in Observational Research." Annals of Internal Medicine. E-values.

Continuous and Multivalued Treatments

  • Hirano & Imbens (2004). "The Propensity Score with Continuous Treatments." In Applied Bayesian Modeling and Causal Inference from Incomplete-Data Perspectives. The foundational paper on generalized propensity scores.

  • Imbens (2000). "The Role of the Propensity Score in Estimating Dose-Response Functions." Biometrika. Early treatment of dose-response with GPS.

The LaLonde Literature

  • Imbens & Xu (2025). "Comparing Experimental and Nonexperimental Methods: What Lessons Have We Learned Four Decades after LaLonde (1986)?" Journal of Economic Perspectives 39(4): 173–202. Essential retrospective with replication data and tutorial at https://yiqingxu.org/tutorials/lalonde/.

  • LaLonde (1986). "Evaluating the Econometric Evaluations of Training Programs with Experimental Data." AER. The original benchmark study.

  • Dehejia & Wahba (1999). "Causal Effects in Nonexperimental Studies." JASA. Classic matching application using LaLonde data.

  • Smith & Todd (2005). "Does Matching Overcome LaLonde's Critique of Nonexperimental Estimators?" Journal of Econometrics. Important follow-up showing sensitivity to comparison group choice.

Other Applications

  • Heckman, Ichimura & Todd (1997). "Matching as an Econometric Evaluation Estimator." RES.

  • Imbens, Rubin & Sacerdote (2001). "Estimating the Effect of Unearned Income on Labor Earnings." AER. Lottery study with strong placebo test support for unconfoundedness.


Exercises

Conceptual

  1. Explain why controlling for a collider can introduce bias even when the collider is correlated with both treatment and outcome. Give an example.

  2. A researcher estimates the effect of college quality on earnings by matching students on SAT scores and family income. A colleague suggests also matching on post-college occupation.

    • Is occupation a good control?

    • Draw a DAG illustrating the issue.

    • How would including occupation affect the estimate?

Applied

  1. Using observational data on a job training program (e.g., Dehejia-Wahba data):

    • Estimate the ATT using (a) OLS, (b) nearest-neighbor matching, (c) IPW, and (d) AIPW

    • Assess covariate balance before and after matching

    • How much do the estimates differ? Why?

  2. Conduct a sensitivity analysis for your preferred estimate from the previous exercise:

    • Calculate Rosenbaum's $\Gamma$ at which significance is lost

    • Calculate the E-value

    • Discuss: How much unobserved confounding would be needed to overturn the finding?

Discussion

  1. "Selection on observables is always inferior to quasi-experimental methods because the key assumption cannot be tested." Evaluate this claim. Under what circumstances might SOO be preferable to (or at least as credible as) IV or DiD?

Data Exercise

  1. LaLonde replication (data available at https://yiqingxu.org/tutorials/lalonde/):

    • Using the LDW-CPS data, estimate the ATT using regression, nearest-neighbor matching, IPW, and AIPW

    • Plot propensity score distributions for treated and control groups. Is there adequate overlap?

    • Trim the sample to improve overlap (e.g., remove units with propensity scores outside [0.1, 0.9]). How do estimates change? Do they converge across methods?

    • Conduct a placebo test using 1975 earnings as the outcome (excluding 1975 earnings from conditioning variables). What do you find?

    • Discuss: Do your results support causal interpretation of the ATT estimate? What does this exercise teach about the relationship between statistical and causal estimands?


Technical Appendix

A.1 Propensity Score Weighting Derivation

Under unconfoundedness and overlap:

$$E\left[\frac{DY}{e(X)}\right] = E\left[E\left[\frac{DY}{e(X)} \,\bigg|\, X\right]\right] = E\left[\frac{E[DY \mid X]}{e(X)}\right]$$

Since $E[DY \mid X] = E[Y \mid D=1, X] \cdot P(D=1 \mid X) = E[Y \mid D=1, X] \cdot e(X)$:

$$E\left[\frac{DY}{e(X)}\right] = E[E[Y \mid D=1, X]] = E[E[Y(1) \mid X]] = E[Y(1)]$$

Similarly, $E\left[\frac{(1-D)Y}{1-e(X)}\right] = E[Y(0)]$.

A.2 Double Robustness Proof (Sketch)

The AIPW estimator can be written as:

$$\hat{\tau}_{AIPW} = \frac{1}{n}\sum_i \hat{\phi}(Z_i)$$

where $\hat{\phi}$ is the estimated influence function. Under regularity conditions:

If the propensity score is correct:

  • The IPW term consistently estimates the bias in the outcome model

  • The combination is consistent

If the outcome model is correct:

  • The outcome model term is consistent

  • The IPW correction term has expectation zero

  • The combination is consistent

The key is that the cross-term has expectation zero when either model is correct.

A.3 Overlap and Positivity

The overlap or positivity assumption requires:

$$0 < e(X) < 1 \text{ for all } X \text{ in the support}$$

Without overlap:

  • Some treated units have no comparable controls (or vice versa)

  • IPW weights become infinite

  • Estimates rely on extrapolation, not data

Practical violations: Near-violations where $e(X) \approx 0$ or $e(X) \approx 1$ also cause problems through extreme weights and high variance.
