Chapter 11: Selection on Observables
Opening Question
When can we credibly estimate causal effects from observational data by controlling for the right variables—and how do we know if we have?
Chapter Overview
Randomized experiments are not always feasible. When they are not, researchers often turn to observational data—data where treatment was not randomly assigned but chosen by individuals, institutions, or circumstance. Can we still learn about causal effects?
The answer depends on a critical assumption: that we observe all the variables that jointly affect treatment and outcomes. If we do—if "selection is on observables"—then adjusting for these confounders recovers causal effects. This chapter develops methods for such adjustment: regression, matching, propensity score weighting, and doubly robust estimation. We also confront the uncomfortable truth that the key assumption is untestable, and develop sensitivity analyses to assess how much unobserved confounding would be needed to overturn our conclusions.
Selection on observables methods are ubiquitous in applied research. They are also frequently misused. The goal of this chapter is not just to teach the methods, but to develop judgment about when they are credible and how to defend—or critique—their application.
What you will learn:
When regression identifies causal effects (and when it does not)
How matching and propensity score methods work
The logic of doubly robust estimation
How to conduct and interpret sensitivity analyses
When selection on observables is a credible assumption
Prerequisites: Chapter 9 (The Causal Framework), Chapter 3 (Statistical Foundations)
11.1 Regression for Causal Inference
When Does OLS Identify Causal Effects?
Regression is the workhorse of empirical research. But when does a regression coefficient have a causal interpretation?
Consider the regression: Yi=α+τDi+Xi′β+εi
The coefficient τ estimates the causal effect of D on Y if:
Assumption 11.1: Conditional Unconfoundedness
(Y(0),Y(1))⊥D∣X
Conditional on observed covariates X, treatment assignment is independent of potential outcomes.
This is the same as unconfoundedness from Chapter 9. In regression terms, it means that after controlling for X, the remaining variation in D is as good as random.
When might this hold?
When X includes all variables that affect both treatment selection and outcomes
When institutional knowledge suggests that selection depends only on observables
When rich administrative data captures the selection process
When does it fail?
When unobserved factors (ability, motivation, preferences) affect both D and Y
When selection depends on private information not in the data
Almost always, to some degree
The Bad Controls Problem
A common mistake is to "control for everything available." This can introduce bias rather than remove it.
Bad controls are variables that are affected by treatment or that open backdoor paths:
Post-treatment variables: Controlling for a consequence of treatment blocks part of the causal effect.
Mediators: If D→M→Y, controlling for M removes the indirect effect.
Colliders: Controlling for a common effect of two variables creates spurious association between them.
Example: Bad Control
Estimating the effect of education on wages, you control for occupation. But occupation is partly caused by education. Controlling for it asks: "What is the effect of education, holding occupation fixed?" This removes the channel through which education raises wages (better jobs) and may reverse the sign of the effect.
The DAG test: Draw the causal graph. A variable is a bad control if:
It is a descendant of treatment (post-treatment)
It is a collider on a path between treatment and outcome
Conditioning on it opens a previously blocked path
Functional Form Sensitivity
Regression imposes functional form assumptions. With continuous covariates or treatment:
E[Y∣D,X]=α+τD+X′β
assumes linearity and additivity. If the true relationship is nonlinear or includes interactions, the estimate of τ depends on the distribution of X and need not equal the ATE.
Solutions:
Include flexible functions of X (polynomials, splines)
Interact treatment with covariates
Use nonparametric methods (matching, weighting)
Practical Guidance: Regression for Causal Inference
Regression is appropriate when:
Unconfoundedness is plausible given X
The functional form is approximately correct
You avoid bad controls
Be cautious when:
Key confounders are likely unobserved
The treatment effect may vary with X
Covariate distributions differ substantially between treated and control
Box: G-Computation—The Epidemiological Perspective
Epidemiologists call the outcome-modeling approach g-computation or standardization—part of the broader "g-methods" tradition developed by James Robins and colleagues (see Section 9.7 for context on the epidemiological contribution to causal inference). The "g" stands for "generalized," referring to Robins' (1986) generalization to time-varying treatments.
The g-computation algorithm:
Fit an outcome model: E^[Y∣D,X]
For each unit, predict outcomes under treatment: Y^i(1)=E^[Y∣D=1,Xi]
For each unit, predict outcomes under control: Y^i(0)=E^[Y∣D=0,Xi]
Average: ATE^ = (1/n) ∑i [Y^i(1) − Y^i(0)]
This is equivalent to the regression approach when the outcome model is correctly specified and treatment effects are homogeneous. But g-computation naturally accommodates:
Nonlinear outcome models (logistic, Poisson)
Treatment effect heterogeneity (via interactions)
Marginal effects that differ from conditional effects
The key insight: instead of interpreting a regression coefficient, compute predicted outcomes under each treatment scenario and compare them. This "plug-in" approach is conceptually cleaner when effects are heterogeneous or outcomes are nonlinear.
See Hernán & Robins (2020), Chapter 13, for full treatment including time-varying extensions.
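The following sketch shows the four steps in R, using simulated data with illustrative variable names (y, d, x1, x2); it is a minimal illustration rather than a full implementation.

```r
# A minimal g-computation sketch (simulated data; variable names are illustrative)
set.seed(42)
n  <- 2000
x1 <- rnorm(n); x2 <- rbinom(n, 1, 0.4)
d  <- rbinom(n, 1, plogis(-0.5 + 0.8 * x1 + 0.6 * x2))    # confounded treatment
y  <- 1 + 2 * d + 1.5 * x1 - x2 + 0.5 * d * x1 + rnorm(n) # heterogeneous effect
dat <- data.frame(y, d, x1, x2)

# Step 1: fit an outcome model (interactions allow effect heterogeneity)
fit <- lm(y ~ d * (x1 + x2), data = dat)

# Steps 2-3: predict potential outcomes for every unit under d = 1 and d = 0
y1_hat <- predict(fit, newdata = transform(dat, d = 1))
y0_hat <- predict(fit, newdata = transform(dat, d = 0))

# Step 4: average the unit-level differences to get the ATE
ate_gcomp <- mean(y1_hat - y0_hat)
ate_gcomp
```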
Box: Marginal Effects—What Are We Actually Estimating?
The term "marginal effect" means different things to different disciplines, creating confusion that has persisted for decades. When a researcher reports a "marginal effect," they might mean any of the following:
Conditional effect: ∂E[Y∣X]/∂X evaluated at specific values of X
Marginal effect at the mean (MEM): the effect evaluated at X = X̄; differs from the AME when the model is nonlinear
Average marginal effect (AME): (1/n) ∑i ∂E[Y∣Xi]/∂X, the derivative averaged over the sample
In a linear model (Y=α+βX+ε), these all equal β. But with nonlinear models—logit, probit, Poisson, or any model with interactions—they diverge, sometimes substantially.
Why this matters for causal inference: The causal estimands we care about—ATE, ATT—are defined as averages over populations. The ATE is E[Y(1)−Y(0)], which corresponds to the AME, not the MEM. When you report a logit coefficient or an effect "at the mean," you're not reporting the ATE.
Consider a logistic regression for employment (Y) on a training program (D):
The coefficient β^ is a log odds ratio—not directly interpretable as a probability change
The MEM evaluates ∂P(Y=1)/∂D at mean covariate values—but no one has exactly average characteristics
The AME averages the marginal effect across all individuals—this estimates the ATE under unconfoundedness
The g-computation algorithm in the previous box computes the AME: predict outcomes for everyone under treatment, predict under control, and average the difference. This is what we want for causal inference.
Practical implications:
For linear models with no interactions: report regression coefficients
For nonlinear models: report AMEs, not coefficients or MEMs
For models with interactions: compute effects at substantively meaningful covariate values, or report AMEs
Always clarify which quantity you're reporting
Terminology warning: The word "marginal" is overloaded in statistics. In the AME/MEM context above, "marginal" means a derivative (∂y/∂x). But in multilevel/mixed models, "marginal" means something entirely different: population-average effects (integrating over the distribution of random effects) versus conditional effects (for a typical cluster with random effects set to zero). These are distinct concepts that happen to share a name. When reading papers using mixed models, check which meaning applies—the marginal (population-average) effect in a GLMM is not the same as the AME from this box, though both involve averaging.
The marginaleffects package (Arel-Bundock, Greifer & Heiss 2024) provides a unified framework for computing predictions, comparisons, and slopes across more than 100 model types in R and Python; see Chapter 18 for implementation. For a thorough treatment of the terminology and the pitfalls of each approach, see Arel-Bundock, Greifer & Heiss (2024).
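As a minimal illustration, the sketch below fits a logit on simulated data (illustrative variable names) and computes the AME, assuming the package's avg_comparisons() interface is available.

```r
# Sketch: average marginal effect of a binary treatment in a logit model
# (assumes marginaleffects::avg_comparisons(); data and names are illustrative)
library(marginaleffects)

set.seed(1)
n        <- 1500
age      <- rnorm(n, 35, 8)
educ     <- rnorm(n, 12, 2)
trained  <- rbinom(n, 1, plogis(-3 + 0.03 * age + 0.15 * educ))
employed <- rbinom(n, 1, plogis(-2 + 0.6 * trained + 0.02 * age + 0.12 * educ))
dat <- data.frame(employed, trained = factor(trained), age, educ)

logit_fit <- glm(employed ~ trained + age + educ, family = binomial, data = dat)

# The coefficient on `trained` is a log odds ratio; the AME below averages
# P_hat(Y=1 | trained=1, X_i) - P_hat(Y=1 | trained=0, X_i) over all units
avg_comparisons(logit_fit, variables = "trained")
```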
11.2 Propensity Score Methods
The Propensity Score
The propensity score is the probability of treatment given covariates:
e(X)=P(D=1∣X)
Rosenbaum and Rubin (1983) showed a remarkable result, building on ideas that developed simultaneously in statistics and epidemiology:
Theorem 11.1: Propensity Score Theorem
If (Y(0),Y(1))⊥D∣X, then (Y(0),Y(1))⊥D∣e(X)
Conditioning on the propensity score is sufficient for unconfoundedness.
Intuition: The propensity score summarizes all the information in X relevant for treatment assignment. Units with the same propensity score are equally likely to be treated, regardless of their specific covariate values. Comparing treated and control units with the same propensity score is like comparing within a mini-experiment.
Dimension reduction: Instead of matching on many covariates X, we can match on a single scalar e(X).
Estimating the Propensity Score
The propensity score is typically estimated using logistic regression:
log(e(X) / (1 − e(X))) = X′γ
Then e^(Xi) = logit⁻¹(Xi′γ^).
Alternatives:
Probit regression
Machine learning methods (random forests, boosting, LASSO)
Covariate balancing propensity scores (CBPS)
The goal is not to predict treatment well in a forecasting sense, but to achieve covariate balance between treated and control groups after adjustment.
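A minimal sketch of propensity score estimation by logistic regression, using simulated data with illustrative variable names:

```r
# Sketch: estimating propensity scores by logistic regression
set.seed(7)
n  <- 2000
x1 <- rnorm(n); x2 <- rbinom(n, 1, 0.5)
d  <- rbinom(n, 1, plogis(-1 + x1 + 0.7 * x2))
dat <- data.frame(d, x1, x2)

ps_fit <- glm(d ~ x1 + x2, family = binomial, data = dat)
dat$ps <- predict(ps_fit, type = "response")   # e_hat(X_i)

summary(dat$ps)   # look for values near 0 or 1 (potential overlap problems)
```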
Matching
Matching pairs treated units with similar control units:
For each treated unit, find control units with similar X or e(X)
Estimate the treatment effect by comparing matched pairs
Types of matching:
Exact: match on identical X. No approximation, but fails with many covariates.
Nearest neighbor: match to the closest unit(s). Simple, but may use poor matches.
Caliper: match only within distance c. Avoids bad matches, but may discard units.
Kernel: weight all controls by distance. Uses all data, but requires a bandwidth choice.
Coarsened exact: exact match on coarsened X. Transparent, but sensitive to the coarsening.
With or without replacement:
With replacement: Each control can match multiple treated units. More flexible but inference is more complex.
Without replacement: Each control matches at most one treated unit. Order of matching matters.
ATT Matching Estimator
τ^ATT = (1/n1) ∑{i: Di=1} (Yi − Y^i(0))
where Y^i(0) is the (average) outcome of matched control units.
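A minimal sketch of 1:1 nearest-neighbor propensity score matching for the ATT, assuming the MatchIt package's matchit() and match.data() interface (simulated data, illustrative names):

```r
# Sketch: 1:1 nearest-neighbor propensity score matching for the ATT
library(MatchIt)

set.seed(7)
n  <- 2000
x1 <- rnorm(n); x2 <- rbinom(n, 1, 0.5)
d  <- rbinom(n, 1, plogis(-1 + x1 + 0.7 * x2))
y  <- 1 + 2 * d + x1 - 0.5 * x2 + rnorm(n)
dat <- data.frame(y, d, x1, x2)

m_out <- matchit(d ~ x1 + x2, data = dat, method = "nearest", distance = "glm")
summary(m_out)               # covariate balance before and after matching
m_dat <- match.data(m_out)   # the matched sample

# ATT: difference in means within the matched sample (1:1 without replacement)
with(m_dat, mean(y[d == 1]) - mean(y[d == 0]))
```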
Box: The Evolution from Matching to Doubly Robust Methods
Matching was the dominant approach in the 2000s, following influential papers by Rosenbaum and Rubin and by Dehejia and Wahba. However, modern best practice has shifted toward doubly robust (AIPW) and weighting-based methods for several reasons:
Limitations of matching:
Matching discards data: Unmatched controls are thrown away, reducing precision
Arbitrary matching choices: Caliper width, with/without replacement, matching order—all affect results
Variance estimation is complex: Standard errors for matched estimators require special adjustments (Abadie & Imbens, 2006)
No double robustness: Misspecify the matching metric and estimates are biased
Advantages of AIPW/weighting:
Uses all data: Every observation contributes (with appropriate weights)
Double robustness: Consistent if either the propensity or outcome model is correct
Clean inference: Standard variance estimation works; easy to bootstrap
Natural integration with ML: Cross-fitting + AIPW enables flexible estimation (DML)
Current recommendation:
AIPW/TMLE: default choice for most applications
IPW: when only the propensity model is credible
Regression: when only the outcome model is credible
Matching: for transparency, sample construction, or when stakeholders expect it
Matching remains valuable for sample construction (creating a matched sample for subsequent analysis) and transparency (showing exactly which units are compared). But for primary causal estimation, doubly robust methods are now preferred.
Inverse Probability Weighting (IPW)
IPW reweights observations to create a pseudo-population where treatment is independent of covariates:
IPW Estimator for ATE
τ^IPW = (1/n) ∑i [ DiYi / e^(Xi) − (1−Di)Yi / (1 − e^(Xi)) ]
Intuition: Treated units with low propensity scores are "surprising"—they were unlikely to be treated but were. They carry more information about what control units would have experienced under treatment, so they receive higher weight.
Weights:
For ATE: wi = Di / e^(Xi) + (1−Di) / (1 − e^(Xi))
For ATT: wi = Di + (1−Di) e^(Xi) / (1 − e^(Xi))
Problems with IPW:
Extreme propensity scores create extreme weights
Estimates can be highly variable
Sensitive to propensity score misspecification
Solutions:
Trim observations with propensity scores near 0 or 1
Truncate weights at some maximum
Use stabilized weights: P(D)/e^(X) instead of 1/e^(X)
Use doubly robust methods (Section 11.3)
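Putting these pieces together, a minimal sketch of an IPW estimate of the ATE with trimming, on simulated data with illustrative names:

```r
# Sketch: IPW estimate of the ATE with trimming of extreme propensity scores
set.seed(7)
n  <- 2000
x1 <- rnorm(n); x2 <- rbinom(n, 1, 0.5)
d  <- rbinom(n, 1, plogis(-1 + x1 + 0.7 * x2))
y  <- 1 + 2 * d + x1 - 0.5 * x2 + rnorm(n)
ps <- fitted(glm(d ~ x1 + x2, family = binomial))

keep <- ps > 0.1 & ps < 0.9                  # trim extreme propensity scores
w    <- d / ps + (1 - d) / (1 - ps)          # ATE weights

# Horvitz-Thompson style IPW estimate on the trimmed sample
mean((d * y / ps)[keep]) - mean(((1 - d) * y / (1 - ps))[keep])

# A Hajek (normalized-weights) version, usually more stable:
weighted.mean(y[d == 1 & keep], w[d == 1 & keep]) -
  weighted.mean(y[d == 0 & keep], w[d == 0 & keep])
```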
Subclassification (Stratification)
Subclassification groups units into strata by propensity score, then estimates effects within strata:
Estimate propensity scores
Divide into K strata (often 5-10 quantiles)
Estimate treatment effect within each stratum
Average across strata, weighting by stratum size
This is less sensitive to extreme weights than IPW but imposes discretization.
Covariate Balance Diagnostics
The key diagnostic: Check whether adjustment achieves covariate balance between treated and control.
Before adjustment, treated and control groups typically differ on X. After matching or weighting, they should not.
Standardized differences: dj = (X̄j1 − X̄j0) / √((s²j1 + s²j0) / 2)
Rules of thumb:
∣d∣<0.1: Good balance
∣d∣<0.25: Acceptable
∣d∣>0.25: Poor balance; investigate
Variance ratios: Check that variances are similar after matching.
Visual inspection: Plot covariate distributions before and after matching.
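A minimal sketch of computing standardized differences by hand (packages such as cobalt automate these diagnostics); variable names are illustrative:

```r
# Sketch: standardized mean differences before and after weighting
# x is a covariate, d the treatment indicator, w optional adjustment weights
std_diff <- function(x, d, w = rep(1, length(x))) {
  m1 <- weighted.mean(x[d == 1], w[d == 1])
  m0 <- weighted.mean(x[d == 0], w[d == 0])
  s1 <- var(x[d == 1]); s0 <- var(x[d == 0])  # unadjusted variances, per convention
  (m1 - m0) / sqrt((s1 + s0) / 2)
}

# Example usage with the simulated data and IPW weights from the sketches above:
# std_diff(x1, d)       # before adjustment
# std_diff(x1, d, w)    # after weighting -- aim for |d| < 0.1
```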
Practical Box: Balance Assessment Checklist

Box: Assessing Overlap with Propensity Score Distributions
Balance diagnostics check whether covariate means are similar after adjustment. But a more fundamental question is whether treated and control groups share common support—whether there are comparable units in both groups across the range of propensity scores.
The most informative diagnostic is a histogram (or density plot) of propensity scores by treatment status:
What to look for:
Good overlap: Distributions largely coincide; most treated units have propensity scores where control units also exist (and vice versa)
Poor overlap: Distributions are separated; treated units cluster at high propensity scores with few controls nearby, or vice versa
Log-odds scale: For clearer visualization, especially when propensity scores cluster near 0 or 1, plot the log-odds: log(e^/(1−e^)). A log-odds of −3 corresponds to roughly p=0.05; a log-odds of 3 corresponds to roughly p=0.95.
When overlap is poor:
Estimates rely on extrapolation, not comparable units
Different estimators can yield wildly different results
Trimming (removing units with extreme propensity scores) often improves robustness more than choosing a fancier estimator
Imbens & Xu (2025) demonstrate this vividly: with the LaLonde data, severe overlap problems cause estimates to vary from −$8,000 to +$2,000 depending on method. After trimming to ensure overlap, all modern estimators converge to similar values. The lesson: ensuring overlap is often more important than estimator choice.
Trimming strategies:
Drop treated units with propensity scores below the minimum in the control group (Dehejia & Wahba 1999)
Drop units with propensity scores outside [0.1, 0.9] (Crump et al. 2009)
Use matching to construct a sample with good overlap by design
The loss of precision from trimming is typically modest; the gain in robustness can be substantial.
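A minimal sketch of an overlap plot in base R, reusing the kind of simulated data and fitted propensity scores from the earlier sketches:

```r
# Sketch: inspecting overlap by plotting propensity scores by treatment group
set.seed(7)
n  <- 2000
x1 <- rnorm(n); x2 <- rbinom(n, 1, 0.5)
d  <- rbinom(n, 1, plogis(-1 + x1 + 0.7 * x2))
ps <- fitted(glm(d ~ x1 + x2, family = binomial))

hist(ps[d == 1], breaks = 30, col = rgb(1, 0, 0, 0.4),
     xlim = c(0, 1), main = "Propensity score overlap", xlab = "e_hat(X)")
hist(ps[d == 0], breaks = 30, col = rgb(0, 0, 1, 0.4), add = TRUE)
legend("topright", legend = c("Treated", "Control"),
       fill = c(rgb(1, 0, 0, 0.4), rgb(0, 0, 1, 0.4)))

# The log-odds scale is often clearer when scores pile up near 0 or 1
log_odds <- qlogis(ps)
```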

11.3 Doubly Robust Estimation
The Problem with Single Methods
Both outcome regression and propensity score methods have limitations:
Outcome regression: Sensitive to misspecification of E[Y∣D,X]
IPW: Sensitive to misspecification of e(X), extreme weights
What if we could combine them to be robust to misspecification of either (but not both)? This insight, developed by James Robins, Andrea Rotnitzky, and colleagues in biostatistics during the 1990s, led to doubly robust methods—now standard in both epidemiology and economics.
Augmented Inverse Probability Weighting (AIPW)
The doubly robust or AIPW estimator combines outcome modeling and propensity scores:
AIPW Estimator
τ^AIPW = (1/n) ∑i [ μ^1(Xi) − μ^0(Xi) + Di(Yi − μ^1(Xi)) / e^(Xi) − (1−Di)(Yi − μ^0(Xi)) / (1 − e^(Xi)) ]
where μ^d(X)=E[Y∣D=d,X] are outcome models.
Intuition: Start with the outcome model prediction. Then correct for any remaining imbalance using IPW applied to the residuals.
Double Robustness Property
Theorem 11.2: Double Robustness
The AIPW estimator is consistent if either:
The outcome model μ^d(X) is correctly specified, OR
The propensity score e^(X) is correctly specified
Only one needs to be correct, not both.
This provides insurance against model misspecification. If you're unsure about functional form, doubly robust estimators hedge your bets.
Implementation
In R, AIPW can be computed by hand from fitted propensity score and outcome models (see the sketch below), or with packages such as AIPW or DoubleML. In Stata, the teffects aipw command provides a parametric implementation.
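A minimal hand-rolled AIPW sketch on simulated data with illustrative names and parametric nuisance models; dedicated packages add refinements such as cross-fitting and more careful variance estimation.

```r
# Sketch: hand-rolled AIPW estimate of the ATE
set.seed(7)
n  <- 2000
x1 <- rnorm(n); x2 <- rbinom(n, 1, 0.5)
d  <- rbinom(n, 1, plogis(-1 + x1 + 0.7 * x2))
y  <- 1 + 2 * d + x1 - 0.5 * x2 + rnorm(n)
dat <- data.frame(y, d, x1, x2)

# Nuisance models: propensity score and outcome regressions by treatment arm
ps  <- fitted(glm(d ~ x1 + x2, family = binomial, data = dat))
mu1 <- predict(lm(y ~ x1 + x2, data = subset(dat, d == 1)), newdata = dat)
mu0 <- predict(lm(y ~ x1 + x2, data = subset(dat, d == 0)), newdata = dat)

# AIPW: outcome-model prediction plus an IPW correction of its residuals
psi <- (mu1 - mu0) +
  d * (y - mu1) / ps -
  (1 - d) * (y - mu0) / (1 - ps)

mean(psi)            # point estimate
sd(psi) / sqrt(n)    # approximate standard error from the influence-function form
```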
Targeted Learning and TMLE
Targeted Maximum Likelihood Estimation (TMLE) is a sophisticated doubly robust approach that:
Fits an initial outcome model
Updates it using propensity score information
Targets the specific estimand of interest
TMLE has better finite-sample properties than basic AIPW and integrates well with machine learning (Super Learner).
Cross-Fitting: Essential for Machine Learning
Box: The Cross-Fitting Requirement (Double/Debiased Machine Learning)
When using machine learning methods (random forests, LASSO, boosting) to estimate propensity scores or outcome models, a critical issue arises: overfitting bias.
The problem: If you estimate e^(X) or μ^(X) and compute treatment effects on the same data, ML's flexibility leads to overfitting. The nuisance function estimates are too tailored to the specific sample, and the resulting treatment effect estimate is biased—even with doubly robust methods.
The solution: Cross-fitting (Chernozhukov et al. 2018):
Split the sample into K folds (typically 5-10)
For each fold k:
Estimate nuisance functions (e^, μ^) on all data except fold k
Predict nuisance values for fold k using these out-of-fold estimates
Compute treatment effects using the cross-fitted predictions
Average across folds
This is the core of Double/Debiased Machine Learning (DML).
Why it works: By estimating nuisance functions on different data than where they're applied, we avoid overfitting. Small errors in nuisance estimation don't create first-order bias in the treatment effect (this is called "Neyman orthogonality").
Practical implication: Standard implementations like Stata's teffects use parametric models and don't require cross-fitting. But if you use flexible ML methods:
In R: Use DoubleML, grf, or AIPW with Super Learner (these handle cross-fitting automatically)
In Python: Use econml (EconML) or DoubleML
Never use sklearn to fit a random forest propensity score and then apply it to the same data for weighting
See Chapter 21 for detailed coverage of DML and its theoretical foundations.
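A minimal sketch of K-fold cross-fitted AIPW, written out by hand to make the logic explicit; in practice the packages above handle this automatically, and the parametric nuisance models here would be replaced by flexible ML learners.

```r
# Sketch: K-fold cross-fitted AIPW (simulated data; illustrative names)
set.seed(7)
n  <- 2000
x1 <- rnorm(n); x2 <- rbinom(n, 1, 0.5)
d  <- rbinom(n, 1, plogis(-1 + x1 + 0.7 * x2))
y  <- 1 + 2 * d + x1 - 0.5 * x2 + rnorm(n)
dat <- data.frame(y, d, x1, x2)

K    <- 5
fold <- sample(rep(1:K, length.out = n))
psi  <- numeric(n)

for (k in 1:K) {
  train <- dat[fold != k, ]
  test  <- dat[fold == k, ]
  # Nuisance functions are fit on the other folds (swap in ML learners here)
  ps  <- predict(glm(d ~ x1 + x2, family = binomial, data = train),
                 newdata = test, type = "response")
  mu1 <- predict(lm(y ~ x1 + x2, data = subset(train, d == 1)), newdata = test)
  mu0 <- predict(lm(y ~ x1 + x2, data = subset(train, d == 0)), newdata = test)
  psi[fold == k] <- (mu1 - mu0) +
    test$d * (test$y - mu1) / ps -
    (1 - test$d) * (test$y - mu0) / (1 - ps)
}

mean(psi)            # cross-fitted ATE estimate
sd(psi) / sqrt(n)    # approximate standard error
```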
11.4 Continuous and Multivalued Treatments
Beyond Binary Treatment
Many treatments are not binary:
Years of education (continuous)
Drug dosage (continuous)
Type of program (multivalued)
The propensity score framework extends to these cases.
Generalized Propensity Score
For continuous treatment D, the generalized propensity score is:
r(d,X)=fD∣X(d∣X)
The conditional density of treatment given covariates.
Under weak unconfoundedness: Y(d)⊥D∣r(d,X)
Dose-Response Estimation
The dose-response function maps treatment intensity to expected outcome:
μ(d)=E[Y(d)]
Estimation approaches:
Stratify on GPS: Group observations by GPS value, estimate E[Y∣D=d] within groups
Inverse probability weighting: Weight by inverse of GPS (with stabilization)
Outcome modeling: Specify E[Y∣D,X] and average over covariate distribution
Implementation Note: IPW for Continuous Treatments
IPW for continuous treatments differs fundamentally from the binary case. With binary treatment, weights are based on propensity scores (probabilities). With continuous treatment, weights are based on probability density functions.
The stabilized weight for continuous treatment is: wi = fD(Di) / fD∣X(Di∣Xi)
where the numerator is the marginal density of treatment and the denominator is the conditional density given confounders. In practice, both are often assumed normal:
Numerator: Di∼N(Dˉ,σD2)
Denominator: Di∣Xi∼N(D^i,σε2) from a regression of D on X
Unlike binary IPW where extreme propensity scores (near 0 or 1) create problems, continuous IPW can produce extreme weights when an observation's treatment value is unlikely given their covariates. Stabilization and trimming remain important.
In R, the WeightIt package handles continuous treatments with method = "ps" and an appropriate family specification. See Chapter 18 for implementation and Hirano & Imbens (2004) for the theoretical foundations.
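A minimal sketch of the normal-density stabilized weights described above, on simulated data with illustrative names; the final weighted regression of the outcome on the dose is one simple way to summarize the marginal dose-response.

```r
# Sketch: stabilized weights for a continuous treatment using normal densities
set.seed(7)
n <- 2000
x <- rnorm(n)
d <- 0.5 * x + rnorm(n)            # continuous "dose", confounded by x
y <- 1 + 0.8 * d + x + rnorm(n)

# Denominator: conditional density f(D | X) from a regression of D on X
den_fit <- lm(d ~ x)
den <- dnorm(d, mean = fitted(den_fit), sd = sigma(den_fit))

# Numerator: marginal density f(D)
num <- dnorm(d, mean = mean(d), sd = sd(d))

w <- num / den                     # stabilized weights
w <- pmin(w, quantile(w, 0.99))    # trim extreme weights

# Weighted regression of the outcome on the dose summarizes the dose-response
coef(lm(y ~ d, weights = w))
```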
Multivalued Treatments
With J treatment levels, estimate separate propensity scores:
ej(X)=P(D=j∣X)
Then use multinomial IPW or matching within propensity strata.
11.5 Sensitivity Analysis
The Core Problem
Unconfoundedness cannot be tested. No matter how many covariates we include, there may be unobserved factors that bias our estimates.
Sensitivity analysis asks: How much unobserved confounding would be needed to change our conclusions?
Rosenbaum Bounds
Rosenbaum's approach asks: If unobserved confounding exists, how large would the bias need to be to explain away the treatment effect?
Define Γ as the maximum ratio of treatment odds for two units with the same observed covariates:
1/Γ ≤ [P(D=1∣X,U) / P(D=0∣X,U)] / [P(D=1∣X,U′) / P(D=0∣X,U′)] ≤ Γ
Under no unmeasured confounding, Γ=1. Larger Γ represents more potential confounding.
For each value of Γ, compute bounds on the p-value. Find the minimum Γ at which significance is lost. This is the study's sensitivity.
Example: Interpreting Rosenbaum Bounds
Suppose an effect is significant at Γ=1 (no confounding) and remains significant up to Γ=2. This means an unobserved confounder would need to double the odds of treatment to explain away the finding—a substantial amount of confounding.
Oster's Method: Coefficient Stability
Oster (2019) extends Altonji, Elder & Taber (2005) to assess how coefficient estimates change as controls are added:
Estimate treatment effect without controls: τ~
Estimate with observed controls: τ^
Calculate how much τ would change if unobservables were as important as observables
The relative degree of selection δ measures how much selection on unobservables would need to exceed selection on observables to produce a zero effect:
δ ≈ [(τ^ − 0)(R² − R~²)] / [(τ~ − τ^)(Rmax − R²)]
where R~² and R² are the R² values from the uncontrolled and controlled regressions, and Rmax is a hypothesized maximum R² from a regression that also included the unobservables.
If ∣δ∣>1, unobservables would need to be more important than observables to eliminate the effect—providing some reassurance.
E-Values: The Intuitive Sensitivity Measure
The E-value (VanderWeele & Ding 2017) has become the preferred sensitivity measure because of its intuitive interpretation. It asks: What is the minimum strength of association between an unmeasured confounder and both treatment and outcome needed to explain away the observed effect?
E-value = RR + √(RR × (RR − 1))
where RR is the observed risk ratio (or an approximation for other effect measures).
How to Interpret E-Values
An E-value of 3 means: An unmeasured confounder would need to be associated with both treatment and outcome by a risk ratio of at least 3 to fully explain away the observed effect.
1.0: No confounding needed (null effect)
1.5: Modest confounding could explain the result
2.0: Moderate confounding required
3.0+: Strong confounding required
5.0+: Very robust to confounding
Why E-values are intuitive: Unlike Rosenbaum's Γ (which measures odds ratios for treatment assignment), the E-value is directly comparable to known risk ratios. If your strongest observed confounder has a risk ratio of 2 with both treatment and outcome, and your E-value is 4, an unobserved confounder would need to be twice as strong to explain away your result.
Computing E-values in practice:
E-value for the confidence interval: Report E-values for both the point estimate and the confidence interval bound closest to the null. If even the CI bound has a high E-value, the finding is robust.
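A minimal sketch of the E-value calculation from a risk ratio (the EValue R package provides a fuller implementation):

```r
# Sketch: E-value from an observed risk ratio
# For protective effects (RR < 1), invert first, per VanderWeele & Ding (2017)
e_value <- function(rr) {
  rr <- ifelse(rr < 1, 1 / rr, rr)
  rr + sqrt(rr * (rr - 1))
}

e_value(2.0)   # E-value for a point estimate of RR = 2.0 (about 3.41)
e_value(1.3)   # E-value for the CI bound closest to the null
```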
Benchmarking
Compare sensitivity parameters to observed confounders:
How strongly are observed confounders related to treatment and outcome?
Would an unobserved confounder need to be stronger than any observed confounder?
Are there plausible candidates for such strong confounders?
Practical Box: Sensitivity Analysis Reporting
Report:
Point estimate and confidence interval under unconfoundedness
Rosenbaum Γ at which significance is lost
Oster's δ (selection ratio)
E-value for the point estimate and confidence interval
Comparison to strength of observed confounders
Discussion of plausible unobserved confounders
11.6 When Is Conditional Ignorability Credible?
The Fundamental Question
No statistical test can verify unconfoundedness. Its credibility rests on substantive arguments about the selection process.
Key question: Given what we know about how treatment is assigned, is it plausible that all relevant factors are observed?
Institutional Knowledge
The strongest case for unconfoundedness comes from understanding the selection mechanism:
How are treatments assigned?
What information is available to decision-makers?
What factors plausibly influence selection?
If the selection process is well-understood and observed in the data, unconfoundedness is more credible.
Box: Target Trial Emulation
Epidemiologists have developed a useful framework for designing observational studies: target trial emulation (Hernán & Robins 2016).
The idea: Before analyzing observational data, specify the randomized trial you would ideally run. What would be the eligibility criteria? Treatment assignment? Follow-up period? Primary outcome? Then ask: How well can your observational study emulate this trial?
This discipline forces clarity about:
Time zero: When does follow-up begin? (Avoids immortal time bias)
Eligibility: Who is in the study population?
Treatment definition: What exactly is being compared?
Assignment mechanism: What determines who gets treated?
When the observational study cannot emulate the target trial—because of confounding, selection, or measurement issues—at least the limitations are explicit.
This framework is equally valuable in economics. Before running regressions, specify the experiment you wish you could run. Then assess how well your observational design approximates it.
Example: Hospital Choice
Studying the effect of hospital quality on outcomes, selection depends on:
Patient preference and information
Distance and transportation
Referral patterns
Emergency vs. planned admission
If we observe these factors, we might argue for conditional ignorability. But patients may choose based on unobserved health factors or private information—undermining the assumption.
Data Quality
Better data makes unconfoundedness more plausible:
Administrative data: Often captures the actual selection process
Rich longitudinal data: Pre-treatment outcomes may proxy for unobserved factors
Multiple measures: Different proxies for the same construct reduce measurement error
The Selection Story
A credible analysis tells a clear story about selection:
Who gets treated and why? Describe the selection process.
What do we observe? List the covariates and why they matter.
What might we miss? Acknowledge potential unobserved confounders.
How sensitive are results? Show sensitivity analysis.
Validation Through Placebo Tests
While unconfoundedness cannot be directly tested, placebo tests offer indirect evidence about its plausibility.
The logic: Estimate a treatment effect on an outcome that should not be affected by treatment. If you find an effect, something is wrong—likely unobserved confounding.
Common placebo outcomes:
Lagged outcomes: Use a pre-treatment measure of the outcome. The treatment cannot have caused changes in the past.
Predetermined covariates: Variables fixed before treatment should show no "effect."
Conceptually unrelated outcomes: Outcomes with no plausible causal link to treatment.
Example: LaLonde Placebo Test
LaLonde (1986) and subsequent analyses used 1975 earnings as a placebo outcome for a job training program that occurred afterward. The true treatment effect on 1975 earnings is zero by construction.
Using experimental data: placebo estimates are close to zero (as expected).
Using nonexperimental data: placebo estimates are large and negative, even with modern methods—suggesting that selection into treatment is correlated with earnings trajectories in ways the observed covariates do not capture.
Interpreting placebo tests:
Null placebo effect: Consistent with (but does not prove) unconfoundedness. The covariates may capture selection.
Non-null placebo effect: Strong evidence against unconfoundedness. Either important confounders are missing, or the confounders affect the outcome differently in placebo vs. actual periods.
Multiple pretreatment periods strengthen placebo tests: When several lagged outcomes are available, testing for "effects" on each provides more convincing evidence. Imbens, Rubin & Sacerdote (2001), studying lottery winners, used six years of pre-winning earnings as placebos—their null effects across all years strongly supported unconfoundedness.
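A minimal simulated sketch of the placebo logic: when an unobserved confounder drives selection, the pre-treatment outcome shows a spurious "effect" even after adjusting for the observed covariate.

```r
# Sketch: placebo test on a pre-treatment outcome (simulated data; the same
# logic applies to, e.g., 1975 earnings in the LaLonde samples)
set.seed(7)
n     <- 2000
x     <- rnorm(n)
u     <- rnorm(n)                          # unobserved confounder
d     <- rbinom(n, 1, plogis(-0.5 + x + u))
y_pre <- 1 + x + u + rnorm(n)              # measured before treatment
y     <- 1 + 2 * d + x + u + rnorm(n)      # post-treatment outcome

# Placebo regression: treatment cannot have caused y_pre, so a nonzero
# "effect" here signals confounding that x does not capture
coef(summary(lm(y_pre ~ d + x)))["d", ]
```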
Practical Guidance: Placebo Test Checklist
Warning Signs
Be skeptical of unconfoundedness when:
Treatment depends on private information (patient symptoms, student motivation)
Selection is based on anticipated outcomes (Roy model selection)
Strong economic incentives drive selection
Similar studies with better identification yield different results
Sensitivity analysis shows results are fragile
Box: The LaLonde Benchmark—What 40 Years Taught Us
In 1986, Robert LaLonde asked a simple question: Can nonexperimental methods replicate the results of a randomized experiment? Using data from the National Supported Work demonstration—a job training program with a randomized control group—he compared experimental estimates to those from matching trainees to observational comparison groups.
LaLonde's original conclusion was stark: Nonexperimental methods failed. Estimates varied wildly across specifications and often had the wrong sign. His paper became a foundational critique of observational methods and helped spark the credibility revolution.
What we've learned since (Imbens & Xu 2025):
Overlap matters enormously. LaLonde's comparison groups differed dramatically from trainees—average 1975 earnings of $14,000–$19,000 versus $3,000 for trainees. With such poor overlap, estimates relied on extrapolation.
With overlap ensured, estimator choice matters less. Modern methods (matching, IPW, AIPW, causal forests) yield similar estimates once samples are trimmed to ensure comparable units exist in both groups.
But similar estimates don't guarantee causal validity. The LDW-CPS sample produces ATT estimates close to the experimental benchmark using various modern methods. Success? Not quite: placebo tests using 1975 earnings (before treatment) fail badly, suggesting unconfoundedness does not hold.
Recovering average effects is easier than heterogeneous effects. Even when ATT estimates match experimental benchmarks, conditional average treatment effects (CATTs) diverge substantially between experimental and nonexperimental analyses.
The nuanced answer to LaLonde's question: Sometimes nonexperimental methods can replicate experimental benchmarks—and we now have better tools to assess when. The key is not the estimation method but whether (a) overlap exists and (b) unconfoundedness is plausible. Placebo tests and sensitivity analyses help evaluate the latter.
The sobering implication: Methods that robustly estimate the statistical estimand (covariate-adjusted treatment-control difference) may still fail to recover the causal estimand if unobserved confounding exists. No amount of methodological sophistication substitutes for a credible design.
Comparison to Other Methods
Selection on observables is one tool among many. Consider:
A natural experiment exists: IV, DiD, or RD
Panel data available: fixed effects to control time-invariant confounders
Selection on unobservables likely: IV or bounds
Key confounder is unobserved but can be proxied: proxy variable methods
Running Example Connection: China's Growth
Selection on observables (SOO) is fundamentally a micro-level strategy. It assumes we can identify and measure all relevant confounders—plausible when studying individual choices where selection depends on observable characteristics. But for macro questions like what caused China's post-1978 growth, the approach breaks down entirely. We cannot list all the confounders affecting both policy choices (market liberalization, trade openness, institutional reform) and outcomes (GDP growth). Even if we could, we would have only one China to observe. Selection on observables requires comparing treated and control units with similar characteristics—but there is no control China. This is why macro questions require the triangulation of methods discussed in Chapters 1 and 23.
Practical Guidance
Method Selection
Unconfoundedness plausible, good overlap: any method; AIPW preferred
Concern about functional form: matching or AIPW with flexible models
Extreme propensity scores: trim or match; avoid pure IPW
Multiple confounders, continuous treatment: GPS methods
Concern about unobserved confounding: SOO with extensive sensitivity analysis
Strong concern about unobserved confounding: consider a different identification strategy
Common Pitfalls
Pitfall 1: Controlling for Post-Treatment Variables
Including variables affected by treatment biases estimates—often severely.
How to avoid: Map out the causal structure. Only control for pre-treatment confounders.
Pitfall 2: Ignoring the Overlap Assumption
If propensity scores are near 0 or 1, there is no comparable counterfactual. Estimates rely on extrapolation. The LaLonde data illustrate this dramatically: without ensuring overlap, ATT estimates range from −$8,000 to +$2,000 depending on the estimator used.
How to avoid: Plot propensity score distributions by treatment status. Trim or match on propensity scores to ensure common support. Report overlap diagnostics. With good overlap, estimates become much more stable across methods—often more stable than any other specification choice.
Pitfall 3: Claiming Causation Without Defending Unconfoundedness
Adding controls does not automatically justify causal interpretation.
How to avoid: Explicitly state and defend the unconfoundedness assumption. Conduct sensitivity analysis.
Pitfall 4: Over-Relying on Propensity Score Prediction
A good propensity score model predicts treatment. But prediction quality is not the goal—balance is.
How to avoid: Focus on covariate balance, not prediction accuracy. Use balance diagnostics.
Pitfall 5: The Table 2 Fallacy
When you estimate Y=α+τD+βX+ε and report both τ^ and β^, it's tempting to interpret β^ causally too. This is almost always wrong.
The regression is designed to identify the effect of D on Y by controlling for X. But the coefficient on X is not identified for a causal interpretation—you haven't controlled for the confounders of the X→Y relationship.
Example: Regressing wages on education and parental income, parental income "controls for" family background when estimating returns to education. But the coefficient on parental income does NOT estimate the causal effect of parental income on wages—that would require controlling for its confounders (parental education, geography, genetics, etc.).
The pattern: Papers often present regression results with one causally-interpreted coefficient (the treatment) and many descriptive coefficients (the controls). Readers—and sometimes authors—interpret all coefficients causally. This is the "Table 2 Fallacy" (Westreich and Greenland 2013).
How to avoid: Only interpret causally the coefficient you've designed the study to identify. Label control variable coefficients as "adjustment factors" or "associations," not effects.
Implementation Checklist
Design:
Estimation:
Inference and Robustness:
Reporting:
Qualitative Bridge
What Quantitative Balance Cannot Capture
Matching and weighting achieve balance on observed covariates. But:
Do these covariates capture what matters for selection?
Are the measures valid and reliable?
What unobserved factors might remain?
Qualitative research can help answer these questions.
Using Qualitative Knowledge for Selection Stories
Interviews with decision-makers: How do they assign treatment? What factors do they consider?
Case studies: Detailed examination of selection in specific instances.
Expert knowledge: Domain specialists often know the selection process better than data reveal.
Example: Program Evaluation
Evaluating a job training program using observational data:
Quantitative: Match participants to non-participants on demographics, prior employment, education.
Qualitative: Interview case workers who refer clients. Learn that they refer the "most motivated" clients—an unobserved confounder.
This qualitative insight suggests the matching estimate is upward biased and motivates searching for better identification.
Integration Note
Connections to Other Chapters
Ch. 9 (Causal Framework): formalizes the unconfoundedness assumption
Ch. 10 (Experiments): experiments eliminate the selection bias that SOO must assume away
Ch. 12 (IV): IV handles selection on unobservables when SOO fails
Ch. 20 (Heterogeneity): propensity scores for heterogeneous effects (CATE)
Ch. 21 (ML for Causal): machine learning for flexible propensity/outcome models
When to Use Selection on Observables
SOO is appropriate when:
No better identification strategy is available
Rich data capture the selection process
Institutional knowledge supports unconfoundedness
Sensitivity analysis shows robustness
SOO is inappropriate when:
Strong unobserved confounders are likely
A cleaner identification strategy exists
Results are sensitive to plausible confounding
The selection process is poorly understood
Summary
Key takeaways:
Selection on observables assumes all confounders are observed: This allows conditioning on covariates to identify causal effects, but the assumption is untestable.
Multiple methods implement the assumption: Regression, matching, propensity score weighting, and doubly robust estimation all rely on the same identifying assumption but differ in how they adjust for confounders.
Doubly robust methods provide insurance: AIPW is consistent if either the outcome model or propensity score is correct—a valuable hedge against misspecification.
Overlap is as important as balance: Before worrying about covariate balance, verify that treated and control groups have common support in their propensity score distributions. With poor overlap, estimates depend on extrapolation and can vary wildly across methods. Trimming to ensure overlap often matters more than estimator choice.
Placebo tests probe unconfoundedness: Since unconfoundedness cannot be tested directly, estimate "effects" on outcomes that should be unaffected (pre-treatment measures, conceptually unrelated variables). Failed placebo tests are strong evidence against unconfoundedness; passed tests increase credibility.
Sensitivity analysis is essential: We must assess how much unobserved confounding would be needed to overturn findings. Rosenbaum bounds, E-values, and Oster's method provide complementary approaches.
Credibility requires substantive defense: Statistical methods cannot establish unconfoundedness. It must be defended with institutional knowledge and careful reasoning about selection.
Returning to the opening question: We can credibly estimate causal effects by controlling for confounders when we have good reason to believe all relevant confounders are observed and controlled for. "Good reason" comes not from statistical tests but from understanding the selection process, having rich data, and demonstrating robustness to potential unobserved confounding. When these conditions are met, selection on observables methods are powerful tools. When they are not, we should seek better identification strategies.
Further Reading
Essential
Rosenbaum & Rubin (1983). "The Central Role of the Propensity Score in Observational Studies for Causal Effects." Biometrika. The foundational paper.
Imbens (2004). "Nonparametric Estimation of Average Treatment Effects Under Exogeneity: A Review." RESTAT.
For Deeper Understanding
Imbens & Rubin (2015). Causal Inference for Statistics, Social, and Biomedical Sciences, Chapters 12-18. Comprehensive textbook treatment.
Rosenbaum (2002). Observational Studies, 2nd ed. Deep dive into matching and sensitivity analysis.
The Epidemiological Perspective
Hernán & Robins (2020). Causal Inference: What If. Freely available online. The definitive treatment of g-methods.
Hernán & Robins (2016). "Using Big Data to Emulate a Target Trial." AJE. Introduces target trial emulation.
Robins (1986). "A New Approach to Causal Inference in Mortality Studies." Mathematical Modelling. The original g-computation paper.
Doubly Robust Methods
Bang & Robins (2005). "Doubly Robust Estimation in Missing Data and Causal Inference Models." Biometrics.
Kennedy (2016). "Semiparametric Theory and Empirical Processes in Causal Inference." Survey of modern methods.
Model Interpretation and Marginal Effects
Arel-Bundock, Greifer & Heiss (2024). "How to Interpret Statistical Models Using marginaleffects in R and Python." Journal of Statistical Software 111(9), 1–32. Clarifies the confusing terminology around marginal effects and provides a unified computational framework.
Long (1997). Regression Models for Categorical and Limited Dependent Variables. Classic treatment of interpreting nonlinear models.
Sensitivity Analysis
Rosenbaum (2002). Observational Studies, Chapters 4-5. Rosenbaum bounds.
Oster (2019). "Unobservable Selection and Coefficient Stability." JBES.
VanderWeele & Ding (2017). "Sensitivity Analysis in Observational Research." Annals of Internal Medicine. E-values.
Continuous and Multivalued Treatments
Hirano & Imbens (2004). "The Propensity Score with Continuous Treatments." In Applied Bayesian Modeling and Causal Inference from Incomplete-Data Perspectives. The foundational paper on generalized propensity scores.
Imbens (2000). "The Role of the Propensity Score in Estimating Dose-Response Functions." Biometrika. Early treatment of dose-response with GPS.
The LaLonde Literature
Imbens & Xu (2025). "Comparing Experimental and Nonexperimental Methods: What Lessons Have We Learned Four Decades after LaLonde (1986)?" Journal of Economic Perspectives 39(4): 173–202. Essential retrospective with replication data and tutorial at https://yiqingxu.org/tutorials/lalonde/.
LaLonde (1986). "Evaluating the Econometric Evaluations of Training Programs with Experimental Data." AER. The original benchmark study.
Dehejia & Wahba (1999). "Causal Effects in Nonexperimental Studies." JASA. Classic matching application using LaLonde data.
Smith & Todd (2005). "Does Matching Overcome LaLonde's Critique of Nonexperimental Estimators?" Journal of Econometrics. Important follow-up showing sensitivity to comparison group choice.
Other Applications
Heckman, Ichimura & Todd (1997). "Matching as an Econometric Evaluation Estimator." RES.
Imbens, Rubin & Sacerdote (2001). "Estimating the Effect of Unearned Income on Labor Earnings." AER. Lottery study with strong placebo test support for unconfoundedness.
Exercises
Conceptual
Explain why controlling for a collider can introduce bias even when the collider is correlated with both treatment and outcome. Give an example.
A researcher estimates the effect of college quality on earnings by matching students on SAT scores and family income. A colleague suggests also matching on post-college occupation.
Is occupation a good control?
Draw a DAG illustrating the issue.
How would including occupation affect the estimate?
Applied
Using observational data on a job training program (e.g., Dehejia-Wahba data):
Estimate the ATT using (a) OLS, (b) nearest-neighbor matching, (c) IPW, and (d) AIPW
Assess covariate balance before and after matching
How much do the estimates differ? Why?
Conduct a sensitivity analysis for your preferred estimate from Exercise 3:
Calculate Rosenbaum's Γ at which significance is lost
Calculate the E-value
Discuss: How much unobserved confounding would be needed to overturn the finding?
Discussion
"Selection on observables is always inferior to quasi-experimental methods because the key assumption cannot be tested." Evaluate this claim. Under what circumstances might SOO be preferable to (or at least as credible as) IV or DiD?
Data Exercise
LaLonde replication (data available at https://yiqingxu.org/tutorials/lalonde/):
Using the LDW-CPS data, estimate the ATT using regression, nearest-neighbor matching, IPW, and AIPW
Plot propensity score distributions for treated and control groups. Is there adequate overlap?
Trim the sample to improve overlap (e.g., remove units with propensity scores outside [0.1, 0.9]). How do estimates change? Do they converge across methods?
Conduct a placebo test using 1975 earnings as the outcome (excluding 1975 earnings from conditioning variables). What do you find?
Discuss: Do your results support causal interpretation of the ATT estimate? What does this exercise teach about the relationship between statistical and causal estimands?
Technical Appendix
A.1 Propensity Score Weighting Derivation
Under unconfoundedness and overlap:
E[DY / e(X)] = E[ E[DY / e(X) ∣ X] ] = E[ E[DY ∣ X] / e(X) ]
Since E[DY ∣ X] = E[Y ∣ D=1, X] ⋅ P(D=1 ∣ X) = E[Y ∣ D=1, X] ⋅ e(X):
E[DY / e(X)] = E[E[Y ∣ D=1, X]] = E[E[Y(1) ∣ X]] = E[Y(1)]
Similarly, E[(1−D)Y / (1 − e(X))] = E[Y(0)].
A.2 Double Robustness Proof (Sketch)
The AIPW estimator can be written as:
τ^AIPW = (1/n) ∑i ϕ^(Zi)
where ϕ^ is the estimated influence function. Under regularity conditions:
If the propensity score is correct:
The IPW term consistently estimates the bias in the outcome model
The combination is consistent
If the outcome model is correct:
The outcome model term is consistent
The IPW correction term has expectation zero
The combination is consistent
The key is that the cross-term has expectation zero when either model is correct.
A.3 Overlap and Positivity
The overlap or positivity assumption requires:
0<e(X)<1 for all X in the support
Without overlap:
Some treated units have no comparable controls (or vice versa)
IPW weights become infinite
Estimates rely on extrapolation, not data
Practical violations: Near-violations where e(X)≈0 or ≈1 also cause problems through extreme weights and high variance.