Chapter 18: Programming Companion—Causal Inference
Opening Question
How do we implement the identification strategies from Part III in practical code?
Chapter Overview
This chapter provides practical implementations of the causal inference methods covered in Chapters 9-17. We cover experimental analysis, matching and weighting, instrumental variables, difference-in-differences, regression discontinuity, synthetic control, and time series causal methods.
The emphasis is on commonly used packages and workflows. For each method, we show basic implementation, diagnostics, and visualization in both R and Python where mature packages exist.
What you will learn:
How to analyze experimental data with proper randomization inference
How to implement matching and propensity score methods
How to estimate IV, DiD, RD, and synthetic control designs
How to create publication-quality tables and diagnostic plots
Prerequisites: Chapters 9-17 (conceptual foundations), Chapter 4 (programming basics)
18.1 Experimental Analysis
Power Analysis
R with pwr:
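A minimal sketch: solve for the per-group sample size needed to detect a hypothesized effect (the effect size d = 0.3 is illustrative, not from the text):

```r
library(pwr)

# Per-group n needed to detect d = 0.3 at 80% power, two-sided alpha = 0.05
pwr.t.test(d = 0.3, sig.level = 0.05, power = 0.80,
           type = "two.sample", alternative = "two.sided")

# Power achieved with a fixed n of 200 per group
pwr.t.test(n = 200, d = 0.3, sig.level = 0.05, type = "two.sample")
```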
Python with statsmodels:
Balance Testing
R:
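A minimal sketch, assuming a data frame df with a binary indicator treat and covariates x1, x2, x3 (all names hypothetical):

```r
library(cobalt)

# Standardized mean differences for each covariate
bal.tab(treat ~ x1 + x2 + x3, data = df, stats = "mean.diffs")

# Joint test: regress treatment on all covariates; an insignificant
# F-statistic is consistent with successful randomization
summary(lm(treat ~ x1 + x2 + x3, data = df))
```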
Python:
Treatment Effect Estimation
R:
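A minimal sketch with hypothetical names (outcome y, treatment treat, covariates x1 and x2), using estimatr for design-based standard errors:

```r
library(estimatr)

# Difference in means with HC2 robust standard errors
lm_robust(y ~ treat, data = df)

# Covariate adjustment with centered treatment-covariate interactions
# (the Lin 2013 estimator), implemented by lm_lin()
lm_lin(y ~ treat, covariates = ~ x1 + x2, data = df)
```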
Python:
Randomization Inference
R with ri2:
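A sketch under the assumption of complete random assignment of a binary treatment Z; the declaration must match how treatment was actually assigned:

```r
library(ri2)   # uses randomizr's declare_ra() for the assignment scheme

declaration <- declare_ra(N = nrow(df))   # complete random assignment

ri_out <- conduct_ri(
  Y ~ Z,
  declaration      = declaration,
  assignment       = "Z",
  sharp_hypothesis = 0,        # sharp null of no effect for any unit
  data             = df,
  sims             = 1000
)
summary(ri_out)
plot(ri_out)   # randomization distribution of the estimated ATE
```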
18.2 Matching and Weighting
Propensity Score Estimation
R:
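A minimal sketch: a logistic propensity model with hypothetical covariates, followed by an overlap check:

```r
ps_model  <- glm(treat ~ x1 + x2 + x3, data = df, family = binomial())
df$pscore <- predict(ps_model, type = "response")

# Overlap: propensity score distributions should share common support
hist(df$pscore[df$treat == 1], col = rgb(1, 0, 0, 0.4),
     main = "Propensity score overlap", xlab = "Propensity score")
hist(df$pscore[df$treat == 0], col = rgb(0, 0, 1, 0.4), add = TRUE)
```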
Python:
Matching with MatchIt
R:
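A sketch of 1:1 nearest-neighbor matching on the propensity score (variable names hypothetical):

```r
library(MatchIt)
library(estimatr)

m_out <- matchit(treat ~ x1 + x2 + x3, data = df,
                 method = "nearest", distance = "glm", ratio = 1)
summary(m_out)   # balance before and after matching

matched <- match.data(m_out)   # matched sample with a weights column
# ATT on the matched sample; the weights matter whenever ratio > 1
lm_robust(y ~ treat, data = matched, weights = weights)
```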
Inverse Probability Weighting
R with WeightIt:
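A sketch of ATE weights from a logistic propensity model (method = "glm" in recent WeightIt releases; older versions call it "ps"):

```r
library(WeightIt)
library(cobalt)
library(estimatr)

w_out <- weightit(treat ~ x1 + x2 + x3, data = df,
                  method = "glm", estimand = "ATE")
summary(w_out)   # weight range and effective sample size
bal.tab(w_out)   # covariate balance after weighting

df$w_ate <- w_out$weights
lm_robust(y ~ treat, data = df, weights = w_ate)
```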
Python:
IPW for Continuous Treatments
When treatment is continuous (dosage, duration, intensity), weights are based on probability density functions rather than propensity scores. See Chapter 11.4 for the conceptual foundation.
R with WeightIt:
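A sketch: when the treatment variable is numeric, weightit() switches to density-based weights for continuous treatments; stabilize = TRUE requests stabilized weights:

```r
library(WeightIt)

w_cont <- weightit(dose ~ x1 + x2 + x3, data = df,
                   method = "glm", stabilize = TRUE)
summary(w_cont)   # inspect the tails; see the diagnostic note below
```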
R manual implementation (to understand the mechanics):
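A manual sketch of stabilized weights, assuming a normal model for the treatment given covariates (names hypothetical); this roughly mirrors what the package does internally:

```r
# Denominator: conditional density of dose given covariates
denom_model <- lm(dose ~ x1 + x2 + x3, data = df)
# Numerator: marginal density of dose (stabilization)
num_model   <- lm(dose ~ 1, data = df)

dens_denom <- dnorm(df$dose, mean = fitted(denom_model), sd = sigma(denom_model))
dens_num   <- dnorm(df$dose, mean = fitted(num_model),   sd = sigma(num_model))

df$w <- dens_num / dens_denom

# Trim extreme weights at the 1st and 99th percentiles (see note below)
qs   <- quantile(df$w, c(0.01, 0.99))
df$w <- pmin(pmax(df$w, qs[1]), qs[2])

summary(df$w)
```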
Python:
Diagnostic Note: With continuous treatments, check that weights are not too extreme. Unlike binary IPW where weights blow up near propensity scores of 0 or 1, continuous IPW produces extreme weights when someone's treatment value is very unlikely given their covariates. Examine the weight distribution and consider trimming at the 1st and 99th percentiles.
Doubly Robust Estimation
R:
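A sketch using the AIPW package's R6 interface (argument names per its documentation; variable names hypothetical):

```r
library(AIPW)
library(SuperLearner)

aipw_fit <- AIPW$new(
  Y = df$y,
  A = df$treat,
  W = df[, c("x1", "x2", "x3")],
  Q.SL.library = c("SL.glm"),   # outcome model learners
  g.SL.library = c("SL.glm"),   # propensity model learners
  k_split = 5,                  # cross-fitting folds
  verbose = FALSE
)
aipw_fit$fit()$summary()   # ATE with influence-function-based SEs
```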
Model Interpretation with marginaleffects
When estimating treatment effects from nonlinear models (logit, probit, Poisson) or models with interactions, regression coefficients are not directly interpretable as causal effects. The marginaleffects package (Arel-Bundock, Greifer & Heiss 2024) provides a unified framework for computing predictions, comparisons, and slopes across 100+ model types—implementing the g-computation approach discussed in Chapter 11.
Core functions:
predictions(): Predicted outcomes at specific covariate values
comparisons(): Differences in predictions (contrasts, risk differences)
slopes(): Marginal effects (derivatives)
avg_*() versions: Averages across all observations (AME)
R:
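A minimal sketch: an average marginal effect (risk difference) from a logit model with an interaction, computed by g-computation:

```r
library(marginaleffects)

fit <- glm(y ~ treat * x1 + x2, data = df, family = binomial())

# ATE on the probability scale: average difference in predicted
# probabilities setting treat to 1 vs. 0 for every observation
avg_comparisons(fit, variables = "treat")

# Effects at specific covariate values
comparisons(fit, variables = "treat", newdata = datagrid(x1 = c(0, 1)))
```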
Python:
Connecting to causal estimands:

| Estimand | marginaleffects call |
| --- | --- |
| ATE (binary treatment) | avg_comparisons(model, variables = "treatment") |
| ATT | avg_comparisons(model, variables = "treatment", newdata = subset(data, treatment == 1)) |
| Dose-response | avg_comparisons(model, variables = list(dose = c(0, 1, 2, 3))) |
| CATE by subgroup | comparisons(model, variables = "treatment", by = "subgroup") |
| Marginal effect of continuous X | avg_slopes(model, variables = "X") |
Why this matters: When you run glm(y ~ treatment, family = binomial) and report the coefficient, you're reporting a log odds ratio—not the probability change that answers "what is the effect of treatment?" The average marginal effect from avg_comparisons() answers the question you actually care about.
Integration with causal inference packages:
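A sketch combining matching with g-computation, in the spirit of the MatchIt documentation's recommended workflow (objects continue from the matching section above; the wts and vcov arguments assume a recent marginaleffects version):

```r
library(MatchIt)
library(marginaleffects)

m_out   <- matchit(treat ~ x1 + x2 + x3, data = df, method = "nearest")
matched <- match.data(m_out)

# Flexible outcome model on the matched sample, with matching weights
fit <- lm(y ~ treat * (x1 + x2 + x3), data = matched, weights = weights)

# ATT: average comparison among the treated, clustering by matched pair
avg_comparisons(fit, variables = "treat",
                newdata = subset(matched, treat == 1),
                wts = "weights", vcov = ~ subclass)
```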
Practical Box: When to Use marginaleffects
Use marginaleffects when:
Your outcome model is nonlinear (logit, probit, Poisson, etc.)
You have interactions and want effects at specific covariate values
You want to report effects on the natural scale (probabilities, counts)
You need to combine it with matching/weighting
You don't need it when:
Your model is linear with no interactions (coefficients = marginal effects)
You're using packages that already compute ATEs (e.g., AIPW, did)
18.3 Instrumental Variables
Basic 2SLS
R with fixest:
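A sketch of fixest's three-part IV formula, controls | fixed effects | first stage (names hypothetical):

```r
library(fixest)

iv_fit <- feols(y ~ w1 + w2 | firm + year | x ~ z,
                data = df, cluster = ~ firm)
summary(iv_fit)
summary(iv_fit, stage = 1)   # first-stage estimates
```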
R with ivreg:
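The same model in ivreg's two-part formula syntax; note that exogenous controls appear on both sides of the bar:

```r
library(ivreg)

iv2 <- ivreg(y ~ x + w1 + w2 | z + w1 + w2, data = df)
summary(iv2)
```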
Python:
Weak Instrument Diagnostics
R:
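A sketch of the standard diagnostics, continuing the IV fits from above:

```r
library(fixest)

# First-stage F-statistic (and Wald version) from the fixest IV fit
fitstat(iv_fit, ~ ivf1 + ivwald1)

# ivreg reports weak instruments, Wu-Hausman, and Sargan tests
summary(iv2, diagnostics = TRUE)
```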
18.4 Difference-in-Differences
Basic Two-Period DiD
R:
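A minimal 2x2 sketch: the coefficient on the interaction is the DiD estimate (names hypothetical):

```r
library(fixest)

did_fit <- feols(y ~ treated * post, data = df, cluster = ~ id)
summary(did_fit)   # treated:post is the difference-in-differences
```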
Python:
Event Studies
R with fixest:
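A sketch, assuming time_to_treat counts periods relative to treatment and treat flags ever-treated units (so never-treated units serve as controls):

```r
library(fixest)

es_fit <- feols(
  y ~ i(time_to_treat, treat, ref = -1) |   # ref = -1 omits t-1
      id + year,
  data = df, cluster = ~ id
)
iplot(es_fit, main = "Event study", xlab = "Periods since treatment")
```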
Custom event study plot:
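A sketch of a hand-rolled ggplot version, pulling coefficients out of the fixest summary (the regular expression assumes the i() naming pattern from the model above):

```r
library(ggplot2)

ct      <- as.data.frame(summary(es_fit)$coeftable)
ct$term <- rownames(ct)
ct      <- ct[grepl("time_to_treat", ct$term), ]
ct$t    <- as.numeric(sub(".*::(-?[0-9]+).*", "\\1", ct$term))

ggplot(ct, aes(t, Estimate)) +
  geom_pointrange(aes(ymin = Estimate - 1.96 * `Std. Error`,
                      ymax = Estimate + 1.96 * `Std. Error`)) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  geom_vline(xintercept = -0.5, linetype = "dotted") +
  labs(x = "Periods since treatment", y = "Estimate")
```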
Modern DiD Estimators
Callaway-Sant'Anna with did package:
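A sketch following the did package's conventions: g records the period of first treatment, with 0 for never-treated units:

```r
library(did)

cs_out <- att_gt(
  yname  = "y",
  tname  = "year",
  idname = "id",
  gname  = "g",
  data   = df,
  control_group = "nevertreated"
)
es <- aggte(cs_out, type = "dynamic")   # event-study aggregation
ggdid(es)
summary(aggte(cs_out, type = "simple")) # overall ATT
```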
Sun-Abraham with fixest:
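A sketch; sunab() takes the treatment cohort and the calendar period (code never-treated units to a cohort value outside the sample period):

```r
library(fixest)

sa_fit <- feols(y ~ sunab(cohort, year) | id + year,
                data = df, cluster = ~ id)
iplot(sa_fit)                  # interaction-weighted event study
summary(sa_fit, agg = "att")   # aggregate ATT
```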
18.5 Regression Discontinuity
Sharp RD with rdrobust
R:
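A minimal sketch, assuming outcome y, running variable x, and a cutoff at zero:

```r
library(rdrobust)

rd_out <- rdrobust(y = df$y, x = df$x, c = 0)
summary(rd_out)   # conventional, bias-corrected, and robust estimates
```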
Visualization:
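A companion binned plot via rdplot():

```r
# Binned scatter with polynomial fits on each side of the cutoff
rdplot(y = df$y, x = df$x, c = 0,
       x.label = "Running variable", y.label = "Outcome")
```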
Python:
Manipulation Testing
R:
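A sketch of the McCrary-style density test implemented in rddensity:

```r
library(rddensity)

dens <- rddensity(X = df$x, c = 0)
summary(dens)                   # p-value for a density jump at the cutoff
rdplotdensity(dens, X = df$x)   # estimated density on each side
```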
Fuzzy RD
R:
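A sketch: pass realized treatment take-up through the fuzzy argument so rdrobust estimates the local 2SLS effect:

```r
rd_fuzzy <- rdrobust(y = df$y, x = df$x, c = 0, fuzzy = df$treat)
summary(rd_fuzzy)
```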
Covariate Balance at Cutoff
R:
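A sketch: rerun the RD with each pre-determined covariate as the outcome; none should jump at the cutoff:

```r
for (v in c("x1", "x2", "x3")) {
  cat("Covariate:", v, "\n")
  summary(rdrobust(y = df[[v]], x = df$x, c = 0))
}
```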
18.6 Synthetic Control
Basic Synthetic Control
R with Synth:
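An abbreviated sketch of the dataprep()/synth() workflow; the panel needs numeric unit IDs, a unit-name column, and a time variable (all identifiers and years below are hypothetical):

```r
library(Synth)

dp <- dataprep(
  foo                   = df,
  predictors            = c("x1", "x2"),
  dependent             = "y",
  unit.variable         = "unit_id",
  unit.names.variable   = "unit_name",
  time.variable         = "year",
  treatment.identifier  = 1,          # treated unit's ID
  controls.identifier   = 2:20,       # donor pool IDs
  time.predictors.prior = 1990:1999,  # pre-treatment window
  time.optimize.ssr     = 1990:1999,
  time.plot             = 1990:2010
)
synth_out <- synth(dp)

path.plot(synth.res = synth_out, dataprep.res = dp)  # treated vs. synthetic
gaps.plot(synth.res = synth_out, dataprep.res = dp)  # treated - synthetic
```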
Inference with Permutation
R:
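A sketch of in-space placebos, continuing the setup above: treat each donor as if it were treated, re-fit, and compare its gap to the actual treated unit's gap:

```r
placebo_gaps <- lapply(2:20, function(ctrl) {
  dp_p <- dataprep(
    foo = df, predictors = c("x1", "x2"), dependent = "y",
    unit.variable = "unit_id", unit.names.variable = "unit_name",
    time.variable = "year",
    treatment.identifier  = ctrl,
    controls.identifier   = setdiff(2:20, ctrl),  # exclude the placebo unit
    time.predictors.prior = 1990:1999,
    time.optimize.ssr     = 1990:1999,
    time.plot             = 1990:2010
  )
  s <- synth(dp_p)
  dp_p$Y1plot - dp_p$Y0plot %*% s$solution.w   # gap: actual - synthetic
})
# The treated unit's post-period gap should look extreme relative to these
```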
tidysynth Package
R with tidysynth (more user-friendly):
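A sketch of the pipe-based workflow (unit and time identifiers and the single predictor are hypothetical):

```r
library(tidysynth)
library(dplyr)

synth_out <- df %>%
  synthetic_control(outcome = y, unit = unit_name, time = year,
                    i_unit = "Treated Unit", i_time = 2000,
                    generate_placebos = TRUE) %>%
  generate_predictor(time_window = 1990:1999,
                     x1_mean = mean(x1, na.rm = TRUE)) %>%
  generate_weights(optimization_window = 1990:1999) %>%
  generate_control()

synth_out %>% plot_trends()
synth_out %>% plot_placebos()      # in-space placebo gaps
synth_out %>% grab_significance()  # pseudo p-values from the placebos
```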
Python Implementation
Python lacks a dedicated synthetic control package as mature as R's Synth, but implementation is straightforward using optimization:
Permutation Inference for Synthetic Control
Note: For production use, consider the SyntheticControlMethods package on PyPI or implement with cvxpy for more robust optimization. The above code illustrates the core algorithm.
18.7 Time Series Causal Methods
VAR and SVAR
R with vars:
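A sketch: lag selection, a reduced-form VAR, and orthogonalized impulse responses (the Cholesky ordering follows the column order; variable names hypothetical):

```r
library(vars)

var_data <- df[, c("gdp_growth", "inflation", "interest_rate")]

lag_sel <- VARselect(var_data, lag.max = 8, type = "const")
var_fit <- VAR(var_data, p = lag_sel$selection["AIC(n)"], type = "const")

irf_out <- irf(var_fit, impulse = "interest_rate",
               response = "gdp_growth", n.ahead = 20, boot = TRUE)
plot(irf_out)
```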
Local Projections
R manual implementation:
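A minimal sketch of the Jorda (2005) estimator: one regression per horizon of the h-step-ahead outcome on the shock, with Newey-West standard errors (the shock and lagged controls are hypothetical names):

```r
library(sandwich)
library(lmtest)

horizons <- 0:12
irf_lp <- sapply(horizons, function(h) {
  d        <- df
  d$y_lead <- dplyr::lead(d$y, h)   # h-step-ahead outcome
  fit <- lm(y_lead ~ shock + lag_y + lag_shock, data = d)
  ct  <- coeftest(fit, vcov = NeweyWest(fit, lag = h))
  ct["shock", c("Estimate", "Std. Error")]
})

plot(horizons, irf_lp["Estimate", ], type = "l",
     xlab = "Horizon", ylab = "Response of y to the shock")
```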
R with lpirfs package:
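A sketch using lp_lin(), reusing var_data from the VAR example (argument names per the package documentation; the data frame must contain only the endogenous variables):

```r
library(lpirfs)

lp_out <- lp_lin(endog_data     = var_data,
                 lags_endog_lin = 4,     # lags of the endogenous variables
                 trend          = 0,     # no deterministic trend
                 shock_type     = 1,     # unit shock
                 confint        = 1.96,  # 95% bands
                 hor            = 20)
plot(lp_out)
```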
Python:
Practical Guidance
Package Recommendations
| Task | R | Python |
| --- | --- | --- |
| Power analysis | pwr | statsmodels.stats.power |
| Balance tables | tableone, cobalt | custom |
| Matching | MatchIt | causalinference |
| IPW | WeightIt | causalinference |
| Doubly robust | AIPW | econml |
| Marginal effects | marginaleffects | marginaleffects |
| IV/2SLS | fixest, ivreg | linearmodels |
| DiD (basic) | fixest | linearmodels |
| DiD (modern) | did, fixest (sunab) | - |
| RD | rdrobust | rdrobust |
| Synthetic control | Synth, tidysynth | - |
| VAR | vars | statsmodels |
| Local projections | lpirfs | statsmodels (manual) |
Common Pitfalls
Pitfall 1: Forgetting to Cluster
Standard errors are often wrong without clustering on the appropriate level.
How to avoid: Default to clustered SEs. Cluster at the level of treatment assignment or above.
Box: Python Clustering—A Complete Guide
Python's ecosystem for clustered SEs is more fragmented than R's. Here's how to cluster in each package:
statsmodels (cross-sectional)
linearmodels (panel data)
pyfixest (new, recommended)
Warning: Default SEs in most Python packages are not clustered. Always specify clustering explicitly.
Pitfall 2: Wrong Reference Period in Event Studies
Not properly normalizing to a pre-treatment period.
How to avoid: Explicitly set the reference period to t-1. Verify that the coefficient at t-1 is exactly zero (it is the omitted category).
Pitfall 3: Ignoring Weak Instruments
Proceeding with IV despite a weak first stage.
How to avoid: Always report the first-stage F-statistic. Use weak-instrument-robust inference (e.g., Anderson-Rubin) if F < 10.
Implementation Checklist
Summary
Key takeaways:
fixest (R) and linearmodels (Python) provide efficient implementations of regression-based methods with proper standard errors.
Modern DiD estimators (Callaway-Sant'Anna, Sun-Abraham) address problems with staggered adoption; use did or fixest's sunab().
rdrobust provides the standard implementation for RD with optimal bandwidth selection and robust inference.
Returning to the opening question: The packages covered in this chapter implement the identification strategies from Part III. The key is not just running the commands but understanding the assumptions each method requires and conducting appropriate diagnostics. Code without understanding is dangerous; understanding without code is incomplete.
Further Reading
Essential
Cunningham, S. (2021). "Causal Inference: The Mixtape." Yale UP. [Code examples throughout]
Huntington-Klein, N. (2021). "The Effect." CRC Press. [R and Stata code]
Model Interpretation
Arel-Bundock, Greifer & Heiss (2024). "How to Interpret Statistical Models Using marginaleffects in R and Python." Journal of Statistical Software 111(9), 1–32. The definitive guide to marginal effects terminology and implementation.
Package Documentation
marginaleffects: https://marginaleffects.com/
fixest: https://lrberge.github.io/fixest/
did: https://bcallaway11.github.io/did/
rdrobust: https://rdpackages.github.io/rdrobust/
Exercises
Conceptual
1. Why do different matching methods (nearest neighbor, caliper, optimal) sometimes produce different treatment effect estimates? What does this tell us about the role of implementation choices in causal inference?
2. Explain why the fixest package reports both "regular" and "sunab" coefficients for event studies. When would you prefer one over the other?
3. A researcher runs rdrobust and gets a bandwidth of 5 units, but believes domain knowledge suggests a bandwidth of 10 is more appropriate. What are the tradeoffs of using the data-driven versus the researcher-selected bandwidth?
Applied
1. Using experimental data, conduct a full analysis: balance table, ATE estimation with and without controls, and randomization inference. Compare the results.
2. Implement both matching (MatchIt) and IPW (WeightIt) on observational data. Compare the balance achieved and the estimated treatment effects.
3. Replicate a published DiD study using the did package. Compare the Callaway-Sant'Anna estimates to standard TWFE.
4. Conduct an RD analysis: estimate the effect, create an RD plot, test for manipulation, and check covariate balance at the cutoff.
Discussion
R and Python ecosystems offer similar functionality for causal inference, but the packages differ in design philosophy. Compare fixest (R) with linearmodels or pyfixest (Python). What are the practical implications for researchers choosing between them?