Chapter 18: Programming Companion—Causal Inference

Opening Question

How do we implement the identification strategies from Part III in practical code?


Chapter Overview

This chapter provides practical implementations of the causal inference methods covered in Chapters 9-17. We cover experimental analysis, matching and weighting, instrumental variables, difference-in-differences, regression discontinuity, synthetic control, and time series causal methods.

The emphasis is on commonly used packages and workflows. For each method, we show basic implementation, diagnostics, and visualization in both R and Python where mature packages exist.

What you will learn:

  • How to analyze experimental data with proper randomization inference

  • How to implement matching and propensity score methods

  • How to estimate IV, DiD, RD, and synthetic control designs

  • How to create publication-quality tables and diagnostic plots

Prerequisites: Chapters 9-17 (conceptual foundations), Chapter 4 (programming basics)


18.1 Experimental Analysis

Power Analysis

R with pwr:
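
A minimal sketch, assuming a two-arm design powered to detect a standardized effect size of 0.2:

```r
library(pwr)

# Required n per arm for d = 0.2 at 80% power, alpha = 0.05
pwr.t.test(d = 0.2, sig.level = 0.05, power = 0.80,
           type = "two.sample", alternative = "two.sided")
```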

Python with statsmodels:
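
A sketch of the same calculation:

```python
from statsmodels.stats.power import TTestIndPower

# Required n per arm for the same design
analysis = TTestIndPower()
n = analysis.solve_power(effect_size=0.2, alpha=0.05, power=0.80,
                         alternative="two-sided")
print(round(n))  # roughly 394 per arm
```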

Balance Testing

R:
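
A sketch using tableone, assuming a data frame df with a treatment column and hypothetical covariates age, income, and education:

```r
library(tableone)

tab <- CreateTableOne(vars = c("age", "income", "education"),
                      strata = "treatment", data = df, test = FALSE)
print(tab, smd = TRUE)  # standardized mean differences < 0.1 suggest balance
```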

Python:
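
No mature package covers this, so a small custom function for standardized mean differences (same hypothetical columns):

```python
import numpy as np

def smd(x1, x0):
    """Standardized mean difference between treated and control samples."""
    return (x1.mean() - x0.mean()) / np.sqrt((x1.var(ddof=1) + x0.var(ddof=1)) / 2)

treated, control = df[df["treatment"] == 1], df[df["treatment"] == 0]
for col in ["age", "income", "education"]:
    print(col, round(smd(treated[col], control[col]), 3))
```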

Treatment Effect Estimation

R:
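
A sketch with fixest and heteroskedasticity-robust SEs (hypothetical df with y, treatment, x1, x2):

```r
library(fixest)

# Difference in means with robust (HC1) standard errors
feols(y ~ treatment, data = df, vcov = "hetero")

# With pre-treatment covariates for precision
feols(y ~ treatment + x1 + x2, data = df, vcov = "hetero")
```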

Python:
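
The statsmodels analogue:

```python
import statsmodels.formula.api as smf

# Difference in means with robust (HC2) standard errors
fit = smf.ols("y ~ treatment", data=df).fit(cov_type="HC2")
print(fit.summary().tables[1])

# With covariate adjustment for precision
fit_adj = smf.ols("y ~ treatment + x1 + x2", data=df).fit(cov_type="HC2")
```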

Randomization Inference

R with ri2:
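
A sketch assuming units were assigned by complete randomization; declare_ra() should mirror the actual design:

```r
library(ri2)
library(randomizr)

# Declare the actual assignment mechanism: here, complete randomization
declaration <- declare_ra(N = nrow(df))

ri_out <- conduct_ri(y ~ treatment,
                     assignment = "treatment",
                     declaration = declaration,
                     sharp_hypothesis = 0,  # sharp null of no effect for any unit
                     data = df, sims = 2000)
summary(ri_out)
plot(ri_out)  # permutation distribution vs. observed estimate
```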


18.2 Matching and Weighting

Propensity Score Estimation

R:
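
A minimal logistic propensity model (hypothetical covariates x1 through x3):

```r
ps_model <- glm(treatment ~ x1 + x2 + x3, family = binomial, data = df)
df$pscore <- predict(ps_model, type = "response")

# Inspect overlap by treatment group before matching or weighting
by(df$pscore, df$treatment, summary)
```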

Python:
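
The statsmodels equivalent:

```python
import statsmodels.formula.api as smf

ps_model = smf.logit("treatment ~ x1 + x2 + x3", data=df).fit()
df["pscore"] = ps_model.predict(df)

# Inspect overlap by treatment group
print(df.groupby("treatment")["pscore"].describe())
```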

Matching with MatchIt

R:
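
A sketch of 1:1 nearest-neighbor matching with a caliper (hypothetical variables as above):

```r
library(MatchIt)

m.out <- matchit(treatment ~ x1 + x2 + x3, data = df,
                 method = "nearest", distance = "glm",
                 caliper = 0.2, ratio = 1)
summary(m.out)  # balance before and after matching

# Estimate the ATT on the matched sample
m.data <- match.data(m.out)
fit <- lm(y ~ treatment, data = m.data, weights = weights)
```

Standard errors should account for the matching; the integration example later in this section shows the recommended pairing with marginaleffects.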

Inverse Probability Weighting

R with WeightIt:
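
A sketch estimating ATE weights and checking balance with cobalt:

```r
library(WeightIt)
library(cobalt)

w.out <- weightit(treatment ~ x1 + x2 + x3, data = df,
                  method = "glm", estimand = "ATE")
summary(w.out)  # weight ranges and effective sample sizes
bal.tab(w.out)  # balance after weighting

# Weighted outcome model; robust SEs because weights are estimated
fit <- lm(y ~ treatment, data = df, weights = w.out$weights)
lmtest::coeftest(fit, vcov = sandwich::vcovHC(fit, type = "HC1"))
```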

Python:
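
A manual IPW sketch with statsmodels:

```python
import statsmodels.formula.api as smf

ps = smf.logit("treatment ~ x1 + x2 + x3", data=df).fit().predict(df)
t = df["treatment"]
w = t / ps + (1 - t) / (1 - ps)  # ATE weights

# Weighted outcome regression with robust SEs
fit = smf.wls("y ~ treatment", data=df, weights=w).fit(cov_type="HC1")
print(fit.params["treatment"])
```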

IPW for Continuous Treatments

When treatment is continuous (dosage, duration, intensity), weights are based on probability density functions rather than propensity scores. See Section 11.4 for the conceptual foundation.

R with WeightIt:
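
A sketch; with a continuous treatment, weightit() models the treatment with a linear model and a normal density by default (hypothetical dose column):

```r
library(WeightIt)

# Stabilized density-ratio weights for a continuous treatment
w.out <- weightit(dose ~ x1 + x2 + x3, data = df,
                  method = "glm", stabilize = TRUE)
summary(w.out)

fit <- lm(y ~ dose, data = df, weights = w.out$weights)
```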

R manual implementation (to understand the mechanics):
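
A sketch assuming normal conditional densities, with stabilized weights:

```r
# Denominator: conditional density of dose given covariates
den_mod <- lm(dose ~ x1 + x2 + x3, data = df)
f_den <- dnorm(df$dose, mean = fitted(den_mod), sd = sigma(den_mod))

# Numerator (stabilization): marginal density of dose
num_mod <- lm(dose ~ 1, data = df)
f_num <- dnorm(df$dose, mean = fitted(num_mod), sd = sigma(num_mod))

df$w <- f_num / f_den
fit <- lm(y ~ dose, data = df, weights = w)
```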

Python:
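
The same mechanics in Python:

```python
import numpy as np
import statsmodels.formula.api as smf
from scipy.stats import norm

den = smf.ols("dose ~ x1 + x2 + x3", data=df).fit()
f_den = norm.pdf(df["dose"], loc=den.fittedvalues,
                 scale=np.sqrt(den.mse_resid))

num = smf.ols("dose ~ 1", data=df).fit()
f_num = norm.pdf(df["dose"], loc=num.fittedvalues,
                 scale=np.sqrt(num.mse_resid))

w = f_num / f_den
fit = smf.wls("y ~ dose", data=df, weights=w).fit(cov_type="HC1")
```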

Diagnostic Note: With continuous treatments, check that weights are not too extreme. Unlike binary IPW, where weights blow up for treated units with propensity scores near 0 (or controls near 1), continuous IPW produces extreme weights when a unit's observed treatment value is very unlikely given its covariates. Examine the weight distribution and consider trimming at the 1st and 99th percentiles, as sketched below.
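
A quick version of that check, reusing the hypothetical weight column w from above:

```r
# Inspect and trim extreme weights at the 1st/99th percentiles
summary(df$w)
q <- quantile(df$w, c(0.01, 0.99))
df$w_trim <- pmin(pmax(df$w, q[1]), q[2])
```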

Doubly Robust Estimation

R:
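
A sketch with the AIPW package's R6 interface, using SuperLearner libraries for the outcome and propensity models:

```r
library(AIPW)
library(SuperLearner)

aipw <- AIPW$new(Y = df$y, A = df$treatment,
                 W = df[, c("x1", "x2", "x3")],
                 Q.SL.library = c("SL.glm"),  # outcome model
                 g.SL.library = c("SL.glm"),  # propensity model
                 k_split = 5, verbose = FALSE)
aipw$fit()$summary()
```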

Model Interpretation with marginaleffects

When estimating treatment effects from nonlinear models (logit, probit, Poisson) or models with interactions, regression coefficients are not directly interpretable as causal effects. The marginaleffects package (Arel-Bundock, Greifer & Heiss 2024) provides a unified framework for computing predictions, comparisons, and slopes across 100+ model types—implementing the g-computation approach discussed in Chapter 11.

Core functions:

  • predictions(): Predicted outcomes at specific covariate values

  • comparisons(): Differences in predictions (contrasts, risk differences)

  • slopes(): Marginal effects (derivatives)

  • avg_*() versions: Average across all observations (AME)

R:
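
A sketch with a logistic outcome model (hypothetical df):

```r
library(marginaleffects)

m <- glm(y ~ treatment * x1 + x2, family = binomial, data = df)

# Average marginal effect of treatment on the probability scale (g-computation)
avg_comparisons(m, variables = "treatment")

# Effects at specific covariate values
comparisons(m, variables = "treatment", newdata = datagrid(x1 = c(0, 1)))

# Average slope of a continuous covariate
avg_slopes(m, variables = "x1")
```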

Python:
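
The Python package mirrors the R API; a sketch assuming a statsmodels formula fit:

```python
import statsmodels.formula.api as smf
from marginaleffects import avg_comparisons, avg_slopes

m = smf.logit("y ~ treatment * x1 + x2", data=df).fit()

print(avg_comparisons(m, variables="treatment"))  # AME of treatment
print(avg_slopes(m, variables="x1"))
```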

Connecting to causal estimands:

| Causal Quantity | marginaleffects Function |
|---|---|
| ATE (binary treatment) | avg_comparisons(model, variables = "treatment") |
| ATT | avg_comparisons(model, variables = "treatment", newdata = subset(data, treatment == 1)) |
| Dose-response | avg_predictions(model, variables = list(dose = c(0, 1, 2, 3))) |
| CATE by subgroup | comparisons(model, variables = "treatment", by = "subgroup") |
| Marginal effect of continuous X | avg_slopes(model, variables = "X") |

Why this matters: When you run glm(y ~ treatment, family = binomial) and report the coefficient, you're reporting a log odds ratio—not the probability change that answers "what is the effect of treatment?" The average marginal effect from avg_comparisons() answers the question you actually care about.

Integration with causal inference packages:
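
A sketch of the matching-then-g-computation workflow recommended in the MatchIt documentation (hypothetical variables as above):

```r
library(MatchIt)
library(marginaleffects)

m.out <- matchit(treatment ~ x1 + x2 + x3, data = df, method = "nearest")
m.data <- match.data(m.out)

# Flexible outcome model on the matched sample
fit <- lm(y ~ treatment * (x1 + x2 + x3), data = m.data, weights = weights)

# ATT via g-computation, clustering on matched pair (subclass)
avg_comparisons(fit, variables = "treatment",
                vcov = ~subclass, wts = "weights",
                newdata = subset(m.data, treatment == 1))
```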

Practical Box: When to Use marginaleffects

Use marginaleffects when:

  • Your outcome model is nonlinear (logit, probit, Poisson, etc.)

  • You have interactions and want effects at specific covariate values

  • You want to report effects on the natural scale (probabilities, counts)

  • You need to combine it with matching/weighting

You don't need it when:

  • Your model is linear with no interactions (coefficients = marginal effects)

  • You're using packages that already compute ATEs (e.g., AIPW, did)


18.3 Instrumental Variables

Basic 2SLS

R with fixest:
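
A sketch; in fixest, the endogenous regressor and its instrument form the last formula part (hypothetical instrument z):

```r
library(fixest)

# Outcome ~ exogenous controls | endogenous ~ instrument
iv <- feols(y ~ x1 + x2 | treatment ~ z, data = df, vcov = "hetero")
summary(iv)
summary(iv, stage = 1)  # first stage
fitstat(iv, "ivf")      # first-stage F statistic
```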

R with ivreg:
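
The same model in ivreg's two-part formula, where exogenous controls appear on both sides of the |:

```r
library(ivreg)

iv2 <- ivreg(y ~ x1 + x2 + treatment | x1 + x2 + z, data = df)
summary(iv2, diagnostics = TRUE)  # weak-instruments F, Wu-Hausman, Sargan
```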

Python:
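
A linearmodels sketch; the bracket syntax marks the endogenous variable and its instrument:

```python
from linearmodels.iv import IV2SLS

iv = IV2SLS.from_formula("y ~ 1 + x1 + x2 + [treatment ~ z]",
                         data=df).fit(cov_type="robust")
print(iv.summary)
print(iv.first_stage)  # first-stage diagnostics, including partial F
```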

Weak Instrument Diagnostics

R:
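
Continuing from the fitted objects above:

```r
# fixest: first-stage F and Wald statistics
fitstat(iv, ~ ivf + ivwald)

# ivreg: the diagnostics table reports the weak-instruments F test
summary(iv2, diagnostics = TRUE)

# If F < 10, switch to weak-instrument-robust inference
# (e.g., Anderson-Rubin confidence sets)
```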


18.4 Difference-in-Differences

Basic Two-Period DiD

R:
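
A 2x2 sketch, assuming hypothetical indicator columns treated, post, and their product treat_post:

```r
library(fixest)

# Interaction form, clustering on unit
did <- feols(y ~ treated * post, data = df, cluster = ~unit)

# Equivalent TWFE form with unit and period fixed effects
did_fe <- feols(y ~ treat_post | unit + period, data = df, cluster = ~unit)
etable(did, did_fe)
```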

Python:
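
The statsmodels analogue:

```python
import statsmodels.formula.api as smf

did = smf.ols("y ~ treated * post", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["unit"]})
print(did.params["treated:post"])  # the DiD estimate
```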

Event Studies

R with fixest:
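
A sketch assuming a hypothetical rel_time column counting periods relative to treatment, with t = -1 as the reference:

```r
es <- feols(y ~ i(rel_time, treated, ref = -1) | unit + period,
            data = df, cluster = ~unit)
iplot(es)  # event-study plot, normalized to zero at t = -1
```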

Custom event study plot:
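
A sketch that pulls the coefficients out of the fixest object and plots them with ggplot2:

```r
library(ggplot2)

ct <- as.data.frame(coeftable(es))
names(ct)[1:2] <- c("est", "se")
ct$rel_time <- as.numeric(gsub(".*::(-?\\d+).*", "\\1", rownames(ct)))

ggplot(ct, aes(rel_time, est)) +
  geom_pointrange(aes(ymin = est - 1.96 * se, ymax = est + 1.96 * se)) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  geom_vline(xintercept = -0.5, linetype = "dotted") +
  labs(x = "Periods relative to treatment", y = "Estimate")
```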

Modern DiD Estimators

Callaway-Sant'Anna with did package:
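
A sketch; gname holds the period of first treatment, with 0 for never-treated units (the did convention):

```r
library(did)

cs <- att_gt(yname = "y", tname = "period", idname = "unit",
             gname = "first_treat",
             data = df, control_group = "notyettreated")

agg <- aggte(cs, type = "dynamic")  # event-study aggregation
summary(agg)
ggdid(agg)
```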

Sun-Abraham with fixest:
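
A sketch reusing the same cohort variable:

```r
sa <- feols(y ~ sunab(first_treat, period) | unit + period,
            data = df, cluster = ~unit)
iplot(sa)                 # interaction-weighted event study
summary(sa, agg = "att")  # overall ATT aggregation
```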


18.5 Regression Discontinuity

Sharp RD with rdrobust

R:
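
A minimal sketch with a cutoff at 0 (hypothetical running variable column):

```r
library(rdrobust)

rd <- rdrobust(y = df$y, x = df$running, c = 0)
summary(rd)  # conventional, bias-corrected, and robust estimates
```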

Visualization:
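
rdplot() chooses the binning automatically:

```r
rdplot(y = df$y, x = df$running, c = 0,
       x.label = "Running variable", y.label = "Outcome")
```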

Python:
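
The Python port exposes the same functions:

```python
from rdrobust import rdrobust, rdplot

est = rdrobust(y=df["y"], x=df["running"], c=0)
print(est)
rdplot(y=df["y"], x=df["running"], c=0)
```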

Manipulation Testing

R:
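
A sketch with rddensity:

```r
library(rddensity)

dens <- rddensity(X = df$running, c = 0)
summary(dens)  # density test; a small p-value suggests manipulation

rdplotdensity(dens, df$running)
```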

Fuzzy RD

R:
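
A sketch; passing take-up via the fuzzy argument makes rdrobust estimate the local 2SLS:

```r
rd_fuzzy <- rdrobust(y = df$y, x = df$running, c = 0, fuzzy = df$takeup)
summary(rd_fuzzy)
```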

Covariate Balance at Cutoff

R:
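
A sketch re-running the RD with a pre-determined covariate as the outcome:

```r
# The discontinuity in a pre-determined covariate should be ~0
bal_age <- rdrobust(y = df$age, x = df$running, c = 0)
summary(bal_age)
```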


18.6 Synthetic Control

Basic Synthetic Control

R with Synth:
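
A sketch; Synth requires numeric unit identifiers and an explicit donor pool (hypothetical panel_df, identifiers, and years):

```r
library(Synth)

dp <- dataprep(foo = panel_df,
               predictors = c("x1", "x2"),
               dependent = "y",
               unit.variable = "unit_id",
               unit.names.variable = "unit_name",
               time.variable = "year",
               treatment.identifier = 1,
               controls.identifier = 2:20,
               time.predictors.prior = 1990:1999,
               time.optimize.ssr = 1990:1999,
               time.plot = 1990:2010)

out <- synth(dp)
path.plot(synth.res = out, dataprep.res = dp)
gaps.plot(synth.res = out, dataprep.res = dp)  # treated minus synthetic
```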

Inference with Permutation

R:
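
A sketch of the in-place placebo test, reusing the dataprep() arguments above: assign treatment to each donor in turn and compare gap sizes. An effect is credible when the treated unit's post-period gap is extreme relative to the placebo gaps.

```r
gaps <- lapply(2:20, function(id) {
  dp_p <- dataprep(foo = panel_df, predictors = c("x1", "x2"),
                   dependent = "y", unit.variable = "unit_id",
                   unit.names.variable = "unit_name", time.variable = "year",
                   treatment.identifier = id,
                   controls.identifier = setdiff(2:20, id),
                   time.predictors.prior = 1990:1999,
                   time.optimize.ssr = 1990:1999, time.plot = 1990:2010)
  out_p <- synth(dp_p)
  dp_p$Y1plot - dp_p$Y0plot %*% out_p$solution.w  # placebo gap series
})
```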

tidysynth Package

R with tidysynth (more user-friendly):
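
A sketch of the pipe-based workflow (hypothetical panel_df):

```r
library(tidysynth)

synth_out <- panel_df %>%
  synthetic_control(outcome = y, unit = unit_name, time = year,
                    i_unit = "Treated Unit", i_time = 2000,
                    generate_placebos = TRUE) %>%
  generate_predictor(time_window = 1990:1999,
                     x1 = mean(x1, na.rm = TRUE),
                     x2 = mean(x2, na.rm = TRUE)) %>%
  generate_weights(optimization_window = 1990:1999) %>%
  generate_control()

synth_out %>% plot_trends()
synth_out %>% plot_placebos()
synth_out %>% grab_significance()  # pseudo p-values from placebo ranks
```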

Python Implementation

Python lacks a dedicated synthetic control package as mature as R's Synth, but implementation is straightforward using optimization:
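
A minimal sketch, assuming pre-built NumPy arrays: y_pre_treated of shape (T0,) and Y_pre_donors of shape (T0, J), plus full-period analogues y_treated and Y_donors (all hypothetical):

```python
import numpy as np
from scipy.optimize import minimize

def synth_weights(y_pre_treated, Y_pre_donors):
    """Convex donor weights minimizing pre-period prediction error."""
    J = Y_pre_donors.shape[1]
    loss = lambda w: np.sum((y_pre_treated - Y_pre_donors @ w) ** 2)
    res = minimize(loss, np.full(J, 1 / J),
                   bounds=[(0, 1)] * J,
                   constraints=({"type": "eq",
                                 "fun": lambda w: w.sum() - 1},),
                   method="SLSQP")
    return res.x

w = synth_weights(y_pre_treated, Y_pre_donors)
gap = y_treated - Y_donors @ w  # treated minus synthetic, all periods
```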

Permutation Inference for Synthetic Control
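
A sketch reusing synth_weights() from above; hypothetical arrays Y_pre (pre-period) and Y_all (all periods) stack the treated unit and donors column-wise, treated unit in column 0:

```python
import numpy as np

def placebo_gaps(Y_pre, Y_all):
    """Refit the synthetic control treating each unit as if treated."""
    gaps = []
    for j in range(Y_all.shape[1]):
        donors = [k for k in range(Y_all.shape[1]) if k != j]
        w = synth_weights(Y_pre[:, j], Y_pre[:, donors])
        gaps.append(Y_all[:, j] - Y_all[:, donors] @ w)
    return np.column_stack(gaps)

gaps = placebo_gaps(Y_pre, Y_all)
# Pseudo p-value: rank of the treated unit's post-period gap among placebos
```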

Note: For production use, consider the SyntheticControlMethods package on PyPI or implement with cvxpy for more robust optimization. The above code illustrates the core algorithm.


18.7 Time Series Causal Methods

VAR and SVAR

R with vars:
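
A sketch, assuming ts_data is a multivariate time series with hypothetical columns policy and output:

```r
library(vars)

sel <- VARselect(ts_data, lag.max = 8, type = "const")
fit <- VAR(ts_data, p = sel$selection[["AIC(n)"]], type = "const")

# Orthogonalized (Cholesky) impulse responses; variable ordering matters
irfs <- irf(fit, impulse = "policy", response = "output",
            n.ahead = 20, boot = TRUE, runs = 500)
plot(irfs)
```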

Local Projections

R manual implementation:
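
A sketch estimating the response of y to shock at each horizon:

```r
library(dplyr)

horizons <- 0:12
irf <- sapply(horizons, function(h) {
  dat <- df %>%
    mutate(y_lead = lead(y, h), y_l1 = lag(y), shock_l1 = lag(shock))
  m <- lm(y_lead ~ shock + y_l1 + shock_l1, data = dat)
  # Newey-West SEs are advisable at longer horizons:
  # lmtest::coeftest(m, vcov = sandwich::NeweyWest(m, lag = h))
  coef(m)["shock"]
})
plot(horizons, irf, type = "b", xlab = "Horizon", ylab = "Response of y")
```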

R with lpirfs package:
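
A sketch; lp_lin() identifies shocks recursively from the column ordering of the endogenous data:

```r
library(lpirfs)

results <- lp_lin(endog_data = ts_df, lags_endog_lin = 4,
                  trend = 0, shock_type = 1, confint = 1.96, hor = 12)
plot(results)
```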

Python:
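
The manual loop in Python:

```python
import statsmodels.formula.api as smf

irf = []
for h in range(13):
    d = df.assign(y_lead=df["y"].shift(-h),
                  y_l1=df["y"].shift(1),
                  shock_l1=df["shock"].shift(1)).dropna()
    m = smf.ols("y_lead ~ shock + y_l1 + shock_l1", data=d).fit(
        cov_type="HAC", cov_kwds={"maxlags": h + 1})  # Newey-West SEs
    irf.append((m.params["shock"], m.bse["shock"]))
```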


Practical Guidance

Package Recommendations

| Method | R Package | Python Package |
|---|---|---|
| Power analysis | pwr | statsmodels.stats.power |
| Balance tables | tableone, cobalt | Custom |
| Matching | MatchIt | causalinference |
| IPW | WeightIt | causalinference |
| Doubly robust | AIPW | econml |
| Marginal effects | marginaleffects | marginaleffects |
| IV/2SLS | fixest, ivreg | linearmodels |
| DiD (basic) | fixest | linearmodels |
| DiD (modern) | did, fixest (sunab) | - |
| RD | rdrobust | rdrobust |
| Synthetic control | Synth, tidysynth | - |
| VAR | vars | statsmodels |
| Local projections | lpirfs | statsmodels (manual) |

Common Pitfalls

Pitfall 1: Forgetting to Cluster. Standard errors are often wrong without clustering at the appropriate level.

How to avoid: Default to clustered SEs. Cluster at the level of treatment assignment or above.

Box: Python Clustering—A Complete Guide

Python's ecosystem for clustered SEs is more fragmented than R's. Here's how to cluster in each package:

statsmodels (cross-sectional)

linearmodels (panel data)

pyfixest (new, recommended)

Warning: Default SEs in most Python packages are not clustered. Always specify clustering explicitly.

Pitfall 2: Wrong Reference Period in Event Studies. Failing to normalize coefficients to a pre-treatment period.

How to avoid: Explicitly set the reference period to t-1 and verify that the plotted coefficient at t-1 is exactly zero; since it is the omitted category, any other value signals a coding error.

Pitfall 3: Ignoring Weak Instruments. Proceeding with IV despite a weak first stage.

How to avoid: Always report first-stage F. Use weak-instrument-robust inference if F < 10.

Implementation Checklist

  • Cluster standard errors at the level of treatment assignment or above

  • Report balance diagnostics (standardized mean differences) before and after matching or weighting

  • Inspect weight distributions and consider trimming extreme weights

  • Report the first-stage F statistic for every IV specification

  • Normalize event studies to t-1 and verify the reference coefficient is zero

  • For RD: report the bandwidth, run a manipulation test, and check covariate balance at the cutoff

  • For synthetic control: show pre-period fit and permutation-based placebo tests


Summary

Key takeaways:

  1. fixest (R) and linearmodels (Python) provide efficient implementations of regression-based methods with proper standard errors.

  2. Modern DiD estimators (Callaway-Sant'Anna, Sun-Abraham) address problems with staggered adoption; use did or fixest's sunab().

  3. rdrobust provides the standard implementation for RD with optimal bandwidth selection and robust inference.

Returning to the opening question: The packages covered in this chapter implement the identification strategies from Part III. The key is not just running the commands but understanding the assumptions each method requires and conducting appropriate diagnostics. Code without understanding is dangerous; understanding without code is incomplete.


Further Reading

Essential

  • Cunningham, S. (2021). "Causal Inference: The Mixtape." Yale UP. [Code examples throughout]

  • Huntington-Klein, N. (2021). "The Effect." CRC Press. [R and Stata code]

Model Interpretation

  • Arel-Bundock, Greifer & Heiss (2024). "How to Interpret Statistical Models Using marginaleffects in R and Python." Journal of Statistical Software 111(9), 1–32. The definitive guide to marginal effects terminology and implementation.

Package Documentation

  • marginaleffects: https://marginaleffects.com/

  • fixest: https://lrberge.github.io/fixest/

  • did: https://bcallaway11.github.io/did/

  • rdrobust: https://rdpackages.github.io/rdrobust/


Exercises

Conceptual

  1. Why do different matching methods (nearest neighbor, caliper, optimal) sometimes produce different treatment effect estimates? What does this tell us about the role of implementation choices in causal inference?

  2. Event-study estimates from fixest's standard i() interaction can differ from those produced by sunab(). Explain why, and describe when you would prefer one over the other.

  3. A researcher runs rdrobust and gets a bandwidth of 5 units, but believes domain knowledge suggests a bandwidth of 10 is more appropriate. What are the tradeoffs of using the data-driven versus researcher-selected bandwidth?

Applied

  1. Using experimental data, conduct a full analysis: balance table, ATE estimation with and without controls, and randomization inference. Compare the results.

  2. Implement both matching (MatchIt) and IPW (WeightIt) on observational data. Compare the balance achieved and the estimated treatment effects.

  3. Replicate a published DiD study using the did package. Compare the Callaway-Sant'Anna estimates to standard TWFE.

  4. Conduct an RD analysis: estimate the effect, create an RD plot, test for manipulation, and check covariate balance at the cutoff.

Discussion

  1. R and Python ecosystems offer similar functionality for causal inference, but the packages differ in design philosophy. Compare fixest (R) with linearmodels or pyfixest (Python). What are the practical implications for researchers choosing between them?
