Chapter 13: Difference-in-Differences

Opening Question

When a policy changes in one place but not another, how can we use the comparison to learn about the policy's causal effect?


Chapter Overview

Difference-in-differences (DiD) is perhaps the most widely used quasi-experimental method in applied economics. The core idea is elegantly simple: compare the change in outcomes over time for a group affected by a treatment to the change for a group that was not affected. If both groups would have evolved similarly in the absence of treatment, the difference in their changes identifies the causal effect.

This simplicity is deceptive. The parallel trends assumption—that treated and control groups would have followed the same trajectory absent treatment—is fundamentally untestable. Moreover, recent methodological work has revealed that the standard two-way fixed effects (TWFE) estimator can produce severely biased estimates when treatment effects vary across units or over time, precisely the settings where DiD is most commonly applied. This chapter develops both the classical DiD framework and the modern solutions to these problems.

What you will learn:

  • The logic of DiD and the parallel trends assumption

  • How to assess parallel trends and conduct event study analysis

  • Why TWFE fails with staggered adoption and heterogeneous effects

  • Modern estimators that address these problems (Callaway-Sant'Anna, Sun-Abraham, and others)

  • When DiD is and is not a credible identification strategy

Prerequisites: Chapter 9 (Causal Framework), Chapter 3 (Statistical Foundations)


13.1 The Basic Difference-in-Differences Setup

The Canonical 2×2 Case

Consider the simplest DiD setting: two groups observed in two time periods, where one group receives treatment between periods and the other does not. Let $Y_{it}$ denote the outcome for unit $i$ at time $t$, with $t \in \{0, 1\}$ (pre and post) and groups $G \in \{0, 1\}$ (control and treated).

The DiD estimator is:

$$\hat{\tau}^{DiD} = (\bar{Y}_{1,1} - \bar{Y}_{1,0}) - (\bar{Y}_{0,1} - \bar{Y}_{0,0})$$

where $\bar{Y}_{g,t}$ is the average outcome for group $g$ in period $t$. This is the change in the treated group minus the change in the control group—the "difference in differences."

Definition 13.1 (Parallel Trends Assumption): In the absence of treatment, the average outcomes for treated and control groups would have followed parallel paths over time: $$E[Y_{i1}(0) - Y_{i0}(0) \mid G_i = 1] = E[Y_{i1}(0) - Y_{i0}(0) \mid G_i = 0]$$ where $Y_{it}(0)$ denotes the potential outcome without treatment.

Intuition: The control group's change over time tells us what would have happened to the treated group had they not been treated. We're not assuming the groups are identical—they can have different levels—only that they would have changed by the same amount.
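
To fix the mechanics, here is a minimal sketch in Python that computes the four cell means and the DiD estimate by hand. The data are simulated and the effect size (2.0) is made up:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Simulated 2x2 data: 'group' (1 = treated), 'post' (1 = post-period).
n = 2000
group = rng.integers(0, 2, n)
post = rng.integers(0, 2, n)
# Outcome: different levels by group, a common time trend, and a
# true treatment effect of 2.0 in the treated-post cell.
y = 10 + 3 * group + 1.5 * post + 2.0 * group * post + rng.normal(0, 1, n)
df = pd.DataFrame({"group": group, "post": post, "y": y})

# Four cell means, then the difference in differences.
cells = df.groupby(["group", "post"])["y"].mean()
did = (cells[1, 1] - cells[1, 0]) - (cells[0, 1] - cells[0, 0])
print(f"DiD estimate: {did:.2f}")  # should be near the true effect, 2.0
```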

Regression Implementation

The 2×2 DiD can be estimated via regression:

$$Y_{it} = \alpha + \beta \cdot Treated_i + \gamma \cdot Post_t + \tau \cdot (Treated_i \times Post_t) + \varepsilon_{it}$$

where:

  • $\alpha$ = baseline mean for control group in pre-period

  • $\beta$ = level difference between treated and control groups (pre-treatment)

  • $\gamma$ = time trend common to both groups

  • $\tau$ = the DiD estimate (coefficient of interest)

This regression formulation extends naturally to include covariates and to settings with more groups and time periods.
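
A sketch of the regression implementation with statsmodels, again on simulated data. The `treated * post` formula term expands to the two main effects plus the interaction, whose coefficient is the DiD estimate:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 2000
df = pd.DataFrame({
    "treated": rng.integers(0, 2, n),
    "post": rng.integers(0, 2, n),
})
df["y"] = (10 + 3 * df["treated"] + 1.5 * df["post"]
           + 2.0 * df["treated"] * df["post"] + rng.normal(0, 1, n))

# treated * post expands to treated + post + treated:post.
res = smf.ols("y ~ treated * post", data=df).fit()
print(res.params)  # Intercept=alpha, treated=beta, post=gamma, treated:post=tau
```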

Example: Card and Krueger's Minimum Wage Study

The most famous DiD application is Card and Krueger's (1994) study of New Jersey's 1992 minimum wage increase. New Jersey raised its minimum wage from $4.25 to $5.05 per hour; neighboring Pennsylvania did not change its minimum wage.

Setup:

  • Treated group: Fast food restaurants in New Jersey

  • Control group: Fast food restaurants in eastern Pennsylvania

  • Pre-period: February 1992 (before NJ increase)

  • Post-period: November 1992 (after NJ increase)

  • Outcome: Employment (full-time equivalent employees)

Results:

| FTE employment | NJ (Treated) | PA (Control) | Difference |
| --- | --- | --- | --- |
| Before | 20.44 | 23.33 | -2.89 |
| After | 21.03 | 21.17 | -0.14 |
| Change | +0.59 | -2.16 | +2.75 |

The DiD estimate of +2.75 suggests the minimum wage increase raised employment, contradicting the standard competitive model prediction. This finding sparked decades of debate and methodological refinement.

Why DiD here? Simple before-after comparison in NJ would confound the minimum wage effect with any other changes occurring between February and November 1992 (seasonal effects, macroeconomic conditions). The PA comparison removes these common trends.

Figure 13.1: The Basic Difference-in-Differences Design. The treated group's outcome jumps at treatment; the control group follows its trend. The DiD estimate equals the treated group's change minus the control group's change. The dashed line shows the counterfactual path for the treated group under parallel trends.

13.2 The Parallel Trends Assumption

Parallel trends is an assumption about counterfactual outcomes—what would have happened to the treated group absent treatment. This is fundamentally untestable because we never observe the treated group's counterfactual post-treatment trajectory.

What parallel trends does NOT require:

  • That treated and control groups have similar outcome levels

  • That groups have identical characteristics

  • That the groups would have the same outcomes in any period

What parallel trends DOES require:

  • That in the absence of treatment, both groups would have experienced the same change in outcomes

  • That any time-varying factors affecting outcomes are either common to both groups or unrelated to treatment assignment

Assessing Pre-Trends

While we cannot test parallel trends directly, we can examine whether outcomes evolved similarly before treatment. If pre-treatment trends diverge, parallel trends in the post-period is implausible.

Implementation: Estimate an event study specification (Section 13.3) and examine coefficients for pre-treatment periods. Statistically significant pre-trends are a red flag.

Limitations of pre-trends testing:

  1. Statistical power: Failure to reject parallel pre-trends may reflect low power rather than true parallel trends

  2. Anticipation effects: Divergence just before treatment may reflect anticipation rather than pre-existing trends

  3. Non-linear trends: Parallel linear pre-trends don't guarantee parallel post-trends if true trajectories are non-linear

  4. Ashenfelter's dip: In some settings (e.g., job training programs), treated units experience a dip just before treatment that would reverse even absent treatment

Pitfall: The Pre-Trends Fallacy. Showing that pre-treatment coefficients are individually insignificant does not validate parallel trends. A joint test of all pre-treatment coefficients provides more power. Even better: present the pre-trends visually and discuss their magnitude relative to estimated effects.
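
A sketch of such a joint test on simulated data; the dummy names (`lead2`, `lead3`, `lag0`, ...) and all magnitudes are illustrative:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
# Toy panel: 200 units, periods -3..2, treatment at period 0 for half the units.
units, periods = np.arange(200), np.arange(-3, 3)
df = pd.DataFrame([(i, t) for i in units for t in periods], columns=["unit", "t"])
df["treated"] = (df["unit"] < 100).astype(int)
df["y"] = (5 + 2 * df["treated"] + 0.5 * df["t"]
           + 1.5 * df["treated"] * (df["t"] >= 0)
           + rng.normal(0, 1, len(df)))

# Lead (pre-treatment) and lag (post) dummies, omitting k = -1.
for k in [-3, -2, 0, 1, 2]:
    name = f"lead{-k}" if k < 0 else f"lag{k}"
    df[name] = ((df["t"] == k) & (df["treated"] == 1)).astype(int)

res = smf.ols("y ~ C(unit) + C(t) + lead3 + lead2 + lag0 + lag1 + lag2",
              data=df).fit()
# Joint test that all pre-treatment coefficients are zero.
print(res.f_test("lead3 = 0, lead2 = 0"))
```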

Figure 13.2: Parallel Trends—Valid vs. Violated. Left panel shows parallel pre-trends, supporting the DiD identification strategy. Right panel shows diverging pre-trends, indicating the treated group was already on a different trajectory before treatment—the DiD estimate would be biased.

Conditional Parallel Trends

When unconditional parallel trends is implausible, we may appeal to a weaker assumption:

Definition 13.2 (Conditional Parallel Trends): Parallel trends holds after conditioning on observable covariates $X_i$: $$E[Y_{i1}(0) - Y_{i0}(0) \mid G_i = 1, X_i] = E[Y_{i1}(0) - Y_{i0}(0) \mid G_i = 0, X_i]$$

Implementation approaches:

  • Regression with covariates: Include $X_i$ in the DiD regression (but only baseline covariates, not post-treatment variables); a sketch follows below

  • Matching/reweighting: Match treated and control units on $X_i$ before computing DiD

  3. Doubly robust DiD: Combine outcome modeling and propensity score weighting (Sant'Anna and Zhao, 2020)

Which covariates to include? Include variables that:

  • Predict outcome trends (not just levels)

  • Differ between treated and control groups

  • Are measured pre-treatment (never condition on post-treatment variables)
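
A minimal sketch of the regression approach (item 1 above), interacting a baseline covariate with the post-period indicator so that trends may differ by $X_i$. The covariate name and all magnitudes are made up:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 4000
df = pd.DataFrame({
    "treated": rng.integers(0, 2, n),
    "post": rng.integers(0, 2, n),
})
# Baseline covariate that both predicts trends and differs by treatment status.
df["x_base"] = rng.normal(0, 1, n) + 0.5 * df["treated"]
df["y"] = (10 + 2 * df["treated"] + 1.0 * df["post"]
           + 0.8 * df["x_base"] * df["post"]   # covariate-specific trend
           + 1.5 * df["treated"] * df["post"]  # true effect
           + rng.normal(0, 1, n))

# Allow the time trend to vary with the baseline covariate.
res = smf.ols("y ~ treated * post + x_base + x_base:post", data=df).fit()
print(res.params["treated:post"])  # close to 1.5 once x-trends are modeled
```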

When Is Parallel Trends Plausible?

Parallel trends is most plausible when:

  1. Treatment is as-if random conditional on fixed effects: Policy variation driven by idiosyncratic factors unrelated to outcome trends

  2. Treated and control units are similar: Geographic neighbors, same industry, comparable demographics

  3. Treatment timing is plausibly exogenous: Not driven by anticipation of differential trends

  4. Pre-trends are convincingly parallel: Multiple pre-periods show similar evolution

Parallel trends is least plausible when:

  1. Selection into treatment: Units adopt treatment because of anticipated effects

  2. Treated units are structurally different: Different industries, demographics, or growth trajectories

  3. Treatment timing is endogenous: Early adopters differ systematically from late adopters

  4. Only one pre-period: Cannot assess pre-existing trends


13.3 Event Studies

From DiD to Dynamic Treatment Effects

The basic 2×2 DiD assumes a constant treatment effect. In practice, effects may:

  • Build up gradually over time

  • Fade out after initial impact

  • Differ before and after treatment (anticipation effects)

Event study designs generalize DiD to trace out the dynamic path of treatment effects.

The Event Study Specification

For a setting where all treated units receive treatment at the same time $t^*$, the event study specification is:

$$Y_{it} = \alpha_i + \gamma_t + \sum_{k \neq -1} \tau_k \cdot \mathbf{1}[t - t^* = k] \cdot Treated_i + \varepsilon_{it}$$

where:

  • $\alpha_i$ = unit fixed effects

  • $\gamma_t$ = time fixed effects

  • $k$ = event time (periods relative to treatment)

  • $\tau_k$ = effect at event time $k$

  • The omitted category ($k = -1$) normalizes effects relative to the period just before treatment

Interpreting the coefficients:

  • $\tau_k$ for $k < 0$: Pre-treatment differences (should be approximately zero under parallel trends)

  • $\tau_0$: Immediate treatment effect

  • $\tau_k$ for $k > 0$: Post-treatment effects (can show dynamics)
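
A sketch of the full pipeline on simulated data: build event-time dummies (omitting $k = -1$), fit with unit and time fixed effects, cluster by unit, and plot the coefficients. All names and magnitudes are illustrative:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)
units, periods, t_star = np.arange(100), np.arange(8), 4
df = pd.DataFrame([(i, t) for i in units for t in periods], columns=["unit", "t"])
df["treated"] = (df["unit"] < 50).astype(int)
df["k"] = df["t"] - t_star  # event time
# True effect grows with exposure: 1.0, 1.5, 2.0, 2.5 at k = 0..3.
effect = np.where((df["treated"] == 1) & (df["k"] >= 0), 1.0 + 0.5 * df["k"], 0.0)
df["y"] = 3 + df["treated"] + 0.3 * df["t"] + effect + rng.normal(0, 1, len(df))

ks = [k for k in sorted(df["k"].unique()) if k != -1]  # omit k = -1
dummies = []
for k in ks:
    name = f"ev_{k}".replace("-", "m")  # e.g., ev_m2 for k = -2
    df[name] = ((df["k"] == k) & (df["treated"] == 1)).astype(int)
    dummies.append(name)

res = smf.ols("y ~ C(unit) + C(t) + " + " + ".join(dummies), data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["unit"]})

coefs = [res.params[d] for d in dummies]
ses = [res.bse[d] for d in dummies]
plt.errorbar(ks, coefs, yerr=[1.96 * s for s in ses], fmt="o", capsize=3)
plt.axvline(-0.5, linestyle="--")  # treatment occurs between k = -1 and k = 0
plt.axhline(0, linewidth=0.5)
plt.xlabel("Event time k")
plt.ylabel("Estimated effect (relative to k = -1)")
plt.show()
```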

Visualization

Event study plots are among the most informative figures in applied economics:

A well-executed event study shows:

  • Point estimates for each event time

  • Confidence intervals (95%)

  • Clear marking of the treatment date

  • Pre-trends hovering around zero

  • Post-treatment effects revealing the dynamic pattern

Figure 13.3: Event Study Design. Pre-treatment coefficients should be near zero (testing parallel trends). Post-treatment coefficients trace out the dynamic treatment effect. The reference period ($k = -1$) is normalized to zero. Confidence intervals widen further from the event.

Practical Implementation

Choosing the event window:

  • Include enough pre-periods to assess parallel trends (typically 3-5+)

  • Include enough post-periods to capture treatment dynamics

  • Bin or drop endpoints if sample sizes become small

Endpoint binning: With varying treatment timing, not all units contribute to all event times. Common practice:

  • Bin endpoints: $\tau_{-K^-}$ captures all $k \leq -K^-$; $\tau_{K^+}$ captures all $k \geq K^+$

  • Drop endpoints: Estimate only for event times with sufficient observations

Standard errors: Cluster at the level of treatment assignment (typically state, firm, or region).

Example: Minimum Wage Event Study

Extending Card and Krueger's design, Dube, Lester, and Reich (2010) conduct event studies using county-pair comparisons along state borders. For each state minimum wage increase, they trace employment effects from several quarters before to several quarters after:

  • Pre-trends: Employment in affected counties tracks employment in neighboring counties closely before minimum wage changes

  • Post-treatment: Little evidence of employment decline after minimum wage increases

  • Dynamic pattern: No delayed effects emerging over time

This event study approach addresses concerns about state-level confounders by using geographically proximate control counties.


13.4 Staggered Adoption and the Failure of TWFE

The Staggered Adoption Setting

Most real-world DiD applications involve staggered adoption: different units receive treatment at different times. States adopt policies in different years. Firms implement changes at different dates. Hospitals adopt technologies sequentially.

The natural approach is two-way fixed effects (TWFE):

$$Y_{it} = \alpha_i + \gamma_t + \tau \cdot D_{it} + \varepsilon_{it}$$

where $D_{it} = 1$ if unit $i$ is treated at time $t$ (and 0 otherwise), and $\alpha_i$ and $\gamma_t$ are unit and time fixed effects.
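
A sketch of this regression on simulated staggered-adoption data. Because the simulated effects grow with exposure, the TWFE point estimate will generally not match the true average effect among treated cells, previewing the problem developed below. All magnitudes are made up:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
units, periods = np.arange(60), np.arange(10)
df = pd.DataFrame([(i, t) for i in units for t in periods], columns=["unit", "t"])
# Staggered adoption: a third treated at t=3, a third at t=6, a third never.
first = np.select([df["unit"] < 20, df["unit"] < 40], [3, 6], default=99)
df["d"] = (df["t"] >= first).astype(int)
exposure = np.maximum(df["t"] - first + 1, 0)  # periods since adoption
df["y"] = 2 + 0.1 * df["t"] + 0.5 * exposure * df["d"] + rng.normal(0, 1, len(df))

res = smf.ols("y ~ d + C(unit) + C(t)", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["unit"]})
true_att = (0.5 * exposure[df["d"] == 1]).mean()  # avg effect among treated cells
print(f"TWFE: {res.params['d']:.2f}   true ATT: {true_att:.2f}")
```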

For decades, applied researchers assumed $\hat{\tau}^{TWFE}$ identified a weighted average of unit-specific treatment effects. Recent work has shown this is wrong—badly wrong when treatment effects are heterogeneous.

The Problem with TWFE

Goodman-Bacon (2021) provides the key decomposition. The TWFE estimator is a weighted average of all possible 2×2 DiD comparisons in the data. Some of these comparisons are sensible:

  • Clean comparisons: Early-treated vs. never-treated (good)

  • Clean comparisons: Late-treated vs. never-treated (good)

But others are problematic:

  • Forbidden comparisons: Late-treated vs. early-treated, using the early-treated as controls after they have already been treated (bad)

In this forbidden comparison, already-treated units serve as the control group. If treatment effects vary over time (e.g., effects grow after treatment), this comparison yields biased estimates—potentially even the wrong sign.

Theorem 13.1 (Goodman-Bacon Decomposition): The TWFE estimator equals a weighted average of all 2×2 DiD estimators: $$\hat{\tau}^{TWFE} = \sum_{k} \sum_{l \neq k} w_{kl}\, \hat{\tau}_{kl}$$ where $\hat{\tau}_{kl}$ is the 2×2 DiD comparing timing group $k$ to timing group $l$, and weights $w_{kl}$ depend on group sizes and treatment timing variance.

Intuition: TWFE doesn't know which comparisons are valid. It uses all of them, weighting by mechanical factors (sample size, variance) rather than causal relevance.

When Does This Matter?

TWFE problems are severe when:

  1. Treatment effects are heterogeneous across units: Different units have different effect sizes

  2. Treatment effects vary over time: Effects grow, shrink, or change sign over exposure time

  3. Treatment timing is correlated with effect sizes: Early adopters have different effects than late adopters

  4. There are no never-treated units: All comparisons involve already-treated controls

TWFE is less problematic when:

  • Treatment effects are constant across units and over time

  • There are many never-treated units

  • All treatment happens at similar times

Negative Weights

de Chaisemartin and d'Haultfoeuille (2020) show that TWFE assigns negative weights to some treatment effects under heterogeneity. This means:

  • Even if all unit-level effects are positive, TWFE can estimate a negative average effect

  • The estimated "average treatment effect" may not be a proper average of anything

Their diagnostic: Compute the weights TWFE assigns to each unit-time observation. If many weights are negative, TWFE estimates are unreliable.
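
A sketch of that diagnostic for a balanced panel. By the Frisch-Waugh-Lovell theorem, TWFE weights each treated cell in proportion to the two-way-demeaned treatment indicator, so negative demeaned values among treated cells are exactly the negative weights. The design below (two cohorts, no never-treated units) is constructed to produce them:

```python
import numpy as np
import pandas as pd

units, periods = np.arange(60), np.arange(10)
df = pd.DataFrame([(i, t) for i in units for t in periods], columns=["unit", "t"])
# Two cohorts, no never-treated group: half adopt at t=3, half at t=6.
first = np.where(df["unit"] < 30, 3, 6)
df["d"] = (df["t"] >= first).astype(int)

# Two-way demeaning of the treatment indicator (balanced panel):
# d_tilde = d - mean_within_unit(d) - mean_within_period(d) + overall_mean(d).
df["d_tilde"] = (df["d"]
                 - df.groupby("unit")["d"].transform("mean")
                 - df.groupby("t")["d"].transform("mean")
                 + df["d"].mean())

# By FWL, tau_hat = sum(d_tilde * y) / sum(d_tilde * d), so each treated
# cell enters with weight d_tilde / (sum of d_tilde over treated cells);
# these weights sum to one but need not all be positive.
treated = df[df["d"] == 1]
w = treated["d_tilde"] / treated["d_tilde"].sum()
print(f"share of treated cells with negative weight: {(w < 0).mean():.2f}")
# Here the early cohort's cells in periods where everyone is treated (t >= 6)
# receive negative weight.
```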

Figure 13.4: The Staggered Adoption Problem. With staggered treatment timing, TWFE uses already-treated units as "controls" for later-treated units. If treatment effects vary over time or across cohorts, this produces biased estimates—potentially even the wrong sign.
Figure 13.5: TWFE Negative Weights. The left panel shows how early and late adopters follow different treatment trajectories. The right panel displays the implicit TWFE weights—some unit-time observations receive negative weights, meaning their positive treatment effects are subtracted from the overall estimate. This can produce estimates with the wrong sign even when all true effects are positive.

13.5 Modern DiD Estimators

The "DiD revolution" of 2018-2022 produced several estimators that avoid TWFE's problems. All share a common insight: construct valid comparisons explicitly rather than relying on regression mechanics.

Callaway and Sant'Anna (2021)

Key idea: Estimate separate treatment effects for each cohort (defined by treatment timing) at each event time, then aggregate as desired.

Group-time average treatment effects: For units first treated at time $g$, the effect at time $t$ is:

$$ATT(g,t) = E[Y_t(g) - Y_t(0) \mid G_g = 1]$$

where $G_g = 1$ indicates first treatment at time $g$.

Estimation: Each $ATT(g,t)$ is estimated using only:

  • Treated units: Those first treated at time gg

  • Control units: Never-treated units (or not-yet-treated units)

This avoids forbidden comparisons entirely.

Aggregation: Once we have all $ATT(g,t)$ estimates, we can aggregate:

  • By event time: Average effects at each $k = t - g$

  • By cohort: Average effects for each treatment-timing group

  • Overall: Single summary measure

Inference: Simultaneous confidence bands account for multiple comparisons.
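
A sketch of the group-time building block, using never-treated units as controls and period $g - 1$ as the base period. This strips out covariates, aggregation, and inference; the authors' R package (did) implements the full estimator. Data and magnitudes are simulated:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
units, periods = np.arange(90), np.arange(8)
df = pd.DataFrame([(i, t) for i in units for t in periods], columns=["unit", "t"])
# Cohorts: first treated at g=3, g=5, or never (coded g=0).
df["g"] = np.select([df["unit"] < 30, df["unit"] < 60], [3, 5], default=0)
eff = np.where((df["g"] > 0) & (df["t"] >= df["g"]),
               1.0 + 0.5 * (df["t"] - df["g"]), 0.0)
df["y"] = 2 + 0.2 * df["t"] + eff + rng.normal(0, 1, len(df))

def att_gt(df, g, t):
    """ATT(g, t): 2x2 DiD of cohort g vs. never-treated, base period g-1."""
    cohort = df[df["g"] == g]
    never = df[df["g"] == 0]
    d_cohort = (cohort.loc[cohort["t"] == t, "y"].mean()
                - cohort.loc[cohort["t"] == g - 1, "y"].mean())
    d_never = (never.loc[never["t"] == t, "y"].mean()
               - never.loc[never["t"] == g - 1, "y"].mean())
    return d_cohort - d_never

for g in [3, 5]:
    for t in range(g, 8):
        print(f"ATT(g={g}, t={t}) = {att_gt(df, g, t):.2f}")
```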

Sun and Abraham (2021)

Key idea: Saturate the event study regression with cohort-specific effects, then aggregate.

Interaction-weighted estimator: Estimate:

$$Y_{it} = \alpha_i + \gamma_t + \sum_g \sum_{k \neq -1} \tau_{g,k} \cdot \mathbf{1}[G_i = g] \cdot \mathbf{1}[t - g = k] + \varepsilon_{it}$$

This estimates separate event-time effects ($\tau_{g,k}$) for each cohort $g$. The overall event-time effect is then:

$$\hat{\tau}_k = \sum_g \hat{w}_g\, \hat{\tau}_{g,k}$$

where weights $\hat{w}_g$ are chosen by the researcher (e.g., cohort size).

Comparison to Callaway-Sant'Anna: Similar results in most applications. Sun-Abraham is regression-based (familiar syntax); Callaway-Sant'Anna is more explicit about the estimand.

Other Approaches

de Chaisemartin and d'Haultfoeuille (2020): Estimate instantaneous effects using "switching" variation—units that switch treatment status.

Borusyak, Jaravel, and Spiess (2024): Imputation approach. Estimate counterfactual outcomes for treated observations using untreated observations, then compare.

Wooldridge (2021): Extended TWFE with cohort-specific trends. Shows that properly specified TWFE can work, but requires many interaction terms.

Choosing an Estimator

| Consideration | Recommendation |
| --- | --- |
| Standard setting (staggered adoption) | Callaway-Sant'Anna or Sun-Abraham |
| Want regression syntax | Sun-Abraham |
| Want explicit aggregation choices | Callaway-Sant'Anna |
| Concerned about model specification | Doubly robust (Sant'Anna-Zhao) |
| Have panel with switching | de Chaisemartin-d'Haultfoeuille |

In practice, reporting multiple estimators and showing they agree strengthens credibility.


13.6 Extensions and Special Cases

Triple Differences (DDD)

When parallel trends is questionable, adding a third difference can help. Triple differences requires:

  • Two groups (treated vs. control) × two time periods × two subgroups (affected vs. unaffected)

$$\tau^{DDD} = \left[(Y^{T,A}_{post} - Y^{T,A}_{pre}) - (Y^{T,U}_{post} - Y^{T,U}_{pre})\right] - \left[(Y^{C,A}_{post} - Y^{C,A}_{pre}) - (Y^{C,U}_{post} - Y^{C,U}_{pre})\right]$$

where T/C = treated/control region, A/U = affected/unaffected subgroup.
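
A toy calculation with eight hypothetical cell means to fix the formula's mechanics:

```python
# Hypothetical cell means: m[region][subgroup][period].
m = {
    "T": {"A": {"pre": 10.0, "post": 13.0}, "U": {"pre": 8.0, "post": 9.5}},
    "C": {"A": {"pre": 9.0, "post": 10.5}, "U": {"pre": 7.0, "post": 8.5}},
}

def dd(region):
    """Within-region DiD: change for affected minus change for unaffected."""
    a = m[region]["A"]["post"] - m[region]["A"]["pre"]
    u = m[region]["U"]["post"] - m[region]["U"]["pre"]
    return a - u

ddd = dd("T") - dd("C")  # (3.0 - 1.5) - (1.5 - 1.5) = 1.5
print(ddd)
```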

Example: Studying health insurance mandate effects. Compare:

  • Individuals affected by mandate (age-eligible) vs. unaffected (older)

  • In states that implemented vs. didn't implement

  • Before vs. after implementation

Triple differences removes state-specific trends and age-specific trends, requiring only that differential state-by-age trends are parallel.

Warning: The "Parallel Gaps" Assumption in DDD

DDD is often presented as more robust than DiD—"if parallel trends fails, add another difference." But DDD requires its own identifying assumption that is frequently overlooked.

The parallel gaps assumption: The gap between affected and unaffected subgroups must evolve in parallel across treated and control regions. Formally:

$$E[Y^{T,A}(0) - Y^{T,U}(0)]_{post} - E[Y^{T,A}(0) - Y^{T,U}(0)]_{pre} = E[Y^{C,A}(0) - Y^{C,U}(0)]_{post} - E[Y^{C,A}(0) - Y^{C,U}(0)]_{pre}$$

This is not just parallel trends for each group separately. The gap must evolve similarly.

When parallel gaps fails:

  • If treated states were on different economic trajectories that affected subgroups differently

  • If the "unaffected" subgroup in treated states was indirectly affected (spillovers)

  • If composition of affected/unaffected groups changed differentially

Example: Studying minimum wage effects using young (affected) vs. older (unaffected) workers. If treated states had stronger overall growth, this might differentially benefit young workers (whose employment is more cyclical). The gap would widen more in treated states even without the policy—violating parallel gaps.

What to do:

  • Test for parallel pre-treatment gaps, not just parallel pre-trends for each group

  • Consider whether any shock could differentially affect the gap

  • Report DDD alongside simple DiD; if they diverge substantially, investigate why

Synthetic Difference-in-Differences

Arkhangelsky et al. (2021) combine DiD with synthetic control methods:

  1. Reweight control units to match treated units' pre-treatment trends

  2. Reweight time periods to emphasize pre-treatment periods

  3. Estimate treatment effect using the reweighted comparison

This relaxes parallel trends—control units need not be parallel, just reweightable to be parallel.

Continuous Treatment

Standard DiD assumes binary treatment. With continuous treatment intensity:

$$Y_{it} = \alpha_i + \gamma_t + \tau \cdot TreatIntensity_{it} + \varepsilon_{it}$$

Challenges:

  • TWFE problems are even more severe with continuous treatment

  • No clear analogue to "never-treated" group

  • Interpretation shifts from ATE to dose-response

Solutions: de Chaisemartin and d'Haultfoeuille extend their estimator to continuous treatment; Callaway, Goodman-Bacon, and Sant'Anna (2024) develop methods for this setting.


Practical Guidance

When to Use DiD

| Situation | Appropriate? | Notes |
| --- | --- | --- |
| Policy change affecting some units, not others | Yes | Classic DiD setting |
| Staggered policy adoption across units | Yes | Use modern estimators |
| Geographic policy variation | Yes | Consider spatial spillovers |
| Before-after with no control group | No | Cannot separate treatment from time trends |
| Treatment assigned based on outcome trends | No | Violates parallel trends |
| Very different treated and control groups | Maybe | Conditional PT may help; scrutinize carefully |

Inference with Few Clusters

DiD analyses often have few clusters—at most 50 states, and sometimes only 10 treated hospitals or a handful of policy changes. Standard cluster-robust standard errors perform poorly in these settings, leading to over-rejection (too many false positives).

The problem: Cluster-robust variance estimators require many clusters ($G \to \infty$) for their asymptotic justification. With small $G$, they understate uncertainty.

Rule of thumb: Be concerned when:

  • $G < 50$ total clusters, OR

  • Fewer than 10-20 treated clusters, OR

  • Highly unbalanced cluster sizes

Solutions:

1. Wild Cluster Bootstrap (Cameron, Gelbach & Miller 2008)

The wild cluster bootstrap imposes the null hypothesis and constructs the distribution of the test statistic under the null by resampling residuals at the cluster level:
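
A minimal sketch with Rademacher weights, imposing the null on the DiD interaction; for production work, use a vetted implementation (e.g., Stata's boottest or R's fwildclusterboot). The data, cluster count, and magnitudes below are made up:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(8)
# 12 clusters (e.g., states), half treated in the post period.
clusters, periods, n_per = np.arange(12), [0, 1], 30
rows = [(c, p) for c in clusters for p in periods for _ in range(n_per)]
df = pd.DataFrame(rows, columns=["cluster", "post"])
df["treated"] = (df["cluster"] < 6).astype(int)
shock = rng.normal(0, 0.5, 12)  # cluster-level shocks
df["y"] = (1 + df["treated"] + 0.5 * df["post"] + shock[df["cluster"]]
           + rng.normal(0, 1, len(df)))  # true DiD effect is zero here

res = smf.ols("y ~ treated * post", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["cluster"]})
t_obs = res.tvalues["treated:post"]

# Restricted model imposes the null (tau = 0): no interaction term.
res_r = smf.ols("y ~ treated + post", data=df).fit()
fitted, resid = res_r.fittedvalues, res_r.resid

B, exceed = 999, 0
for _ in range(B):
    flips = rng.choice([-1.0, 1.0], size=12)  # Rademacher weights by cluster
    df["y_star"] = fitted + flips[df["cluster"]] * resid
    res_b = smf.ols("y_star ~ treated * post", data=df).fit(
        cov_type="cluster", cov_kwds={"groups": df["cluster"]})
    exceed += abs(res_b.tvalues["treated:post"]) >= abs(t_obs)
print(f"wild cluster bootstrap p-value: {(exceed + 1) / (B + 1):.3f}")
```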

2. Randomization Inference

If treatment was assigned randomly (or quasi-randomly) across clusters, permutation-based inference is valid regardless of the number of clusters:
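
A minimal sketch, permuting which clusters are labeled treated and recomputing the DiD estimate on cluster-level means each time; data and magnitudes are made up:

```python
import numpy as np

rng = np.random.default_rng(9)
clusters, n_treat = np.arange(12), 6
# Simulated cluster-by-period means (pre/post), no true treatment effect.
pre = rng.normal(5, 1, 12)
post = pre + 0.4 + rng.normal(0, 0.3, 12)  # common trend only

def did(treated_ids):
    """DiD on cluster-level means for a given set of treated clusters."""
    tr = np.isin(clusters, treated_ids)
    return (post[tr] - pre[tr]).mean() - (post[~tr] - pre[~tr]).mean()

obs = did(np.arange(n_treat))  # actual assignment: clusters 0-5 treated

perms, exceed = 5000, 0
for _ in range(perms):
    fake = rng.choice(clusters, size=n_treat, replace=False)
    exceed += abs(did(fake)) >= abs(obs)
print(f"randomization inference p-value: {(exceed + 1) / (perms + 1):.3f}")
```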

3. Aggregation to Cluster Level

With very few clusters, analyze data collapsed to cluster-by-time cells. This makes the small-$G$ problem explicit and avoids spurious precision.

Which to use?

| Number of clusters | Recommendation |
| --- | --- |
| $G \geq 50$ | Standard cluster-robust SEs are usually fine |
| $20 \leq G < 50$ | Use wild cluster bootstrap; compare to cluster-robust |
| $G < 20$ | Wild cluster bootstrap essential; consider aggregation |
| $G < 10$ | Aggregation or randomization inference; bootstrap may fail |

Warning: Papers with few treated states/regions and standard cluster SEs should be viewed skeptically. The minimum wage literature, for example, has wrestled with this—studies using variation across ~50 states are far more credible than those using a handful of state policy changes.

Common Pitfalls

Pitfall 1: Trusting TWFE with staggered adoption. TWFE can produce severely biased estimates—even wrong-signed—when treatment effects are heterogeneous. This is the norm, not the exception.

How to avoid: Use modern estimators (Callaway-Sant'Anna, Sun-Abraham). Run Goodman-Bacon decomposition to diagnose TWFE problems. Report both TWFE and modern estimates; if they differ substantially, trust the modern ones.

Pitfall 2: Cherry-picking the control group. Researchers sometimes choose control groups that show parallel pre-trends, dropping controls that diverge. This is specification searching that invalidates inference.

How to avoid: Pre-specify the control group based on substantive criteria (geography, industry, demographics), not pre-trend fit. Document the choice before examining results.

Pitfall 3: Conditioning on post-treatment variables. Including covariates measured after treatment can induce bias by controlling for outcomes of treatment.

How to avoid: Only condition on pre-treatment (baseline) covariates.

Pitfall 4: Ignoring anticipation effects. If units adjust behavior before treatment takes effect (e.g., firms change hiring before a minimum wage increase), effects appear in pre-treatment periods.

How to avoid: Consider whether anticipation is plausible. If so, allow for anticipation in the event study specification (e.g., set $k = -2$ as the reference period).

Pitfall 5: Spillovers between treated and control. If treatment affects control units (e.g., minimum wage in NJ affects PA restaurants near the border), DiD is biased.

How to avoid: Consider whether spillovers are plausible. Test by examining control units near vs. far from treated units.

Implementation Checklist

  • Plot raw outcome paths for treated and control groups before running any regression

  • Estimate an event study; report pre-treatment coefficients, a joint test, and their magnitudes relative to the estimated effects

  • With staggered adoption, use a modern estimator (Callaway-Sant'Anna, Sun-Abraham); report TWFE alongside and diagnose differences (Goodman-Bacon decomposition, negative-weight checks)

  • Cluster standard errors at the level of treatment assignment; with few clusters, use the wild cluster bootstrap or randomization inference

  • Condition only on pre-treatment (baseline) covariates

  • Pre-specify the control group on substantive grounds, not pre-trend fit

  • Assess whether anticipation effects or spillovers between treated and control units are plausible


Qualitative Bridge

How Qualitative Methods Complement DiD

DiD identifies an average treatment effect under parallel trends, but leaves key questions unanswered:

  • Why did some units adopt treatment and others not?

  • How did treatment produce its effects?

  • Why might parallel trends hold (or fail)?

Qualitative research can address these gaps.

When to Combine

Understanding treatment assignment: Interviews with policymakers, analysis of legislative debates, and process tracing can reveal why some states adopted a policy and others did not. If selection was driven by factors unrelated to outcome trends, parallel trends is more plausible.

Mechanism exploration: DiD tells us minimum wage increases didn't reduce employment. Qualitative research—employer interviews, observation of workplace dynamics—can reveal why: did firms raise prices, reduce profits, increase productivity, or substitute toward different workers?

Assessing external validity: Case studies of specific treated units help understand whether findings generalize. What makes New Jersey's experience with minimum wage relevant (or not) for other states?

Example: Minimum Wage Research

Card and Krueger's quantitative finding sparked intensive qualitative investigation:

  • Employer interviews revealed adjustment mechanisms (reduced turnover, price increases) that reconciled findings with theory

  • Analysis of policy debates showed minimum wage increases were driven by political factors, not economic conditions—supporting parallel trends

  • Industry case studies identified heterogeneity: effects differed by restaurant type, location, and competitive environment

This triangulation strengthened the overall evidence base beyond what DiD alone could provide.


Integration Note

Connections to Other Methods

| Method | Relationship | See Chapter |
| --- | --- | --- |
| Instrumental Variables | DiD as instrument for endogenous treatment intensity | Ch. 12 |
| Regression Discontinuity | RD in time (sharp change at treatment date) | Ch. 14 |
| Synthetic Control | Alternative when single treated unit; can combine | Ch. 15 |
| Selection on Observables | Conditional DiD uses similar propensity score methods | Ch. 11 |

Triangulation Strategies

DiD estimates gain credibility when combined with:

  1. Different control groups: Do results hold with alternative comparison units?

  2. Different outcome variables: Do related outcomes show consistent patterns?

  3. Different time windows: Are results robust to expanding or shifting the analysis period?

  4. Alternative estimators: Do Callaway-Sant'Anna, Sun-Abraham, and TWFE agree?

  5. Synthetic control: For small numbers of treated units, does SCM yield similar conclusions?

  6. Qualitative evidence: Do interviews and case studies support the mechanism?


Running Example: China's Special Economic Zones

China's Special Economic Zones (SEZs) provide a staggered DiD setting: different cities received SEZ status at different times starting in 1980, with waves in 1984, 1988, 1992, and later years.

Research question: What was the effect of SEZ designation on local economic growth?

DiD setup:

  • Treated units: Cities designated as SEZs

  • Control units: Similar cities without SEZ status

  • Treatment timing: Varies by city (staggered adoption)

  • Outcomes: GDP, employment, investment, exports

Challenges illustrating this chapter's themes:

  1. Selection: SEZ designation was not random—initial zones were in strategic coastal locations. Why were these cities chosen? Does selection violate parallel trends?

  2. Staggered adoption: Different waves may have faced different economic environments. TWFE would use early SEZs as controls for later SEZs—problematic if SEZ effects grew over time.

  3. Heterogeneous effects: SEZs near Hong Kong (Shenzhen) likely had different effects than inland SEZs designated in the 1990s.

  4. Spillovers: SEZs may have affected neighboring non-SEZ cities, contaminating the control group.

Modern approach: Wang (2013) and subsequent work apply careful DiD methods:

  • Use never-SEZ cities as controls (avoiding forbidden comparisons)

  • Examine pre-trends in growth rates

  • Estimate heterogeneous effects by SEZ cohort

  • Test for spillovers to nearby cities

Findings: SEZ designation substantially increased local GDP and exports, with effects persisting and growing over time. Effects were largest for early SEZs with access to Hong Kong capital and expertise.

Methodological lesson: The China SEZ case illustrates both DiD's power (exploiting policy variation for causal inference) and its challenges (selection into treatment, heterogeneous effects, spillovers). Credible analysis requires modern estimators and careful attention to identification threats.


Summary

Key takeaways:

  1. DiD identifies causal effects by comparing changes: The difference between treated and control group changes removes time-invariant confounders and common time trends.

  2. Parallel trends is the key assumption: The control group's trajectory must reflect the treated group's counterfactual. This is untestable but can be probed through pre-trends analysis.

  3. TWFE fails with staggered adoption and heterogeneous effects: The standard approach uses "forbidden comparisons" with already-treated units as controls, potentially producing wrong-signed estimates.

  4. Modern estimators solve the TWFE problem: Callaway-Sant'Anna, Sun-Abraham, and related methods construct valid comparisons explicitly. These should be standard practice for staggered adoption settings.

  5. Event studies visualize dynamics and assess parallel trends: Plotting coefficients by event time reveals both pre-trends (for diagnostics) and the dynamic path of treatment effects.

Returning to the opening question: When a policy changes in one place but not another, we can use the comparison to identify the policy's causal effect—but only if the comparison group's trajectory reveals the treated group's counterfactual. This requires parallel trends, careful attention to treatment timing and heterogeneity, and (in modern practice) estimators designed to handle the complications that arise in real policy settings.


Further Reading

Essential

  • Cunningham (2021), Causal Inference: The Mixtape, Chapter 9 - Accessible introduction with examples

  • Roth et al. (2023), "What's Trending in Difference-in-Differences?" - Comprehensive survey of recent developments

For Deeper Understanding

  • Goodman-Bacon (2021), "Difference-in-Differences with Variation in Treatment Timing" - The key decomposition result explaining TWFE problems

  • Callaway and Sant'Anna (2021), "Difference-in-Differences with Multiple Time Periods" - Modern estimator and aggregation

  • Sun and Abraham (2021), "Estimating Dynamic Treatment Effects in Event Studies with Heterogeneous Treatment Effects" - Interaction-weighted estimator

Advanced/Specialized

  • de Chaisemartin and d'Haultfoeuille (2020), "Two-Way Fixed Effects Estimators with Heterogeneous Treatment Effects" - Negative weights and alternative estimator

  • Sant'Anna and Zhao (2020), "Doubly Robust Difference-in-Differences Estimators" - Combining outcome and propensity models

  • Arkhangelsky et al. (2021), "Synthetic Difference-in-Differences" - Combining DiD with synthetic controls

Applications

  • Card and Krueger (1994), "Minimum Wages and Employment" - The canonical DiD application

  • Dube, Lester, and Reich (2010), "Minimum Wage Effects Across State Borders" - County-pair design with event studies

  • Cengiz et al. (2019), "The Effect of Minimum Wages on Low-Wage Jobs" - Bunching estimator approach


Exercises

Conceptual

  1. Explain why the parallel trends assumption is fundamentally untestable. What can researchers do to make it more plausible, and what are the limitations of these approaches?

  2. Consider a state that raises its minimum wage because unemployment has been falling rapidly and the legislature believes workers can now command higher wages. Would DiD using other states as controls identify the causal effect of minimum wage on employment? Why or why not?

  3. In the Goodman-Bacon decomposition, what makes a comparison "forbidden"? Give an example of how such a comparison could yield a wrong-signed estimate.

Applied

  1. Download state-level minimum wage and employment data. Implement both TWFE and Callaway-Sant'Anna estimators for a recent period of minimum wage changes. Do the estimates differ? Produce a Goodman-Bacon decomposition to diagnose the sources of any differences.

  2. Create an event study plot for a policy change of your choosing. Discuss what the pre-trends suggest about parallel trends and what the post-treatment dynamics reveal about the effect's evolution.

Discussion

  1. Some researchers argue that the DiD revolution has made causal inference harder by showing that standard methods were problematic. Others argue it has made inference easier by providing tools to address these problems. Which view do you find more compelling, and why?
