Chapter 13: Difference-in-Differences

Opening Question

When a policy changes in one place but not another, how can we use the comparison to learn about the policy's causal effect?


Chapter Overview

Difference-in-differences (DiD) is perhaps the most widely used quasi-experimental method in applied economics. The core idea is elegantly simple: compare the change in outcomes over time for a group affected by a treatment to the change for a group that was not affected. If both groups would have evolved similarly in the absence of treatment, the difference in their changes identifies the causal effect.

This simplicity is deceptive. The parallel trends assumption—that treated and control groups would have followed the same trajectory absent treatment—is fundamentally untestable. Moreover, recent methodological work has revealed that the standard two-way fixed effects (TWFE) estimator can produce severely biased estimates when treatment effects vary across units or over time, precisely the settings where DiD is most commonly applied. This chapter develops both the classical DiD framework and the modern solutions to these problems.

What you will learn:

  • The logic of DiD and the parallel trends assumption

  • How to assess parallel trends and conduct event study analysis

  • Why TWFE fails with staggered adoption and heterogeneous effects

  • Modern estimators that address these problems (Callaway-Sant'Anna, Sun-Abraham, and others)

  • When DiD is and is not a credible identification strategy

Prerequisites: Chapter 9 (Causal Framework), Chapter 3 (Statistical Foundations)


13.1 The Basic Difference-in-Differences Setup

The Canonical 2×2 Case

Consider the simplest DiD setting: two groups observed in two time periods, where one group receives treatment between periods and the other does not. Let $Y_{it}$ denote the outcome for unit $i$ at time $t$, with $t \in \{0, 1\}$ (pre and post) and groups $G \in \{0, 1\}$ (control and treated).

The DiD estimator is:

$$\hat{\tau}^{DiD} = (\bar{Y}_{1,1} - \bar{Y}_{1,0}) - (\bar{Y}_{0,1} - \bar{Y}_{0,0})$$

where $\bar{Y}_{g,t}$ is the average outcome for group $g$ in period $t$. This is the change in the treated group minus the change in the control group—the "difference in differences."

Definition 13.1 (Parallel Trends Assumption): In the absence of treatment, the average outcomes for treated and control groups would have followed parallel paths over time: $$E[Y_{i1}(0) - Y_{i0}(0) \mid G_i = 1] = E[Y_{i1}(0) - Y_{i0}(0) \mid G_i = 0]$$ where $Y_{it}(0)$ denotes the potential outcome without treatment.

Intuition: The control group's change over time tells us what would have happened to the treated group had they not been treated. We're not assuming the groups are identical—they can have different levels—only that they would have changed by the same amount.
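
To fix the mechanics, here is a minimal sketch in Python that computes the four cell means and the DiD estimate by hand. The data are simulated and the effect size (2.0) is made up:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Simulated 2x2 data: 'group' (1 = treated), 'post' (1 = post-period).
n = 2000
group = rng.integers(0, 2, n)
post = rng.integers(0, 2, n)
# Outcome: different levels by group, a common time trend, and a
# true treatment effect of 2.0 in the treated-post cell.
y = 10 + 3 * group + 1.5 * post + 2.0 * group * post + rng.normal(0, 1, n)
df = pd.DataFrame({"group": group, "post": post, "y": y})

# Four cell means, then the difference in differences.
cells = df.groupby(["group", "post"])["y"].mean()
did = (cells[1, 1] - cells[1, 0]) - (cells[0, 1] - cells[0, 0])
print(f"DiD estimate: {did:.2f}")  # should be near the true effect, 2.0
```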

Regression Implementation

The 2×2 DiD can be estimated via regression:

$$Y_{it} = \alpha + \beta \cdot Treated_i + \gamma \cdot Post_t + \tau \cdot (Treated_i \times Post_t) + \varepsilon_{it}$$

where:

  • $\alpha$ = baseline mean for control group in pre-period

  • $\beta$ = level difference between treated and control groups (pre-treatment)

  • $\gamma$ = time trend common to both groups

  • $\tau$ = the DiD estimate (coefficient of interest)

This regression formulation extends naturally to include covariates and to settings with more groups and time periods.
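
A sketch of the regression implementation with statsmodels, again on simulated data. The `treated * post` formula term expands to the two main effects plus the interaction, whose coefficient is the DiD estimate:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 2000
df = pd.DataFrame({
    "treated": rng.integers(0, 2, n),
    "post": rng.integers(0, 2, n),
})
df["y"] = (10 + 3 * df["treated"] + 1.5 * df["post"]
           + 2.0 * df["treated"] * df["post"] + rng.normal(0, 1, n))

# treated * post expands to treated + post + treated:post.
res = smf.ols("y ~ treated * post", data=df).fit()
print(res.params)  # Intercept=alpha, treated=beta, post=gamma, treated:post=tau
```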

Example: Card and Krueger's Minimum Wage Study

The most famous DiD application is Card and Krueger's (1994) study of New Jersey's 1992 minimum wage increase. New Jersey raised its minimum wage from $4.25 to $5.05 per hour; neighboring Pennsylvania did not change its minimum wage.

Setup:

  • Treated group: Fast food restaurants in New Jersey

  • Control group: Fast food restaurants in eastern Pennsylvania

  • Pre-period: February 1992 (before NJ increase)

  • Post-period: November 1992 (after NJ increase)

  • Outcome: Employment (full-time equivalent employees)

Results:

| FTE employment | NJ (Treated) | PA (Control) | Difference |
| --- | --- | --- | --- |
| Before | 20.44 | 23.33 | -2.89 |
| After | 21.03 | 21.17 | -0.14 |
| Change | +0.59 | -2.16 | +2.75 |

The DiD estimate of +2.75 suggests the minimum wage increase raised employment, contradicting the standard competitive model prediction. This finding sparked decades of debate and methodological refinement.

Why DiD here? Simple before-after comparison in NJ would confound the minimum wage effect with any other changes occurring between February and November 1992 (seasonal effects, macroeconomic conditions). The PA comparison removes these common trends.

Figure 13.1: The Basic Difference-in-Differences Design. The treated group's outcome jumps at treatment; the control group follows its trend. The DiD estimate equals the treated group's change minus the control group's change. The dashed line shows the counterfactual path for the treated group under parallel trends.

13.2 The Parallel Trends Assumption

Parallel trends is an assumption about counterfactual outcomes—what would have happened to the treated group absent treatment. This is fundamentally untestable because we never observe the treated group's counterfactual post-treatment trajectory.

What parallel trends does NOT require:

  • That treated and control groups have similar outcome levels

  • That groups have identical characteristics

  • That the groups would have the same outcomes in any period

What parallel trends DOES require:

  • That in the absence of treatment, both groups would have experienced the same change in outcomes

  • That any time-varying factors affecting outcomes are either common to both groups or unrelated to treatment assignment

Assessing Pre-Trends

While we cannot test parallel trends directly, we can examine whether outcomes evolved similarly before treatment. If pre-treatment trends diverge, parallel trends in the post-period is implausible.

Implementation: Estimate an event study specification (Section 13.3) and examine coefficients for pre-treatment periods. Statistically significant pre-trends are a red flag.

Limitations of pre-trends testing:

  1. Statistical power: Failure to reject parallel pre-trends may reflect low power rather than true parallel trends

  2. Anticipation effects: Divergence just before treatment may reflect anticipation rather than pre-existing trends

  3. Non-linear trends: Parallel linear pre-trends don't guarantee parallel post-trends if true trajectories are non-linear

  4. Ashenfelter's dip: In some settings (e.g., job training programs), treated units experience a dip just before treatment that would reverse even absent treatment

Pitfall: The Pre-Trends Fallacy. Showing that pre-treatment coefficients are individually insignificant does not validate parallel trends. A joint test of all pre-treatment coefficients provides more power. Even better: present the pre-trends visually and discuss their magnitude relative to estimated effects.
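
A sketch of such a joint test on simulated data; the dummy names (`lead2`, `lead3`, `lag0`, ...) and all magnitudes are illustrative:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
# Toy panel: 200 units, periods -3..2, treatment at period 0 for half the units.
units, periods = np.arange(200), np.arange(-3, 3)
df = pd.DataFrame([(i, t) for i in units for t in periods], columns=["unit", "t"])
df["treated"] = (df["unit"] < 100).astype(int)
df["y"] = (5 + 2 * df["treated"] + 0.5 * df["t"]
           + 1.5 * df["treated"] * (df["t"] >= 0)
           + rng.normal(0, 1, len(df)))

# Lead (pre-treatment) and lag (post) dummies, omitting k = -1.
for k in [-3, -2, 0, 1, 2]:
    name = f"lead{-k}" if k < 0 else f"lag{k}"
    df[name] = ((df["t"] == k) & (df["treated"] == 1)).astype(int)

res = smf.ols("y ~ C(unit) + C(t) + lead3 + lead2 + lag0 + lag1 + lag2",
              data=df).fit()
# Joint test that all pre-treatment coefficients are zero.
print(res.f_test("lead3 = 0, lead2 = 0"))
```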

Figure 13.2: Parallel Trends—Valid vs. Violated. Left panel shows parallel pre-trends, supporting the DiD identification strategy. Right panel shows diverging pre-trends, indicating the treated group was already on a different trajectory before treatment—the DiD estimate would be biased.

Conditional Parallel Trends

When unconditional parallel trends is implausible, we may appeal to a weaker assumption:

Definition 13.2 (Conditional Parallel Trends): Parallel trends holds after conditioning on observable covariates $X_i$: $$E[Y_{i1}(0) - Y_{i0}(0) \mid G_i = 1, X_i] = E[Y_{i1}(0) - Y_{i0}(0) \mid G_i = 0, X_i]$$

Implementation approaches:

  • Regression with covariates: Include $X_i$ in the DiD regression (but only baseline covariates, not post-treatment variables); a sketch follows below

  • Matching/reweighting: Match treated and control units on $X_i$ before computing DiD

  3. Doubly robust DiD: Combine outcome modeling and propensity score weighting (Sant'Anna and Zhao, 2020)

Which covariates to include? Include variables that:

  • Predict outcome trends (not just levels)

  • Differ between treated and control groups

  • Are measured pre-treatment (never condition on post-treatment variables)
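
A minimal sketch of the regression approach (item 1 above), interacting a baseline covariate with the post-period indicator so that trends may differ by $X_i$. The covariate name and all magnitudes are made up:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 4000
df = pd.DataFrame({
    "treated": rng.integers(0, 2, n),
    "post": rng.integers(0, 2, n),
})
# Baseline covariate that both predicts trends and differs by treatment status.
df["x_base"] = rng.normal(0, 1, n) + 0.5 * df["treated"]
df["y"] = (10 + 2 * df["treated"] + 1.0 * df["post"]
           + 0.8 * df["x_base"] * df["post"]   # covariate-specific trend
           + 1.5 * df["treated"] * df["post"]  # true effect
           + rng.normal(0, 1, n))

# Allow the time trend to vary with the baseline covariate.
res = smf.ols("y ~ treated * post + x_base + x_base:post", data=df).fit()
print(res.params["treated:post"])  # close to 1.5 once x-trends are modeled
```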

When Is Parallel Trends Plausible?

Parallel trends is most plausible when:

  1. Treatment is as-if random conditional on fixed effects: Policy variation driven by idiosyncratic factors unrelated to outcome trends

  2. Treated and control units are similar: Geographic neighbors, same industry, comparable demographics

  3. Treatment timing is plausibly exogenous: Not driven by anticipation of differential trends

  4. Pre-trends are convincingly parallel: Multiple pre-periods show similar evolution

Parallel trends is least plausible when:

  1. Selection into treatment: Units adopt treatment because of anticipated effects

  2. Treated units are structurally different: Different industries, demographics, or growth trajectories

  3. Treatment timing is endogenous: Early adopters differ systematically from late adopters

  4. Only one pre-period: Cannot assess pre-existing trends


13.3 Event Studies

From DiD to Dynamic Treatment Effects

The basic 2×2 DiD assumes a constant treatment effect. In practice, effects may:

  • Build up gradually over time

  • Fade out after initial impact

  • Differ before and after treatment (anticipation effects)

Event study designs generalize DiD to trace out the dynamic path of treatment effects.

The Event Study Specification

For a setting where all treated units receive treatment at the same time $t^*$, the event study specification is:

$$Y_{it} = \alpha_i + \gamma_t + \sum_{k \neq -1} \tau_k \cdot \mathbf{1}[t - t^* = k] \cdot Treated_i + \varepsilon_{it}$$

where:

  • $\alpha_i$ = unit fixed effects

  • $\gamma_t$ = time fixed effects

  • $k$ = event time (periods relative to treatment)

  • $\tau_k$ = effect at event time $k$

  • The omitted category ($k = -1$) normalizes effects relative to the period just before treatment

Interpreting the coefficients:

  • $\tau_k$ for $k < 0$: Pre-treatment differences (should be approximately zero under parallel trends)

  • $\tau_0$: Immediate treatment effect

  • $\tau_k$ for $k > 0$: Post-treatment effects (can show dynamics)
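
A sketch of the full pipeline on simulated data: build event-time dummies (omitting $k = -1$), fit with unit and time fixed effects, cluster by unit, and plot the coefficients. All names and magnitudes are illustrative:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)
units, periods, t_star = np.arange(100), np.arange(8), 4
df = pd.DataFrame([(i, t) for i in units for t in periods], columns=["unit", "t"])
df["treated"] = (df["unit"] < 50).astype(int)
df["k"] = df["t"] - t_star  # event time
# True effect grows with exposure: 1.0, 1.5, 2.0, 2.5 at k = 0..3.
effect = np.where((df["treated"] == 1) & (df["k"] >= 0), 1.0 + 0.5 * df["k"], 0.0)
df["y"] = 3 + df["treated"] + 0.3 * df["t"] + effect + rng.normal(0, 1, len(df))

ks = [k for k in sorted(df["k"].unique()) if k != -1]  # omit k = -1
dummies = []
for k in ks:
    name = f"ev_{k}".replace("-", "m")  # e.g., ev_m2 for k = -2
    df[name] = ((df["k"] == k) & (df["treated"] == 1)).astype(int)
    dummies.append(name)

res = smf.ols("y ~ C(unit) + C(t) + " + " + ".join(dummies), data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["unit"]})

coefs = [res.params[d] for d in dummies]
ses = [res.bse[d] for d in dummies]
plt.errorbar(ks, coefs, yerr=[1.96 * s for s in ses], fmt="o", capsize=3)
plt.axvline(-0.5, linestyle="--")  # treatment occurs between k = -1 and k = 0
plt.axhline(0, linewidth=0.5)
plt.xlabel("Event time k")
plt.ylabel("Estimated effect (relative to k = -1)")
plt.show()
```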

Visualization

Event study plots are among the most informative figures in applied economics:

A well-executed event study shows:

  • Point estimates for each event time

  • Confidence intervals (95%)

  • Clear marking of the treatment date

  • Pre-trends hovering around zero

  • Post-treatment effects revealing the dynamic pattern

Figure 13.3: Event Study Design. Pre-treatment coefficients should be near zero (testing parallel trends). Post-treatment coefficients trace out the dynamic treatment effect. The reference period ($k = -1$) is normalized to zero. Confidence intervals widen further from the event.

Practical Implementation

Choosing the event window:

  • Include enough pre-periods to assess parallel trends (typically 3-5+)

  • Include enough post-periods to capture treatment dynamics

  • Bin or drop endpoints if sample sizes become small

Endpoint binning: With varying treatment timing, not all units contribute to all event times. Common practice:

  • Bin endpoints: $\tau_{-K^-}$ captures all $k \leq -K^-$; $\tau_{K^+}$ captures all $k \geq K^+$

  • Drop endpoints: Estimate only for event times with sufficient observations

Standard errors: Cluster at the level of treatment assignment (typically state, firm, or region).

Example: Minimum Wage Event Study

Extending Card and Krueger's design, Dube, Lester, and Reich (2010) conduct event studies using county-pair comparisons along state borders. For each state minimum wage increase, they trace employment effects from several quarters before to several quarters after:

  • Pre-trends: Employment in affected counties tracks employment in neighboring counties closely before minimum wage changes

  • Post-treatment: Little evidence of employment decline after minimum wage increases

  • Dynamic pattern: No delayed effects emerging over time

This event study approach addresses concerns about state-level confounders by using geographically proximate control counties.


13.4 Staggered Adoption and the Failure of TWFE

The Staggered Adoption Setting

Most real-world DiD applications involve staggered adoption: different units receive treatment at different times. States adopt policies in different years. Firms implement changes at different dates. Hospitals adopt technologies sequentially.

The natural approach is two-way fixed effects (TWFE):

$$Y_{it} = \alpha_i + \gamma_t + \tau \cdot D_{it} + \varepsilon_{it}$$

where $D_{it} = 1$ if unit $i$ is treated at time $t$ (and 0 otherwise), and $\alpha_i$ and $\gamma_t$ are unit and time fixed effects.
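
A sketch of this regression on simulated staggered-adoption data. Because the simulated effects grow with exposure, the TWFE point estimate will generally not match the true average effect among treated cells, previewing the problem developed below. All magnitudes are made up:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
units, periods = np.arange(60), np.arange(10)
df = pd.DataFrame([(i, t) for i in units for t in periods], columns=["unit", "t"])
# Staggered adoption: a third treated at t=3, a third at t=6, a third never.
first = np.select([df["unit"] < 20, df["unit"] < 40], [3, 6], default=99)
df["d"] = (df["t"] >= first).astype(int)
exposure = np.maximum(df["t"] - first + 1, 0)  # periods since adoption
df["y"] = 2 + 0.1 * df["t"] + 0.5 * exposure * df["d"] + rng.normal(0, 1, len(df))

res = smf.ols("y ~ d + C(unit) + C(t)", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["unit"]})
true_att = (0.5 * exposure[df["d"] == 1]).mean()  # avg effect among treated cells
print(f"TWFE: {res.params['d']:.2f}   true ATT: {true_att:.2f}")
```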

For decades, applied researchers assumed $\hat{\tau}^{TWFE}$ identified a weighted average of unit-specific treatment effects. Recent work has shown this is wrong—badly wrong when treatment effects are heterogeneous.

The Problem with TWFE

Goodman-Bacon (2021) provides the key decomposition. The TWFE estimator is a weighted average of all possible 2×2 DiD comparisons in the data. Some of these comparisons are sensible:

  • Clean comparisons: Early-treated vs. never-treated (good)

  • Clean comparisons: Late-treated vs. never-treated (good)

But others are problematic:

  • Forbidden comparisons: Late-treated vs. early-treated, using the early-treated as controls after they have already been treated (bad)

In this forbidden comparison, already-treated units serve as the control group. If treatment effects vary over time (e.g., effects grow after treatment), this comparison yields biased estimates—potentially even the wrong sign.

Theorem 13.1 (Goodman-Bacon Decomposition): The TWFE estimator equals a weighted average of all 2×2 DiD estimators: $$\hat{\tau}^{TWFE} = \sum_{k} \sum_{l \neq k} w_{kl}\, \hat{\tau}_{kl}$$ where $\hat{\tau}_{kl}$ is the 2×2 DiD comparing timing group $k$ to timing group $l$, and weights $w_{kl}$ depend on group sizes and treatment timing variance.

Intuition: TWFE doesn't know which comparisons are valid. It uses all of them, weighting by mechanical factors (sample size, variance) rather than causal relevance.

When Does This Matter?

TWFE problems are severe when:

  1. Treatment effects are heterogeneous across units: Different units have different effect sizes

  2. Treatment effects vary over time: Effects grow, shrink, or change sign over exposure time

  3. Treatment timing is correlated with effect sizes: Early adopters have different effects than late adopters

  4. There are no never-treated units: All comparisons involve already-treated controls

TWFE is less problematic when:

  • Treatment effects are constant across units and over time

  • There are many never-treated units

  • All treatment happens at similar times

Negative Weights

de Chaisemartin and d'Haultfoeuille (2020) show that TWFE assigns negative weights to some treatment effects under heterogeneity. This means:

  • Even if all unit-level effects are positive, TWFE can estimate a negative average effect

  • The estimated "average treatment effect" may not be a proper average of anything

Their diagnostic: Compute the weights TWFE assigns to each unit-time observation. If many weights are negative, TWFE estimates are unreliable.
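
A sketch of that diagnostic for a balanced panel. By the Frisch-Waugh-Lovell theorem, TWFE weights each treated cell in proportion to the two-way-demeaned treatment indicator, so negative demeaned values among treated cells are exactly the negative weights. The design below (two cohorts, no never-treated units) is constructed to produce them:

```python
import numpy as np
import pandas as pd

units, periods = np.arange(60), np.arange(10)
df = pd.DataFrame([(i, t) for i in units for t in periods], columns=["unit", "t"])
# Two cohorts, no never-treated group: half adopt at t=3, half at t=6.
first = np.where(df["unit"] < 30, 3, 6)
df["d"] = (df["t"] >= first).astype(int)

# Two-way demeaning of the treatment indicator (balanced panel):
# d_tilde = d - mean_within_unit(d) - mean_within_period(d) + overall_mean(d).
df["d_tilde"] = (df["d"]
                 - df.groupby("unit")["d"].transform("mean")
                 - df.groupby("t")["d"].transform("mean")
                 + df["d"].mean())

# By FWL, tau_hat = sum(d_tilde * y) / sum(d_tilde * d), so each treated
# cell enters with weight d_tilde / (sum of d_tilde over treated cells);
# these weights sum to one but need not all be positive.
treated = df[df["d"] == 1]
w = treated["d_tilde"] / treated["d_tilde"].sum()
print(f"share of treated cells with negative weight: {(w < 0).mean():.2f}")
# Here the early cohort's cells in periods where everyone is treated (t >= 6)
# receive negative weight.
```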

Figure 13.4: The Staggered Adoption Problem. With staggered treatment timing, TWFE uses already-treated units as "controls" for later-treated units. If treatment effects vary over time or across cohorts, this produces biased estimates—potentially even the wrong sign.
Figure 13.5: TWFE Negative Weights. The left panel shows how early and late adopters follow different treatment trajectories. The right panel displays the implicit TWFE weights—some unit-time observations receive negative weights, meaning their positive treatment effects are subtracted from the overall estimate. This can produce estimates with the wrong sign even when all true effects are positive.

13.5 Modern DiD Estimators

The "DiD revolution" of 2018-2022 produced several estimators that avoid TWFE's problems. All share a common insight: construct valid comparisons explicitly rather than relying on regression mechanics.

Callaway and Sant'Anna (2021)

Key idea: Estimate separate treatment effects for each cohort (defined by treatment timing) at each event time, then aggregate as desired.

Group-time average treatment effects: For units first treated at time $g$, the effect at time $t$ is:

$$ATT(g,t) = E[Y_t(g) - Y_t(0) \mid G_g = 1]$$

where $G_g = 1$ indicates first treatment at time $g$.

Estimation: Each $ATT(g,t)$ is estimated using only:

  • Treated units: Those first treated at time gg

  • Control units: Never-treated units (or not-yet-treated units)

This avoids forbidden comparisons entirely.

Aggregation: Once we have all $ATT(g,t)$ estimates, we can aggregate:

  • By event time: Average effects at each $k = t - g$

  • By cohort: Average effects for each treatment-timing group

  • Overall: Single summary measure

Inference: Simultaneous confidence bands account for multiple comparisons.
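
A sketch of the group-time building block, using never-treated units as controls and period $g - 1$ as the base period. This strips out covariates, aggregation, and inference; the authors' R package (did) implements the full estimator. Data and magnitudes are simulated:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
units, periods = np.arange(90), np.arange(8)
df = pd.DataFrame([(i, t) for i in units for t in periods], columns=["unit", "t"])
# Cohorts: first treated at g=3, g=5, or never (coded g=0).
df["g"] = np.select([df["unit"] < 30, df["unit"] < 60], [3, 5], default=0)
eff = np.where((df["g"] > 0) & (df["t"] >= df["g"]),
               1.0 + 0.5 * (df["t"] - df["g"]), 0.0)
df["y"] = 2 + 0.2 * df["t"] + eff + rng.normal(0, 1, len(df))

def att_gt(df, g, t):
    """ATT(g, t): 2x2 DiD of cohort g vs. never-treated, base period g-1."""
    cohort = df[df["g"] == g]
    never = df[df["g"] == 0]
    d_cohort = (cohort.loc[cohort["t"] == t, "y"].mean()
                - cohort.loc[cohort["t"] == g - 1, "y"].mean())
    d_never = (never.loc[never["t"] == t, "y"].mean()
               - never.loc[never["t"] == g - 1, "y"].mean())
    return d_cohort - d_never

for g in [3, 5]:
    for t in range(g, 8):
        print(f"ATT(g={g}, t={t}) = {att_gt(df, g, t):.2f}")
```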

Sun and Abraham (2021)

Key idea: Saturate the event study regression with cohort-specific effects, then aggregate.

Interaction-weighted estimator: Estimate:

$$Y_{it} = \alpha_i + \gamma_t + \sum_g \sum_{k \neq -1} \tau_{g,k} \cdot \mathbf{1}[G_i = g] \cdot \mathbf{1}[t - g = k] + \varepsilon_{it}$$

This estimates separate event-time effects ($\tau_{g,k}$) for each cohort $g$. The overall event-time effect is then:

$$\hat{\tau}_k = \sum_g \hat{w}_g\, \hat{\tau}_{g,k}$$

where weights $\hat{w}_g$ are chosen by the researcher (e.g., cohort size).

Comparison to Callaway-Sant'Anna: Similar results in most applications. Sun-Abraham is regression-based (familiar syntax); Callaway-Sant'Anna is more explicit about the estimand.

Other Approaches

de Chaisemartin and d'Haultfoeuille (2020): Estimate instantaneous effects using "switching" variation—units that switch treatment status.

Borusyak, Jaravel, and Spiess (2024): Imputation approach. Estimate counterfactual outcomes for treated observations using untreated observations, then compare.

Wooldridge (2021): Extended TWFE with cohort-specific trends. Shows that properly specified TWFE can work, but requires many interaction terms.

Choosing an Estimator

| Consideration | Recommendation |
| --- | --- |
| Standard setting (staggered adoption) | Callaway-Sant'Anna or Sun-Abraham |
| Want regression syntax | Sun-Abraham |
| Want explicit aggregation choices | Callaway-Sant'Anna |
| Concerned about model specification | Doubly robust (Sant'Anna-Zhao) |
| Have panel with switching | de Chaisemartin-d'Haultfoeuille |

In practice, reporting multiple estimators and showing they agree strengthens credibility.


13.6 Extensions and Special Cases

Triple Differences (DDD)

When parallel trends is questionable, adding a third difference can help. Triple differences requires:

  • Two groups (treated vs. control) × two time periods × two subgroups (affected vs. unaffected)

$$\tau^{DDD} = \left[(Y^{T,A}_{post} - Y^{T,A}_{pre}) - (Y^{T,U}_{post} - Y^{T,U}_{pre})\right] - \left[(Y^{C,A}_{post} - Y^{C,A}_{pre}) - (Y^{C,U}_{post} - Y^{C,U}_{pre})\right]$$

where T/C = treated/control region, A/U = affected/unaffected subgroup.
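
A toy calculation with eight hypothetical cell means to fix the formula's mechanics:

```python
# Hypothetical cell means: m[region][subgroup][period].
m = {
    "T": {"A": {"pre": 10.0, "post": 13.0}, "U": {"pre": 8.0, "post": 9.5}},
    "C": {"A": {"pre": 9.0, "post": 10.5}, "U": {"pre": 7.0, "post": 8.5}},
}

def dd(region):
    """Within-region DiD: change for affected minus change for unaffected."""
    a = m[region]["A"]["post"] - m[region]["A"]["pre"]
    u = m[region]["U"]["post"] - m[region]["U"]["pre"]
    return a - u

ddd = dd("T") - dd("C")  # (3.0 - 1.5) - (1.5 - 1.5) = 1.5
print(ddd)
```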

Example: Studying health insurance mandate effects. Compare:

  • Individuals affected by mandate (age-eligible) vs. unaffected (older)

  • In states that implemented vs. didn't implement

  • Before vs. after implementation

Triple differences removes state-specific trends and age-specific trends, requiring only that differential state-by-age trends are parallel.

Warning: The "Parallel Gaps" Assumption in DDD

DDD is often presented as more robust than DiD—"if parallel trends fails, add another difference." But DDD requires its own identifying assumption that is frequently overlooked.

The parallel gaps assumption: The gap between affected and unaffected subgroups must evolve in parallel across treated and control regions. Formally:

$$E[Y^{T,A}(0) - Y^{T,U}(0)]_{post} - E[Y^{T,A}(0) - Y^{T,U}(0)]_{pre} = E[Y^{C,A}(0) - Y^{C,U}(0)]_{post} - E[Y^{C,A}(0) - Y^{C,U}(0)]_{pre}$$

This is not just parallel trends for each group separately. The gap must evolve similarly.

When parallel gaps fails:

  • If treated states were on different economic trajectories that affected subgroups differently

  • If the "unaffected" subgroup in treated states was indirectly affected (spillovers)

  • If composition of affected/unaffected groups changed differentially

Example: Studying minimum wage effects using young (affected) vs. older (unaffected) workers. If treated states had stronger overall growth, this might differentially benefit young workers (whose employment is more cyclical). The gap would widen more in treated states even without the policy—violating parallel gaps.

What to do:

  • Test for parallel pre-treatment gaps, not just parallel pre-trends for each group

  • Consider whether any shock could differentially affect the gap

  • Report DDD alongside simple DiD; if they diverge substantially, investigate why

Synthetic Difference-in-Differences

Arkhangelsky et al. (2021) combine DiD with synthetic control methods:

  1. Reweight control units to match treated units' pre-treatment trends

  2. Reweight time periods to emphasize pre-treatment periods

  3. Estimate treatment effect using the reweighted comparison

This relaxes parallel trends—control units need not be parallel, just reweightable to be parallel.

Continuous Treatment

Standard DiD assumes binary treatment. With continuous treatment intensity:

$$Y_{it} = \alpha_i + \gamma_t + \tau \cdot TreatIntensity_{it} + \varepsilon_{it}$$

Challenges:

  • TWFE problems are even more severe with continuous treatment

  • No clear analogue to "never-treated" group

  • Interpretation shifts from ATE to dose-response

Solutions: de Chaisemartin and d'Haultfoeuille extend their estimator to continuous treatment; Callaway, Goodman-Bacon, and Sant'Anna (2024) develop methods for this setting.


Practical Guidance

When to Use DiD

| Situation | Appropriate? | Notes |
| --- | --- | --- |
| Policy change affecting some units, not others | Yes | Classic DiD setting |
| Staggered policy adoption across units | Yes | Use modern estimators |
| Geographic policy variation | Yes | Consider spatial spillovers |
| Before-after with no control group | No | Cannot separate treatment from time trends |
| Treatment assigned based on outcome trends | No | Violates parallel trends |
| Very different treated and control groups | Maybe | Conditional PT may help; scrutinize carefully |

Inference with Few Clusters

DiD analyses often have few clusters—at most 50 states, and sometimes only 10 treated hospitals or a handful of policy changes. Standard cluster-robust standard errors perform poorly in these settings, leading to over-rejection (too many false positives).

The problem: Cluster-robust variance estimators require many clusters ($G \to \infty$) for their asymptotic justification. With small $G$, they understate uncertainty.

Rule of thumb: Be concerned when:

  • $G < 50$ total clusters, OR

  • Fewer than 10-20 treated clusters, OR

  • Highly unbalanced cluster sizes

Solutions:

1. Wild Cluster Bootstrap (Cameron, Gelbach & Miller 2008)

The wild cluster bootstrap imposes the null hypothesis and constructs the distribution of the test statistic under the null by resampling residuals at the cluster level:
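
A minimal sketch with Rademacher weights, imposing the null on the DiD interaction; for production work, use a vetted implementation (e.g., Stata's boottest or R's fwildclusterboot). The data, cluster count, and magnitudes below are made up:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(8)
# 12 clusters (e.g., states), half treated in the post period.
clusters, periods, n_per = np.arange(12), [0, 1], 30
rows = [(c, p) for c in clusters for p in periods for _ in range(n_per)]
df = pd.DataFrame(rows, columns=["cluster", "post"])
df["treated"] = (df["cluster"] < 6).astype(int)
shock = rng.normal(0, 0.5, 12)  # cluster-level shocks
df["y"] = (1 + df["treated"] + 0.5 * df["post"] + shock[df["cluster"]]
           + rng.normal(0, 1, len(df)))  # true DiD effect is zero here

res = smf.ols("y ~ treated * post", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["cluster"]})
t_obs = res.tvalues["treated:post"]

# Restricted model imposes the null (tau = 0): no interaction term.
res_r = smf.ols("y ~ treated + post", data=df).fit()
fitted, resid = res_r.fittedvalues, res_r.resid

B, exceed = 999, 0
for _ in range(B):
    flips = rng.choice([-1.0, 1.0], size=12)  # Rademacher weights by cluster
    df["y_star"] = fitted + flips[df["cluster"]] * resid
    res_b = smf.ols("y_star ~ treated * post", data=df).fit(
        cov_type="cluster", cov_kwds={"groups": df["cluster"]})
    exceed += abs(res_b.tvalues["treated:post"]) >= abs(t_obs)
print(f"wild cluster bootstrap p-value: {(exceed + 1) / (B + 1):.3f}")
```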

2. Randomization Inference

If treatment was assigned randomly (or quasi-randomly) across clusters, permutation-based inference is valid regardless of the number of clusters:
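
A minimal sketch, permuting which clusters are labeled treated and recomputing the DiD estimate on cluster-level means each time; data and magnitudes are made up:

```python
import numpy as np

rng = np.random.default_rng(9)
clusters, n_treat = np.arange(12), 6
# Simulated cluster-by-period means (pre/post), no true treatment effect.
pre = rng.normal(5, 1, 12)
post = pre + 0.4 + rng.normal(0, 0.3, 12)  # common trend only

def did(treated_ids):
    """DiD on cluster-level means for a given set of treated clusters."""
    tr = np.isin(clusters, treated_ids)
    return (post[tr] - pre[tr]).mean() - (post[~tr] - pre[~tr]).mean()

obs = did(np.arange(n_treat))  # actual assignment: clusters 0-5 treated

perms, exceed = 5000, 0
for _ in range(perms):
    fake = rng.choice(clusters, size=n_treat, replace=False)
    exceed += abs(did(fake)) >= abs(obs)
print(f"randomization inference p-value: {(exceed + 1) / (perms + 1):.3f}")
```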

3. Aggregation to Cluster Level

With very few clusters, analyze data collapsed to cluster-by-time cells. This makes the small-$G$ problem explicit and avoids spurious precision.

Which to use?

| Number of clusters | Recommendation |
| --- | --- |
| $G \geq 50$ | Standard cluster-robust SEs are usually fine |
| $20 \leq G < 50$ | Use wild cluster bootstrap; compare to cluster-robust |
| $G < 20$ | Wild cluster bootstrap essential; consider aggregation |
| $G < 10$ | Aggregation or randomization inference; bootstrap may fail |

Warning: Papers with few treated states/regions and standard cluster SEs should be viewed skeptically. The minimum wage literature, for example, has wrestled with this—studies using variation across ~50 states are far more credible than those using a handful of state policy changes.

Common Pitfalls

Pitfall 1: Trusting TWFE with staggered adoption. TWFE can produce severely biased estimates—even wrong-signed—when treatment effects are heterogeneous. This is the norm, not the exception.

How to avoid: Use modern estimators (Callaway-Sant'Anna, Sun-Abraham). Run Goodman-Bacon decomposition to diagnose TWFE problems. Report both TWFE and modern estimates; if they differ substantially, trust the modern ones.

Pitfall 2: Cherry-picking the control group. Researchers sometimes choose control groups that show parallel pre-trends, dropping controls that diverge. This is specification searching that invalidates inference.

How to avoid: Pre-specify the control group based on substantive criteria (geography, industry, demographics), not pre-trend fit. Document the choice before examining results.

Pitfall 3: Conditioning on post-treatment variables. Including covariates measured after treatment can induce bias by controlling for outcomes of treatment.

How to avoid: Only condition on pre-treatment (baseline) covariates.

Pitfall 4: Ignoring anticipation effects. If units adjust behavior before treatment takes effect (e.g., firms change hiring before a minimum wage increase), effects appear in pre-treatment periods.

How to avoid: Consider whether anticipation is plausible. If so, allow for anticipation in the event study specification (e.g., set $k = -2$ as the reference period).

Pitfall 5: Spillovers between treated and control. If treatment affects control units (e.g., minimum wage in NJ affects PA restaurants near the border), DiD is biased.

How to avoid: Consider whether spillovers are plausible. Test by examining control units near vs. far from treated units.

Implementation Checklist

  • Plot raw outcome paths for treated and control groups before running any regression

  • Estimate an event study; report pre-treatment coefficients, a joint test, and their magnitudes relative to the estimated effects

  • With staggered adoption, use a modern estimator (Callaway-Sant'Anna, Sun-Abraham); report TWFE alongside and diagnose differences (Goodman-Bacon decomposition, negative-weight checks)

  • Cluster standard errors at the level of treatment assignment; with few clusters, use the wild cluster bootstrap or randomization inference

  • Condition only on pre-treatment (baseline) covariates

  • Pre-specify the control group on substantive grounds, not pre-trend fit

  • Assess whether anticipation effects or spillovers between treated and control units are plausible


Qualitative Bridge

How Qualitative Methods Complement DiD

DiD identifies an average treatment effect under parallel trends, but leaves key questions unanswered:

  • Why did some units adopt treatment and others not?

  • How did treatment produce its effects?

  • Why might parallel trends hold (or fail)?

Qualitative research can address these gaps.

When to Combine

Understanding treatment assignment: Interviews with policymakers, analysis of legislative debates, and process tracing can reveal why some states adopted a policy and others did not. If selection was driven by factors unrelated to outcome trends, parallel trends is more plausible.

Mechanism exploration: DiD tells us minimum wage increases didn't reduce employment. Qualitative research—employer interviews, observation of workplace dynamics—can reveal why: did firms raise prices, reduce profits, increase productivity, or substitute toward different workers?

Assessing external validity: Case studies of specific treated units help understand whether findings generalize. What makes New Jersey's experience with minimum wage relevant (or not) for other states?

Example: Minimum Wage Research

Card and Krueger's quantitative finding sparked intensive qualitative investigation:

  • Employer interviews revealed adjustment mechanisms (reduced turnover, price increases) that reconciled findings with theory

  • Analysis of policy debates showed minimum wage increases were driven by political factors, not economic conditions—supporting parallel trends

  • Industry case studies identified heterogeneity: effects differed by restaurant type, location, and competitive environment

This triangulation strengthened the overall evidence base beyond what DiD alone could provide.


Integration Note

Connections to Other Methods

| Method | Relationship | See Chapter |
| --- | --- | --- |
| Instrumental Variables | DiD as instrument for endogenous treatment intensity | Ch. 12 |
| Regression Discontinuity | RD in time (sharp change at treatment date) | Ch. 14 |
| Synthetic Control | Alternative when single treated unit; can combine | Ch. 15 |
| Selection on Observables | Conditional DiD uses similar propensity score methods | Ch. 11 |

Triangulation Strategies

DiD estimates gain credibility when combined with:

  1. Different control groups: Do results hold with alternative comparison units?

  2. Different outcome variables: Do related outcomes show consistent patterns?

  3. Different time windows: Are results robust to expanding or shifting the analysis period?

  4. Alternative estimators: Do Callaway-Sant'Anna, Sun-Abraham, and TWFE agree?

  5. Synthetic control: For small numbers of treated units, does SCM yield similar conclusions?

  6. Qualitative evidence: Do interviews and case studies support the mechanism?


Running Example: China's Special Economic Zones

China's Special Economic Zones (SEZs) provide a staggered DiD setting: different cities received SEZ status at different times starting in 1980, with waves in 1984, 1988, 1992, and later years.

Research question: What was the effect of SEZ designation on local economic growth?

DiD setup:

  • Treated units: Cities designated as SEZs

  • Control units: Similar cities without SEZ status

  • Treatment timing: Varies by city (staggered adoption)

  • Outcomes: GDP, employment, investment, exports

Challenges illustrating this chapter's themes:

  1. Selection: SEZ designation was not random—initial zones were in strategic coastal locations. Why were these cities chosen? Does selection violate parallel trends?

  2. Staggered adoption: Different waves may have faced different economic environments. TWFE would use early SEZs as controls for later SEZs—problematic if SEZ effects grew over time.

  3. Heterogeneous effects: SEZs near Hong Kong (Shenzhen) likely had different effects than inland SEZs designated in the 1990s.

  4. Spillovers: SEZs may have affected neighboring non-SEZ cities, contaminating the control group.

Modern approach: Wang (2013) and subsequent work apply careful DiD methods:

  • Use never-SEZ cities as controls (avoiding forbidden comparisons)

  • Examine pre-trends in growth rates

  • Estimate heterogeneous effects by SEZ cohort

  • Test for spillovers to nearby cities

Findings: SEZ designation substantially increased local GDP and exports, with effects persisting and growing over time. Effects were largest for early SEZs with access to Hong Kong capital and expertise.

Methodological lesson: The China SEZ case illustrates both DiD's power (exploiting policy variation for causal inference) and its challenges (selection into treatment, heterogeneous effects, spillovers). Credible analysis requires modern estimators and careful attention to identification threats.


Summary

Key takeaways:

  1. DiD identifies causal effects by comparing changes: The difference between treated and control group changes removes time-invariant confounders and common time trends.

  2. Parallel trends is the key assumption: The control group's trajectory must reflect the treated group's counterfactual. This is untestable but can be probed through pre-trends analysis.

  3. TWFE fails with staggered adoption and heterogeneous effects: The standard approach uses "forbidden comparisons" with already-treated units as controls, potentially producing wrong-signed estimates.

  4. Modern estimators solve the TWFE problem: Callaway-Sant'Anna, Sun-Abraham, and related methods construct valid comparisons explicitly. These should be standard practice for staggered adoption settings.

  5. Event studies visualize dynamics and assess parallel trends: Plotting coefficients by event time reveals both pre-trends (for diagnostics) and the dynamic path of treatment effects.

Returning to the opening question: When a policy changes in one place but not another, we can use the comparison to identify the policy's causal effect—but only if the comparison group's trajectory reveals the treated group's counterfactual. This requires parallel trends, careful attention to treatment timing and heterogeneity, and (in modern practice) estimators designed to handle the complications that arise in real policy settings.


Further Reading

Essential

  • Cunningham (2021), Causal Inference: The Mixtape, Chapter 9 - Accessible introduction with examples

  • Roth et al. (2023), "What's Trending in Difference-in-Differences?" - Comprehensive survey of recent developments

For Deeper Understanding

  • Goodman-Bacon (2021), "Difference-in-Differences with Variation in Treatment Timing" - The key decomposition result explaining TWFE problems

  • Callaway and Sant'Anna (2021), "Difference-in-Differences with Multiple Time Periods" - Modern estimator and aggregation

  • Sun and Abraham (2021), "Estimating Dynamic Treatment Effects in Event Studies with Heterogeneous Treatment Effects" - Interaction-weighted estimator

Advanced/Specialized

  • de Chaisemartin and d'Haultfoeuille (2020), "Two-Way Fixed Effects Estimators with Heterogeneous Treatment Effects" - Negative weights and alternative estimator

  • Sant'Anna and Zhao (2020), "Doubly Robust Difference-in-Differences Estimators" - Combining outcome and propensity models

  • Arkhangelsky et al. (2021), "Synthetic Difference-in-Differences" - Combining DiD with synthetic controls

Applications

  • Card and Krueger (1994), "Minimum Wages and Employment" - The canonical DiD application

  • Dube, Lester, and Reich (2010), "Minimum Wage Effects Across State Borders" - County-pair design with event studies

  • Cengiz et al. (2019), "The Effect of Minimum Wages on Low-Wage Jobs" - Bunching estimator approach


Exercises

Conceptual

  1. Explain why the parallel trends assumption is fundamentally untestable. What can researchers do to make it more plausible, and what are the limitations of these approaches?

  2. Consider a state that raises its minimum wage because unemployment has been falling rapidly and the legislature believes workers can now command higher wages. Would DiD using other states as controls identify the causal effect of minimum wage on employment? Why or why not?

  3. In the Goodman-Bacon decomposition, what makes a comparison "forbidden"? Give an example of how such a comparison could yield a wrong-signed estimate.

Applied

  1. Download state-level minimum wage and employment data. Implement both TWFE and Callaway-Sant'Anna estimators for a recent period of minimum wage changes. Do the estimates differ? Produce a Goodman-Bacon decomposition to diagnose the sources of any differences.

  2. Create an event study plot for a policy change of your choosing. Discuss what the pre-trends suggest about parallel trends and what the post-treatment dynamics reveal about the effect's evolution.

Discussion

  1. Some researchers argue that the DiD revolution has made causal inference harder by showing that standard methods were problematic. Others argue it has made inference easier by providing tools to address these problems. Which view do you find more compelling, and why?
