Chapter 13: Difference-in-Differences
Opening Question
When a policy changes in one place but not another, how can we use the comparison to learn about the policy's causal effect?
Chapter Overview
Difference-in-differences (DiD) is perhaps the most widely used quasi-experimental method in applied economics. The core idea is elegantly simple: compare the change in outcomes over time for a group affected by a treatment to the change for a group that was not affected. If both groups would have evolved similarly in the absence of treatment, the difference in their changes identifies the causal effect.
This simplicity is deceptive. The parallel trends assumption—that treated and control groups would have followed the same trajectory absent treatment—is fundamentally untestable. Moreover, recent methodological work has revealed that the standard two-way fixed effects (TWFE) estimator can produce severely biased estimates when treatment effects vary across units or over time, precisely the settings where DiD is most commonly applied. This chapter develops both the classical DiD framework and the modern solutions to these problems.
What you will learn:
The logic of DiD and the parallel trends assumption
How to assess parallel trends and conduct event study analysis
Why TWFE fails with staggered adoption and heterogeneous effects
Modern estimators that address these problems (Callaway-Sant'Anna, Sun-Abraham, and others)
When DiD is and is not a credible identification strategy
Prerequisites: Chapter 9 (Causal Framework), Chapter 3 (Statistical Foundations)
13.1 The Basic Difference-in-Differences Setup
The Canonical 2×2 Case
Consider the simplest DiD setting: two groups observed in two time periods, where one group receives treatment between periods and the other does not. Let $Y_{it}$ denote the outcome for unit $i$ at time $t$, with $t \in \{0, 1\}$ (pre and post) and group indicator $G_i \in \{0, 1\}$ (control and treated).
The DiD estimator is:
$$\hat{\tau}_{DiD} = (\bar{Y}_{1,1} - \bar{Y}_{1,0}) - (\bar{Y}_{0,1} - \bar{Y}_{0,0})$$
where $\bar{Y}_{g,t}$ is the average outcome for group $g$ in period $t$. This is the change in the treated group minus the change in the control group—the "difference in differences."
Definition 13.1 (Parallel Trends Assumption): In the absence of treatment, the average outcomes for treated and control groups would have followed parallel paths over time:
$$E[Y_{i1}(0) - Y_{i0}(0) \mid G_i = 1] = E[Y_{i1}(0) - Y_{i0}(0) \mid G_i = 0]$$
where $Y_{it}(0)$ denotes the potential outcome without treatment.
Intuition: The control group's change over time tells us what would have happened to the treated group had they not been treated. We're not assuming the groups are identical—they can have different levels—only that they would have changed by the same amount.
Regression Implementation
The 2×2 DiD can be estimated via regression:
$$Y_{it} = \alpha + \beta \cdot \text{Treated}_i + \gamma \cdot \text{Post}_t + \tau \cdot (\text{Treated}_i \times \text{Post}_t) + \varepsilon_{it}$$
where:
$\alpha$ = baseline mean for the control group in the pre-period
$\beta$ = level difference between treated and control groups (pre-treatment)
$\gamma$ = time trend common to both groups
$\tau$ = the DiD estimate (coefficient of interest)
This regression formulation extends naturally to include covariates and to settings with more groups and time periods.
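A minimal sketch of this regression in Python (assuming a long-format DataFrame `df` with columns `y`, `treated`, and `post`; the names are illustrative):

```python
import statsmodels.formula.api as smf

# 2x2 DiD via OLS: the interaction coefficient is the DiD estimate tau.
# Assumes df has columns y (outcome), treated (0/1), and post (0/1).
fit = smf.ols("y ~ treated * post", data=df).fit()
print(fit.params["treated:post"])   # tau_hat, the DiD estimate
```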
Example: Card and Krueger's Minimum Wage Study
The most famous DiD application is Card and Krueger's (1994) study of New Jersey's 1992 minimum wage increase. New Jersey raised its minimum wage from $4.25 to $5.05 per hour; neighboring Pennsylvania did not change its minimum wage.
Setup:
Treated group: Fast food restaurants in New Jersey
Control group: Fast food restaurants in eastern Pennsylvania
Pre-period: February 1992 (before NJ increase)
Post-period: November 1992 (after NJ increase)
Outcome: Employment (full-time equivalent employees)
Results (full-time equivalent employment per store):

| | New Jersey | Pennsylvania | NJ − PA |
|---|---|---|---|
| Before | 20.44 | 23.33 | −2.89 |
| After | 21.03 | 21.17 | −0.14 |
| Change | +0.59 | −2.16 | +2.75 |
The DiD estimate of +2.75 suggests the minimum wage increase raised employment, contradicting the standard competitive model prediction. This finding sparked decades of debate and methodological refinement.
Why DiD here? Simple before-after comparison in NJ would confound the minimum wage effect with any other changes occurring between February and November 1992 (seasonal effects, macroeconomic conditions). The PA comparison removes these common trends.
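The DiD arithmetic can be verified directly from the table above:

```python
# DiD arithmetic from the Card-Krueger table (FTE employment per store).
nj_change = 21.03 - 20.44     # +0.59
pa_change = 21.17 - 23.33     # -2.16
print(round(nj_change - pa_change, 2))   # 2.75, the DiD estimate
```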

13.2 The Parallel Trends Assumption
What Parallel Trends Means (and Doesn't Mean)
Parallel trends is an assumption about counterfactual outcomes—what would have happened to the treated group absent treatment. This is fundamentally untestable because we never observe the treated group's counterfactual post-treatment trajectory.
What parallel trends does NOT require:
That treated and control groups have similar outcome levels
That groups have identical characteristics
That the groups would have the same outcomes in any period
What parallel trends DOES require:
That in the absence of treatment, both groups would have experienced the same change in outcomes
That any time-varying factors affecting outcomes are either common to both groups or unrelated to treatment assignment
Pre-Trends Analysis
While we cannot test parallel trends directly, we can examine whether outcomes evolved similarly before treatment. If pre-treatment trends diverge, parallel trends in the post-period is implausible.
Implementation: Estimate an event study specification (Section 13.3) and examine coefficients for pre-treatment periods. Statistically significant pre-trends are a red flag.
Limitations of pre-trends testing:
Statistical power: Failure to reject parallel pre-trends may reflect low power rather than true parallel trends
Anticipation effects: Divergence just before treatment may reflect anticipation rather than pre-existing trends
Non-linear trends: Parallel linear pre-trends don't guarantee parallel post-trends if true trajectories are non-linear
Ashenfelter's dip: In some settings (e.g., job training programs), treated units experience a dip just before treatment that would reverse even absent treatment
Pitfall: The Pre-Trends Fallacy
Showing that pre-treatment coefficients are individually insignificant does not validate parallel trends. A joint test of all pre-treatment coefficients provides more power. Even better: present the pre-trends visually and discuss their magnitude relative to estimated effects.
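As a concrete sketch, the joint test is one line in statsmodels, assuming an estimated event-study model `fit` whose pre-period coefficients are named `ev_m2` through `ev_m4` (the hypothetical naming used in the sketch in Section 13.3):

```python
# Joint F-test that all pre-treatment event-study coefficients are zero.
# `fit` is an estimated event-study model; coefficient names are illustrative.
joint = fit.f_test("ev_m2 = 0, ev_m3 = 0, ev_m4 = 0")
print(joint.fvalue, joint.pvalue)
```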

Conditional Parallel Trends
When unconditional parallel trends is implausible, we may appeal to a weaker assumption:
Definition 13.2 (Conditional Parallel Trends): Parallel trends holds after conditioning on observable covariates $X_i$:
$$E[Y_{i1}(0) - Y_{i0}(0) \mid G_i = 1, X_i] = E[Y_{i1}(0) - Y_{i0}(0) \mid G_i = 0, X_i]$$
Implementation approaches:
Regression with covariates: Include $X_i$ in the DiD regression (but only baseline covariates—not post-treatment variables)
Matching/reweighting: Match treated and control units on $X_i$ before computing DiD
Doubly robust DiD: Combine outcome modeling and propensity score weighting (Sant'Anna and Zhao, 2020)
Which covariates to include? Include variables that:
Predict outcome trends (not just levels)
Differ between treated and control groups
Are measured pre-treatment (never condition on post-treatment variables)
The Credibility of Parallel Trends
Parallel trends is most plausible when:
Treatment is as-if random conditional on fixed effects: Policy variation driven by idiosyncratic factors unrelated to outcome trends
Treated and control units are similar: Geographic neighbors, same industry, comparable demographics
Treatment timing is plausibly exogenous: Not driven by anticipation of differential trends
Pre-trends are convincingly parallel: Multiple pre-periods show similar evolution
Parallel trends is least plausible when:
Selection into treatment: Units adopt treatment because of anticipated effects
Treated units are structurally different: Different industries, demographics, or growth trajectories
Treatment timing is endogenous: Early adopters differ systematically from late adopters
Only one pre-period: Cannot assess pre-existing trends
13.3 Event Studies
From DiD to Dynamic Treatment Effects
The basic 2×2 DiD assumes a constant treatment effect. In practice, effects may:
Build up gradually over time
Fade out after initial impact
Differ before and after treatment (anticipation effects)
Event study designs generalize DiD to trace out the dynamic path of treatment effects.
The Event Study Specification
For a setting where all treated units receive treatment at the same time $t^*$, the event study specification is:
$$Y_{it} = \alpha_i + \gamma_t + \sum_{k \neq -1} \tau_k \cdot \mathbf{1}[t - t^* = k] \cdot \text{Treated}_i + \varepsilon_{it}$$
where:
$\alpha_i$ = unit fixed effects
$\gamma_t$ = time fixed effects
$k$ = event time (periods relative to treatment)
$\tau_k$ = effect at event time $k$
The omitted category ($k = -1$) normalizes effects relative to the period just before treatment
Interpreting the coefficients:
$\tau_k$ for $k < 0$: Pre-treatment differences (should be ~0 under parallel trends)
$\tau_0$: Immediate treatment effect
$\tau_k$ for $k > 0$: Post-treatment effects (can show dynamics)
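A minimal implementation sketch (assuming a panel DataFrame `df` with columns `unit`, `time`, `state`, `treated`, and `y`, and a hypothetical common treatment date `T_STAR`; dedicated event-study packages are preferable for real work):

```python
import statsmodels.formula.api as smf

T_STAR = 2010                                  # hypothetical treatment year
df["k"] = (df["time"] - T_STAR).clip(-4, 4)    # event time, endpoints binned at +/-4

# Build event-time dummies interacted with treatment, omitting k = -1.
for k in range(-4, 5):
    if k == -1:
        continue                               # omitted reference period
    name = f"ev_m{-k}" if k < 0 else f"ev_p{k}"
    df[name] = ((df["k"] == k) & (df["treated"] == 1)).astype(float)

rhs = " + ".join(c for c in df.columns if c.startswith("ev_"))
fit = smf.ols(f"y ~ {rhs} + C(unit) + C(time)", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["state"]})
print(fit.params.filter(like="ev_"))           # the tau_k estimates to plot
```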
Visualization
Event study plots are among the most informative figures in applied economics.
A well-executed event study shows:
Point estimates for each event time
Confidence intervals (95%)
Clear marking of the treatment date
Pre-trends hovering around zero
Post-treatment effects revealing the dynamic pattern

Practical Implementation
Choosing the event window:
Include enough pre-periods to assess parallel trends (typically 3-5+)
Include enough post-periods to capture treatment dynamics
Bin or drop endpoints if sample sizes become small
Endpoint binning: With varying treatment timing, not all units contribute to all event times. Common practice:
Bin endpoints: a single coefficient $\tau_{-K^{-}}$ captures all $k \le -K^{-}$, and $\tau_{K^{+}}$ captures all $k \ge K^{+}$
Drop endpoints: Estimate only for event times with sufficient observations
Standard errors: Cluster at the level of treatment assignment (typically state, firm, or region).
Example: Minimum Wage Event Study
Extending Card and Krueger's design, Dube, Lester, and Reich (2010) conduct event studies using county-pair comparisons along state borders. For each state minimum wage increase, they trace employment effects from several quarters before to several quarters after:
Pre-trends: Employment in affected counties tracks employment in neighboring counties closely before minimum wage changes
Post-treatment: Little evidence of employment decline after minimum wage increases
Dynamic pattern: No delayed effects emerging over time
This event study approach addresses concerns about state-level confounders by using geographically proximate control counties.
13.4 Staggered Adoption and the Failure of TWFE
The Staggered Adoption Setting
Most real-world DiD applications involve staggered adoption: different units receive treatment at different times. States adopt policies in different years. Firms implement changes at different dates. Hospitals adopt technologies sequentially.
The natural approach is two-way fixed effects (TWFE):
$$Y_{it} = \alpha_i + \gamma_t + \tau \cdot D_{it} + \varepsilon_{it}$$
where $D_{it} = 1$ if unit $i$ is treated at time $t$ (and 0 otherwise), and $\alpha_i$ and $\gamma_t$ are unit and time fixed effects.
For decades, applied researchers assumed $\hat{\tau}_{TWFE}$ identified a weighted average of unit-specific treatment effects. Recent work has shown this is wrong—badly wrong when treatment effects are heterogeneous.
The Problem with TWFE
Goodman-Bacon (2021) provides the key decomposition. The TWFE estimator is a weighted average of all possible 2×2 DiD comparisons in the data. Some of these comparisons are sensible:
Clean comparisons: Early-treated vs. never-treated (good)
Clean comparisons: Late-treated vs. never-treated (good)
But others are problematic:
Forbidden comparisons: Late-treated vs. early-treated, using early-treated units as controls after they have already been treated
In this forbidden comparison, already-treated units serve as the control group. If treatment effects vary over time (e.g., effects grow after treatment), this comparison yields biased estimates—potentially even the wrong sign.
Theorem 13.1 (Goodman-Bacon Decomposition): The TWFE estimator equals a weighted average of all 2×2 DiD estimators:
$$\hat{\tau}_{TWFE} = \sum_k \sum_{l \neq k} w_{kl} \, \hat{\tau}_{kl}$$
where $\hat{\tau}_{kl}$ is the 2×2 DiD comparing timing group $k$ to timing group $l$, and the weights $w_{kl}$ depend on group sizes and treatment timing variance.
Intuition: TWFE doesn't know which comparisons are valid. It uses all of them, weighting by mechanical factors (sample size, variance) rather than causal relevance.
When Does This Matter?
TWFE problems are severe when:
Treatment effects are heterogeneous across units: Different units have different effect sizes
Treatment effects vary over time: Effects grow, shrink, or change sign over exposure time
Treatment timing is correlated with effect sizes: Early adopters have different effects than late adopters
There are no never-treated units: All comparisons involve already-treated controls
TWFE is less problematic when:
Treatment effects are constant across units and over time
There are many never-treated units
All treatment happens at similar times
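A small simulation makes the failure concrete. All numbers below are hypothetical; the design has two cohorts, no never-treated units, and effects that grow with exposure, exactly the setting where TWFE breaks:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Two cohorts (treated at t=3 and t=7), no never-treated units, and a
# treatment effect that grows by 1 for each period of exposure.
rng = np.random.default_rng(0)
rows = []
for i in range(40):
    g = 3 if i < 20 else 7
    for t in range(10):
        d = int(t >= g)
        effect = d * (t - g + 1)              # dynamic effect: 1, 2, 3, ...
        rows.append(dict(unit=i, time=t, d=d, effect=effect,
                         y=0.1 * i + 0.5 * t + effect + rng.normal(scale=0.1)))
df = pd.DataFrame(rows)

true_att = df.loc[df["d"] == 1, "effect"].mean()           # about 3.4 here
twfe = smf.ols("y ~ d + C(unit) + C(time)", data=df).fit()
# The TWFE coefficient falls well below the true average effect, because the
# late cohort is compared against the early cohort while the early cohort's
# effect is still growing (the "forbidden comparison").
print(f"true ATT = {true_att:.2f}, TWFE = {twfe.params['d']:.2f}")
```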
Negative Weights
de Chaisemartin and d'Haultfoeuille (2020) show that TWFE assigns negative weights to some treatment effects under heterogeneity. This means:
Even if all unit-level effects are positive, TWFE can estimate a negative average effect
The estimated "average treatment effect" may not be a proper average of anything
Their diagnostic: Compute the weights TWFE assigns to each unit-time observation. If many weights are negative, TWFE estimates are unreliable.
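The diagnostic can be sketched via the Frisch-Waugh theorem: TWFE weights each treated cell in proportion to the residual from regressing the treatment dummy on the two sets of fixed effects, so negative residuals on treated cells are the negative weights (continuing with the simulated `df` from the sketch above):

```python
import statsmodels.formula.api as smf

# Residualize D on unit and time fixed effects; treated cells with negative
# residuals receive negative weight in the TWFE estimate.
d_tilde = smf.ols("d ~ C(unit) + C(time)", data=df).fit().resid
w = d_tilde[df["d"] == 1]
print(f"share of treated cells with negative weight: {(w < 0).mean():.2f}")
```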


13.5 Modern DiD Estimators
The "DiD revolution" of 2018-2022 produced several estimators that avoid TWFE's problems. All share a common insight: construct valid comparisons explicitly rather than relying on regression mechanics.
Callaway and Sant'Anna (2021)
Key idea: Estimate separate treatment effects for each cohort (defined by treatment timing) at each event time, then aggregate as desired.
Group-time average treatment effects: For units first treated at time $g$, the effect at time $t$ is:
$$ATT(g, t) = E[Y_t(g) - Y_t(0) \mid G_g = 1]$$
where $G_g = 1$ indicates first treatment at time $g$.
Estimation: Each ATT(g,t) is estimated using only:
Treated units: Those first treated at time g
Control units: Never-treated units (or not-yet-treated units)
This avoids forbidden comparisons entirely.
Aggregation: Once we have all ATT(g,t) estimates, we can aggregate:
By event time: Average effects at each k=t−g
By cohort: Average effects for each treatment-timing group
Overall: Single summary measure
Inference: Simultaneous confidence bands account for multiple comparisons.
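A stripped-down sketch of the ATT(g, t) building block (no covariates, no inference; assumes columns `unit`, `time`, `y`, and `cohort`, where `cohort` is the first treatment period and NaN marks never-treated units; real applications should use a dedicated implementation such as the authors' R package did):

```python
import numpy as np
import pandas as pd

def att_gt(df, g, t):
    # 2x2 building block: cohort first treated at g vs. never-treated units,
    # comparing period t to the last pre-period g - 1.
    treat = df[df["cohort"] == g]
    ctrl = df[df["cohort"].isna()]                 # never-treated units
    dy_treat = (treat.loc[treat["time"] == t, "y"].mean()
                - treat.loc[treat["time"] == g - 1, "y"].mean())
    dy_ctrl = (ctrl.loc[ctrl["time"] == t, "y"].mean()
               - ctrl.loc[ctrl["time"] == g - 1, "y"].mean())
    return dy_treat - dy_ctrl

# Aggregate by event time, e.g. the average effect two periods after treatment:
# cohorts = df["cohort"].dropna().unique()
# print(np.mean([att_gt(df, g, g + 2) for g in cohorts]))
```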
Sun and Abraham (2021)
Key idea: Saturate the event study regression with cohort-specific effects, then aggregate.
Interaction-weighted estimator: Estimate:
$$Y_{it} = \alpha_i + \gamma_t + \sum_g \sum_{k \neq -1} \tau_{g,k} \cdot \mathbf{1}[G_i = g] \cdot \mathbf{1}[t - g = k] + \varepsilon_{it}$$
This estimates separate event-time effects ($\tau_{g,k}$) for each cohort $g$. The overall event-time effect is then:
$$\hat{\tau}_k = \sum_g \hat{w}_g \, \hat{\tau}_{g,k}$$
where the weights $\hat{w}_g$ are chosen by the researcher (e.g., cohort size).
Comparison to Callaway-Sant'Anna: Similar results in most applications. Sun-Abraham is regression-based (familiar syntax); Callaway-Sant'Anna is more explicit about the estimand.
Other Approaches
de Chaisemartin and d'Haultfoeuille (2020): Estimate instantaneous effects using "switching" variation—units that switch treatment status.
Borusyak, Jaravel, and Spiess (2024): Imputation approach. Estimate counterfactual outcomes for treated observations using untreated observations, then compare.
Wooldridge (2021): Extended TWFE with cohort-specific trends. Shows that properly specified TWFE can work, but requires many interaction terms.
Choosing an Estimator
| Situation | Recommended approach |
|---|---|
| Standard setting (staggered adoption) | Callaway-Sant'Anna or Sun-Abraham |
| Want regression syntax | Sun-Abraham |
| Want explicit aggregation choices | Callaway-Sant'Anna |
| Concerned about model specification | Doubly robust (Sant'Anna-Zhao) |
| Have panel with switching | de Chaisemartin-d'Haultfoeuille |
In practice, reporting multiple estimators and showing they agree strengthens credibility.
13.6 Extensions and Special Cases
Triple Differences (DDD)
When parallel trends is questionable, adding a third difference can help. Triple differences requires:
Two groups (treated vs. control) × two time periods × two subgroups (affected vs. unaffected)
$$\tau_{DDD} = \left[ (\bar{Y}^{T,A}_{post} - \bar{Y}^{T,A}_{pre}) - (\bar{Y}^{T,U}_{post} - \bar{Y}^{T,U}_{pre}) \right] - \left[ (\bar{Y}^{C,A}_{post} - \bar{Y}^{C,A}_{pre}) - (\bar{Y}^{C,U}_{post} - \bar{Y}^{C,U}_{pre}) \right]$$
where $T/C$ = treated/control region, $A/U$ = affected/unaffected subgroup.
Example: Studying health insurance mandate effects. Compare:
Individuals affected by mandate (age-eligible) vs. unaffected (older)
In states that implemented vs. didn't implement
Before vs. after implementation
Triple differences removes state-specific trends and age-specific trends, requiring only that differential state-by-age trends are parallel.
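In regression form, $\tau_{DDD}$ is the coefficient on the triple interaction. A minimal sketch (assuming a DataFrame `df` with indicator columns `treat_state`, `affected`, and `post`; names are illustrative):

```python
import statsmodels.formula.api as smf

# Saturated triple-interaction regression; the three-way interaction
# coefficient is the DDD estimate.
fit = smf.ols("y ~ treat_state * affected * post", data=df).fit()
print(fit.params["treat_state:affected:post"])   # tau_DDD
```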
Warning: The "Parallel Gaps" Assumption in DDD
DDD is often presented as more robust than DiD—"if parallel trends fails, add another difference." But DDD requires its own identifying assumption that is frequently overlooked.
The parallel gaps assumption: The gap between affected and unaffected subgroups must evolve in parallel across treated and control regions. Formally:
$$E[Y^{T,A}(0) - Y^{T,U}(0)]_{post} - E[Y^{T,A}(0) - Y^{T,U}(0)]_{pre} = E[Y^{C,A}(0) - Y^{C,U}(0)]_{post} - E[Y^{C,A}(0) - Y^{C,U}(0)]_{pre}$$
This is not just parallel trends for each group separately. The gap must evolve similarly.
When parallel gaps fails:
If treated states were on different economic trajectories that affected subgroups differently
If the "unaffected" subgroup in treated states was indirectly affected (spillovers)
If composition of affected/unaffected groups changed differentially
Example: Studying minimum wage effects using young (affected) vs. older (unaffected) workers. If treated states had stronger overall growth, this might differentially benefit young workers (whose employment is more cyclical). The gap would widen more in treated states even without the policy—violating parallel gaps.
What to do:
Test for parallel pre-treatment gaps, not just parallel pre-trends for each group
Consider whether any shock could differentially affect the gap
Report DDD alongside simple DiD; if they diverge substantially, investigate why
Synthetic Difference-in-Differences
Arkhangelsky et al. (2021) combine DiD with synthetic control methods:
Reweight control units to match treated units' pre-treatment trends
Reweight time periods to emphasize pre-treatment periods
Estimate treatment effect using the reweighted comparison
This relaxes parallel trends—control units need not be parallel, just reweightable to be parallel.
Continuous Treatment
Standard DiD assumes binary treatment. With continuous treatment intensity:
$$Y_{it} = \alpha_i + \gamma_t + \tau \cdot \text{TreatIntensity}_{it} + \varepsilon_{it}$$
Challenges:
TWFE problems are even more severe with continuous treatment
No clear analogue to "never-treated" group
Interpretation shifts from ATE to dose-response
Solutions: de Chaisemartin and d'Haultfoeuille extend their estimator to continuous treatment; Callaway, Goodman-Bacon, and Sant'Anna (2024) develop methods for this setting.
Practical Guidance
When to Use DiD
| Setting | Use DiD? | Notes |
|---|---|---|
| Policy change affecting some units, not others | Yes | Classic DiD setting |
| Staggered policy adoption across units | Yes | Use modern estimators |
| Geographic policy variation | Yes | Consider spatial spillovers |
| Before-after with no control group | No | Cannot separate treatment from time trends |
| Treatment assigned based on outcome trends | No | Violates parallel trends |
| Very different treated and control groups | Maybe | Conditional PT may help; scrutinize carefully |
Inference with Few Clusters
DiD analyses often have few clusters—sometimes as few as 50 states, 10 treated hospitals, or a handful of policy changes. Standard cluster-robust standard errors perform poorly in these settings, leading to over-rejection (too many false positives).
The problem: Cluster-robust variance estimators require many clusters ($G \to \infty$) for their asymptotic justification. With small $G$, they understate uncertainty.
Rule of thumb: Be concerned when:
$G < 50$ total clusters, OR
Fewer than 10-20 treated clusters, OR
Highly unbalanced cluster sizes
Solutions:
1. Wild Cluster Bootstrap (Cameron, Gelbach & Miller 2008)
The wild cluster bootstrap imposes the null hypothesis and constructs the distribution of the test statistic under the null by resampling residuals at the cluster level:
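(A minimal sketch of the procedure; the helper and its column names are illustrative, and dedicated implementations such as Stata/R boottest are preferable in practice.)

```python
import numpy as np
import statsmodels.formula.api as smf

def wild_cluster_boot_p(df, outcome, rhs_full, rhs_null, param, cluster,
                        n_boot=999, seed=0):
    """Wild cluster bootstrap-t p-value for H0: coefficient on `param` = 0,
    imposing the null and flipping residual signs by cluster (Rademacher)."""
    rng = np.random.default_rng(seed)
    clusters = df[cluster].unique()

    def cluster_fit(data):
        return smf.ols(f"{outcome} ~ {rhs_full}", data=data).fit(
            cov_type="cluster", cov_kwds={"groups": data[cluster]})

    t_obs = cluster_fit(df).tvalues[param]
    null_fit = smf.ols(f"{outcome} ~ {rhs_null}", data=df).fit()  # null imposed

    hits, df_b = 0, df.copy()
    for _ in range(n_boot):
        flips = df[cluster].map(
            dict(zip(clusters, rng.choice([-1.0, 1.0], size=len(clusters)))))
        df_b[outcome] = null_fit.fittedvalues + null_fit.resid * flips
        hits += abs(cluster_fit(df_b).tvalues[param]) >= abs(t_obs)
    return hits / n_boot

# Hypothetical usage for a state-level DiD:
# p = wild_cluster_boot_p(df, "y", "treated * post", "treated + post",
#                         "treated:post", cluster="state")
```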
2. Randomization Inference
If treatment was assigned randomly (or quasi-randomly) across clusters, permutation-based inference is valid regardless of the number of clusters:
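(Again a sketch: `estimator` is any user-supplied function returning the DiD estimate when the given clusters are treated as the treated group.)

```python
import numpy as np

def ri_pvalue(df, estimator, treated_ids, all_ids, n_perm=999, seed=0):
    """Randomization-inference p-value: re-randomize which clusters are
    'treated' and recompute the DiD estimate under each placebo assignment."""
    rng = np.random.default_rng(seed)
    obs = estimator(df, treated_ids)
    perm = [estimator(df, rng.choice(all_ids, size=len(treated_ids), replace=False))
            for _ in range(n_perm)]
    # Share of placebo assignments with an effect at least as large as observed
    return (np.sum(np.abs(perm) >= abs(obs)) + 1) / (n_perm + 1)
```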
3. Aggregation to Cluster Level
With very few clusters, analyze data collapsed to cluster-by-time cells. This makes the small-G problem explicit and avoids spurious precision.
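A sketch of the collapse, assuming columns `state`, `post`, `treated`, and `y`:

```python
import statsmodels.formula.api as smf

# Collapse to cluster-by-period means, then estimate the 2x2 DiD on the
# small collapsed dataset, making the effective sample size explicit.
cells = df.groupby(["state", "post"], as_index=False).agg(
    y=("y", "mean"), treated=("treated", "first"))
fit = smf.ols("y ~ treated * post", data=cells).fit()
print(fit.params["treated:post"])
```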
Which to use?

| Clusters | Recommendation |
|---|---|
| $G \ge 50$ | Standard cluster-robust SEs are usually fine |
| $20 \le G < 50$ | Wild cluster bootstrap; compare to cluster-robust |
| $G < 20$ | Wild cluster bootstrap essential; consider aggregation |
| $G < 10$ | Aggregation or randomization inference; bootstrap may fail |
Warning: Papers with few treated states/regions and standard cluster SEs should be viewed skeptically. The minimum wage literature, for example, has wrestled with this—studies using variation across ~50 states are far more credible than those using a handful of state policy changes.
Common Pitfalls
Pitfall 1: Trusting TWFE with staggered adoption
TWFE can produce severely biased estimates—even wrong-signed—when treatment effects are heterogeneous. This is the norm, not the exception.
How to avoid: Use modern estimators (Callaway-Sant'Anna, Sun-Abraham). Run Goodman-Bacon decomposition to diagnose TWFE problems. Report both TWFE and modern estimates; if they differ substantially, trust the modern ones.
Pitfall 2: Cherry-picking the control group
Researchers sometimes choose control groups that show parallel pre-trends, dropping controls that diverge. This is specification searching that invalidates inference.
How to avoid: Pre-specify the control group based on substantive criteria (geography, industry, demographics), not pre-trend fit. Document the choice before examining results.
Pitfall 3: Conditioning on post-treatment variables
Including covariates measured after treatment can induce bias by controlling for outcomes of treatment.
How to avoid: Only condition on pre-treatment (baseline) covariates.
Pitfall 4: Ignoring anticipation effects
If units adjust behavior before treatment takes effect (e.g., firms change hiring before a minimum wage increase), effects appear in pre-treatment periods.
How to avoid: Consider whether anticipation is plausible. If so, allow for anticipation in the event study specification (e.g., set $k = -2$ as the reference period).
Pitfall 5: Spillovers between treated and control
If treatment affects control units (e.g., minimum wage in NJ affects PA restaurants near the border), DiD is biased.
How to avoid: Consider whether spillovers are plausible. Test by examining control units near vs. far from treated units.
Implementation Checklist
Plot raw outcome trends for treated and control groups before any estimation
Estimate an event study; assess pre-trends visually and with a joint test
With staggered adoption, use a modern estimator (Callaway-Sant'Anna, Sun-Abraham) and report TWFE alongside for comparison
Run the Goodman-Bacon decomposition to diagnose TWFE problems
Condition only on pre-treatment covariates
Cluster standard errors at the level of treatment assignment; with few clusters, use the wild cluster bootstrap
Probe robustness: alternative control groups, outcomes, and time windows
Qualitative Bridge
How Qualitative Methods Complement DiD
DiD identifies an average treatment effect under parallel trends, but leaves key questions unanswered:
Why did some units adopt treatment and others not?
How did treatment produce its effects?
Why might parallel trends hold (or fail)?
Qualitative research can address these gaps.
When to Combine
Understanding treatment assignment: Interviews with policymakers, analysis of legislative debates, and process tracing can reveal why some states adopted a policy and others did not. If selection was driven by factors unrelated to outcome trends, parallel trends is more plausible.
Mechanism exploration: DiD tells us minimum wage increases didn't reduce employment. Qualitative research—employer interviews, observation of workplace dynamics—can reveal why: did firms raise prices, reduce profits, increase productivity, or substitute toward different workers?
Assessing external validity: Case studies of specific treated units help understand whether findings generalize. What makes New Jersey's experience with minimum wage relevant (or not) for other states?
Example: Minimum Wage Research
Card and Krueger's quantitative finding sparked intensive qualitative investigation:
Employer interviews revealed adjustment mechanisms (reduced turnover, price increases) that reconciled findings with theory
Analysis of policy debates showed minimum wage increases were driven by political factors, not economic conditions—supporting parallel trends
Industry case studies identified heterogeneity: effects differed by restaurant type, location, and competitive environment
This triangulation strengthened the overall evidence base beyond what DiD alone could provide.
Integration Note
Connections to Other Methods

| Method | Connection | Chapter |
|---|---|---|
| Instrumental Variables | DiD as instrument for endogenous treatment intensity | Ch. 12 |
| Regression Discontinuity | RD in time (sharp change at treatment date) | Ch. 14 |
| Synthetic Control | Alternative when single treated unit; can combine | Ch. 15 |
| Selection on Observables | Conditional DiD uses similar propensity score methods | Ch. 11 |
Triangulation Strategies
DiD estimates gain credibility when combined with:
Different control groups: Do results hold with alternative comparison units?
Different outcome variables: Do related outcomes show consistent patterns?
Different time windows: Are results robust to expanding or shifting the analysis period?
Alternative estimators: Do Callaway-Sant'Anna, Sun-Abraham, and TWFE agree?
Synthetic control: For small numbers of treated units, does SCM yield similar conclusions?
Qualitative evidence: Do interviews and case studies support the mechanism?
Running Example: China's Special Economic Zones
China's Special Economic Zones (SEZs) provide a staggered DiD setting: different cities received SEZ status at different times starting in 1980, with waves in 1984, 1988, 1992, and later years.
Research question: What was the effect of SEZ designation on local economic growth?
DiD setup:
Treated units: Cities designated as SEZs
Control units: Similar cities without SEZ status
Treatment timing: Varies by city (staggered adoption)
Outcomes: GDP, employment, investment, exports
Challenges illustrating this chapter's themes:
Selection: SEZ designation was not random—initial zones were in strategic coastal locations. Why were these cities chosen? Does selection violate parallel trends?
Staggered adoption: Different waves may have faced different economic environments. TWFE would use early SEZs as controls for later SEZs—problematic if SEZ effects grew over time.
Heterogeneous effects: SEZs near Hong Kong (Shenzhen) likely had different effects than inland SEZs designated in the 1990s.
Spillovers: SEZs may have affected neighboring non-SEZ cities, contaminating the control group.
Modern approach: Wang (2013) and subsequent work apply careful DiD methods:
Use never-SEZ cities as controls (avoiding forbidden comparisons)
Examine pre-trends in growth rates
Estimate heterogeneous effects by SEZ cohort
Test for spillovers to nearby cities
Findings: SEZ designation substantially increased local GDP and exports, with effects persisting and growing over time. Effects were largest for early SEZs with access to Hong Kong capital and expertise.
Methodological lesson: The China SEZ case illustrates both DiD's power (exploiting policy variation for causal inference) and its challenges (selection into treatment, heterogeneous effects, spillovers). Credible analysis requires modern estimators and careful attention to identification threats.
Summary
Key takeaways:
DiD identifies causal effects by comparing changes: The difference between treated and control group changes removes time-invariant confounders and common time trends.
Parallel trends is the key assumption: The control group's trajectory must reflect the treated group's counterfactual. This is untestable but can be probed through pre-trends analysis.
TWFE fails with staggered adoption and heterogeneous effects: The standard approach uses "forbidden comparisons" with already-treated units as controls, potentially producing wrong-signed estimates.
Modern estimators solve the TWFE problem: Callaway-Sant'Anna, Sun-Abraham, and related methods construct valid comparisons explicitly. These should be standard practice for staggered adoption settings.
Event studies visualize dynamics and assess parallel trends: Plotting coefficients by event time reveals both pre-trends (for diagnostics) and the dynamic path of treatment effects.
Returning to the opening question: When a policy changes in one place but not another, we can use the comparison to identify the policy's causal effect—but only if the comparison group's trajectory reveals the treated group's counterfactual. This requires parallel trends, careful attention to treatment timing and heterogeneity, and (in modern practice) estimators designed to handle the complications that arise in real policy settings.
Further Reading
Essential
Cunningham (2021), Causal Inference: The Mixtape, Chapter 9 - Accessible introduction with examples
Roth et al. (2023), "What's Trending in Difference-in-Differences?" - Comprehensive survey of recent developments
For Deeper Understanding
Goodman-Bacon (2021), "Difference-in-Differences with Variation in Treatment Timing" - The key decomposition result explaining TWFE problems
Callaway and Sant'Anna (2021), "Difference-in-Differences with Multiple Time Periods" - Modern estimator and aggregation
Sun and Abraham (2021), "Estimating Dynamic Treatment Effects in Event Studies with Heterogeneous Treatment Effects" - Interaction-weighted estimator
Advanced/Specialized
de Chaisemartin and d'Haultfoeuille (2020), "Two-Way Fixed Effects Estimators with Heterogeneous Treatment Effects" - Negative weights and alternative estimator
Sant'Anna and Zhao (2020), "Doubly Robust Difference-in-Differences Estimators" - Combining outcome and propensity models
Arkhangelsky et al. (2021), "Synthetic Difference-in-Differences" - Combining DiD with synthetic controls
Applications
Card and Krueger (1994), "Minimum Wages and Employment" - The canonical DiD application
Dube, Lester, and Reich (2010), "Minimum Wage Effects Across State Borders" - County-pair design with event studies
Cengiz et al. (2019), "The Effect of Minimum Wages on Low-Wage Jobs" - Bunching estimator approach
Exercises
Conceptual
Explain why the parallel trends assumption is fundamentally untestable. What can researchers do to make it more plausible, and what are the limitations of these approaches?
Consider a state that raises its minimum wage because unemployment has been falling rapidly and the legislature believes workers can now command higher wages. Would DiD using other states as controls identify the causal effect of minimum wage on employment? Why or why not?
In the Goodman-Bacon decomposition, what makes a comparison "forbidden"? Give an example of how such a comparison could yield a wrong-signed estimate.
Applied
Download state-level minimum wage and employment data. Implement both TWFE and Callaway-Sant'Anna estimators for a recent period of minimum wage changes. Do the estimates differ? Produce a Goodman-Bacon decomposition to diagnose the sources of any differences.
Create an event study plot for a policy change of your choosing. Discuss what the pre-trends suggest about parallel trends and what the post-treatment dynamics reveal about the effect's evolution.
Discussion
Some researchers argue that the DiD revolution has made causal inference harder by showing that standard methods were problematic. Others argue it has made inference easier by providing tools to address these problems. Which view do you find more compelling, and why?