# Appendix G: A Guide to Empirical Methods in Public Economics

## Empirical Methods --- Technical Details

> "Essentially, all models are wrong, but some are useful."
>
> --- George E. P. Box, *Empirical Model-Building and Response Surfaces* (1987)

***

### Introduction

Chapter 2 introduced the core ideas of causal inference and surveyed the empirical toolkit that has transformed public economics over the past three decades. This appendix provides the technical foundations that underpin those methods. The treatment here is more formal---we work through derivations, state assumptions precisely, and examine the statistical theory behind each estimator. But the goal remains practical: to equip you to read, evaluate, and eventually produce empirical research in public economics.

We assume familiarity with probability, linear algebra at the level of matrix notation, and basic statistical inference (hypothesis testing, confidence intervals). Readers encountering these methods for the first time should start with Chapter 2 and return here for the technical details. Throughout, we ground the mathematics in concrete public economics applications, because a method is only as useful as the substantive question it helps answer.

The appendix proceeds as follows. Section G.1 revisits the OLS regression framework with attention to its assumptions and failure modes. Section G.2 develops instrumental variables and two-stage least squares. Section G.3 formalizes difference-in-differences and the recent literature on staggered adoption. Section G.4 covers regression discontinuity designs. Section G.5 introduces bunching estimation, a method distinctive to public economics. Section G.6 presents synthetic control methods. Section G.7 discusses the administrative data revolution and its methodological implications.

The table below provides a roadmap, mapping each method to its core identifying assumption, the estimand it recovers, and canonical public economics applications.

| Method                    | Key Assumption                 | Estimand                     | Public Econ Application                           |
| ------------------------- | ------------------------------ | ---------------------------- | ------------------------------------------------- |
| OLS                       | $$E\[\varepsilon \mid X] = 0$$ | Conditional mean             | Descriptive regressions, controls-based estimates |
| IV / 2SLS                 | Exclusion restriction          | LATE (for compliers)         | Draft lottery and earnings; distance to college   |
| Difference-in-Differences | Parallel trends                | ATT                          | EITC and labor supply; minimum wage               |
| Regression Discontinuity  | Continuity at cutoff           | Local ATE at cutoff          | Medicaid thresholds; class size rules             |
| Bunching                  | Smooth counterfactual density  | Elasticity of taxable income | EITC kinks; stamp duty notches                    |
| Synthetic Control         | Weighted match on pre-trends   | ATT for treated unit         | California tobacco program                        |

***

### G.1 The OLS Framework Revisited

#### Population Regression Function vs. Sample Analog

Consider the relationship between an outcome $$Y\_i$$ and a vector of regressors $$X\_i$$. The **population regression function** is defined by the conditional expectation:

$$
E\[Y\_i \mid X\_i] = X\_i' \beta
$$

where $$\beta$$ is a $$k \times 1$$ vector of population parameters. This is a statement about the data-generating process. We never observe the population regression function directly; instead, we estimate it from a sample of $$n$$ observations.

The **sample analog** replaces the population expectation with the sample minimization problem. The OLS estimator $$\hat{\beta}$$ solves:

$$
\hat{\beta} = \arg\min\_b \sum\_{i=1}^{n} (Y\_i - X\_i' b)^2
$$

The familiar closed-form solution is:

$$
\hat{\beta} = (X'X)^{-1} X'Y
$$

where $$X$$ is the $$n \times k$$ matrix of regressors and $$Y$$ is the $$n \times 1$$ vector of outcomes.
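As a sanity check on these formulas, here is a minimal numpy sketch (simulated data with a made-up data-generating process) that forms $$\hat{\beta} = (X'X)^{-1} X'Y$$ directly:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
x = rng.normal(size=n)
# hypothetical DGP: Y = 2 + 3x + noise
Y = 2.0 + 3.0 * x + rng.normal(size=n)

X = np.column_stack([np.ones(n), x])          # n x k regressor matrix (with intercept)
# closed form: beta_hat = (X'X)^{-1} X'Y, computed via a linear solve
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
```

With 1,000 draws the estimates land close to the true intercept and slope; solving the normal equations with `np.linalg.solve` is numerically preferable to forming $$(X'X)^{-1}$$ explicitly.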

An important property of OLS is that it provides the **best linear predictor** of $$Y$$ given $$X$$, regardless of whether the conditional expectation is actually linear. Even when the true relationship is nonlinear, the OLS coefficient converges to the population linear projection coefficient, which has a useful weighted-average-of-slopes interpretation (Angrist and Pischke 2009, Chapter 3). This means OLS is never "wrong" as a descriptive tool---the question is whether the descriptive relationship it captures admits a causal interpretation.

#### Key Assumptions

The causal interpretation of OLS hinges on a set of assumptions that are worth stating precisely, because in public economics they are almost always violated to some degree.

**Assumption 1 (Linearity).** The conditional expectation function is linear in $$X$$:

$$
Y\_i = X\_i' \beta + \varepsilon\_i
$$

where $$\varepsilon\_i$$ is the error term. This is less restrictive than it appears. With flexible functional forms---polynomials, interactions, indicator variables---linearity in parameters can approximate highly nonlinear relationships. What matters is linearity in the parameters $$\beta$$, not in the original variables.

**Assumption 2 (Strict Exogeneity).**

$$
E\[\varepsilon\_i \mid X\_1, X\_2, \ldots, X\_n] = 0
$$

This is the critical assumption for causal inference. It requires that the error term---everything affecting $$Y$$ that is not included in $$X$$---is mean-independent of the regressors. In cross-sectional applications, a weaker version suffices:

$$
E\[\varepsilon\_i \mid X\_i] = 0
$$

When this assumption fails, OLS estimates are **biased** and cannot be interpreted as causal effects. Every quasi-experimental method in this appendix is, at root, a strategy for recovering causal estimates when exogeneity fails for naive OLS.

**Assumption 3 (No Perfect Multicollinearity).** The matrix $$X'X$$ has full rank $$k$$, so that $$(X'X)^{-1}$$ exists. This rules out exact linear relationships among regressors. Near-multicollinearity---regressors that are highly but not perfectly correlated---does not cause bias but inflates standard errors, sometimes dramatically.

**Assumption 4 (Spherical Errors).** The classical assumption is $$\text{Var}(\varepsilon \mid X) = \sigma^2 I\_n$$: errors are homoskedastic and uncorrelated. This assumption is almost never appropriate in applied work.

#### Properties of OLS Under These Assumptions

Under Assumptions 1--3, OLS is **unbiased**: $$E\[\hat{\beta} \mid X] = \beta$$. Adding Assumption 4, the **Gauss-Markov theorem** establishes that OLS is the best linear unbiased estimator (BLUE): no other linear unbiased estimator has a smaller variance. But the Gauss-Markov theorem is often less useful than it sounds. "Best" here means minimum variance among *linear* estimators, and there may be nonlinear estimators (e.g., maximum likelihood under normality) that perform better. More importantly, in applied public economics, the binding constraint is almost always Assumption 2 (exogeneity), not efficiency---we are worried about getting the right answer, not getting the most precise wrong answer.

Under Assumptions 1--4, the variance of $$\hat{\beta}$$ is:

$$
\text{Var}(\hat{\beta} \mid X) = \sigma^2 (X'X)^{-1}
$$

with $$\sigma^2$$ estimated by $$\hat{\sigma}^2 = \frac{1}{n-k} \sum\_{i=1}^{n} \hat{\varepsilon}\_i^2$$. This formula underlies the classical $$t$$-tests and $$F$$-tests reported in regression output. When Assumption 4 fails, these tests are invalid, and we need robust alternatives.

#### Omitted Variable Bias

The most common threat to causal interpretation of OLS is **omitted variable bias (OVB)**. Suppose the true model is:

$$
Y\_i = \beta\_1 X\_{1i} + \beta\_2 X\_{2i} + \varepsilon\_i
$$

but we estimate the "short" regression omitting $$X\_2$$:

$$
Y\_i = \tilde{\beta}\_1 X\_{1i} + \tilde{\varepsilon}\_i
$$

The bias in $$\tilde{\beta}\_1$$ relative to the true $$\beta\_1$$ is:

$$
E\[\tilde{\beta}\_1] - \beta\_1 = \beta\_2 \cdot \delta\_{21}
$$

where $$\delta\_{21}$$ is the coefficient from regressing $$X\_{2i}$$ on $$X\_{1i}$$. This is the **omitted variable bias formula**: the bias equals the effect of the omitted variable ($$\beta\_2$$) times the slope from regressing the omitted variable on the included one ($$\delta\_{21}$$).

This formula is extraordinarily useful for reasoning about the direction and magnitude of bias, even when you cannot eliminate it. The sign of the bias is determined by the product of two terms:

* If the omitted variable has a positive effect on $$Y$$ and is positively correlated with $$X\_1$$, the bias is positive (OLS overstates the true effect).
* If the omitted variable has a positive effect on $$Y$$ but is negatively correlated with $$X\_1$$, the bias is negative (OLS understates or reverses the true effect).

In many public economics applications, we can sign both terms using economic reasoning, even when we cannot measure the omitted variable directly. This makes OVB analysis a powerful tool for assessing the credibility of OLS estimates.
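The OVB formula is not only an asymptotic statement: with sample coefficients it holds as an exact algebraic identity. A short numpy sketch (simulated data, made-up coefficients) verifying that the short-regression coefficient equals $$\hat{\beta}\_1 + \hat{\beta}\_2 \hat{\delta}\_{21}$$:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(size=n)        # omitted variable, correlated with x1
y = 1.0 + 2.0 * x1 + 1.5 * x2 + rng.normal(size=n)

def ols(X, y):
    return np.linalg.solve(X.T @ X, X.T @ y)

ones = np.ones(n)
b_long = ols(np.column_stack([ones, x1, x2]), y)   # [const, beta1_hat, beta2_hat]
b_short = ols(np.column_stack([ones, x1]), y)      # [const, beta1_tilde]
delta = ols(np.column_stack([ones, x1]), x2)       # auxiliary regression of x2 on x1

# in-sample OVB identity: beta1_tilde = beta1_hat + beta2_hat * delta21
bias_implied = b_long[2] * delta[1]
```

Here the omitted variable has a positive effect and a positive correlation with the included regressor, so the short regression overstates the effect, exactly as the sign logic above predicts.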

**Example: Test Scores and School Spending.** Suppose we regress student test scores on per-pupil spending across school districts. The naive coefficient might show a weak or even negative relationship. Why? Districts with high spending are often urban districts with concentrated poverty. Family income raises test scores ($$\beta\_2 > 0$$) but is negatively correlated with spending in the cross-section ($$\delta\_{21} < 0$$), producing a negative omitted variable bias that can overwhelm the true positive effect of spending. This is why naive cross-sectional regressions of test scores on spending are nearly uninformative about the causal effect of school resources---a point reinforced by the extensive literature following Hanushek (1986) and challenged by Jackson, Johnson, and Persico (2016), who use quasi-experimental variation to find substantial positive effects. (See Chapter 15 for a full discussion.)

#### Sensitivity Analysis: Oster (2019)

A practical tool for assessing the severity of omitted variable bias is the method proposed by Oster (2019), building on Altonji, Elder, and Taber (2005). The idea is to examine how the coefficient of interest changes as observable controls are added to the regression, and to use the degree of change to bound the bias from unobservable confounders.

Specifically, if adding a rich set of controls moves the coefficient from $$\hat{\beta}^{short}$$ to $$\hat{\beta}^{long}$$, and simultaneously moves the $$R^2$$ from $$R^2\_{short}$$ to $$R^2\_{long}$$, we can ask: how much additional confounding from unobservables (measured by the proportional change in $$R^2$$ relative to a hypothetical maximum $$R^2\_{max}$$) would be needed to drive the coefficient to zero?

The key statistic is $$\delta$$: the ratio of unobservable to observable selection that would be needed to fully explain away the estimate. Oster suggests that $$\delta > 1$$ is a reasonable benchmark for robustness---if unobservables would need to be *more* important than observables to eliminate the effect, the result is relatively robust to selection on unobservables.

This approach does not prove causation, but it provides a disciplined way to assess how worried one should be about omitted variable bias in a given application. It has become standard practice in applied microeconomics and is particularly useful in public economics settings where quasi-experimental variation is unavailable.

#### Robust Standard Errors and Clustering

In practice, the spherical errors assumption fails in two important ways that demand different corrections.

**Heteroskedasticity** means $$\text{Var}(\varepsilon\_i \mid X\_i)$$ varies across observations. The solution is the **heteroskedasticity-robust** (or "Huber-White-Eicker") variance estimator:

$$
\widehat{\text{Var}}(\hat{\beta}) = (X'X)^{-1} \left( \sum\_{i=1}^{n} \hat{\varepsilon}\_i^2 X\_i X\_i' \right) (X'X)^{-1}
$$

This is now standard practice. There is essentially no reason to report classical (homoskedastic) standard errors in applied work.
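The sandwich formula translates directly into code. A minimal numpy sketch (simulated heteroskedastic data; the HC0 variant, with no small-sample correction):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 400
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
# heteroskedastic errors: the noise standard deviation rises with |x|
y = 1.0 + 0.5 * x + rng.normal(size=n) * (0.5 + np.abs(x))

XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
e = y - X @ beta

# HC0 sandwich: (X'X)^{-1} (sum_i e_i^2 x_i x_i') (X'X)^{-1}
meat = (X * (e**2)[:, None]).T @ X
V_robust = XtX_inv @ meat @ XtX_inv
se_robust = np.sqrt(np.diag(V_robust))
```

Statistical packages apply finite-sample corrections (HC1-HC3) on top of this basic sandwich; the bread-meat-bread structure is the same.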

**Clustering** arises when errors are correlated within groups---students within schools, individuals within states, or time periods within firms. If a state-level policy affects all individuals in the state similarly, individual-level observations within that state are not independent. Ignoring this correlation produces standard errors that are too small, leading to over-rejection of null hypotheses.

The **cluster-robust** variance estimator generalizes the Huber-White formula by allowing arbitrary correlation within clusters:

$$
\widehat{\text{Var}}(\hat{\beta}) = (X'X)^{-1} \left( \sum\_{g=1}^{G} X\_g' \hat{\varepsilon}\_g \hat{\varepsilon}\_g' X\_g \right) (X'X)^{-1}
$$

where $$g = 1, \ldots, G$$ indexes clusters and $$\hat{\varepsilon}\_g$$ is the vector of residuals for cluster $$g$$.
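A numpy sketch of the cluster-robust formula on simulated data with a cluster-level regressor and a shared within-cluster shock (all numbers made up). Comparing it with the heteroskedasticity-robust standard error shows how badly ignoring the within-cluster correlation understates uncertainty:

```python
import numpy as np

rng = np.random.default_rng(3)
G, m = 40, 25                              # 40 clusters (e.g. states), 25 units each
n = G * m
cluster = np.repeat(np.arange(G), m)
x = rng.normal(size=G)[cluster]            # regressor assigned at the cluster level
u = rng.normal(size=G)[cluster]            # cluster-level shock shared within cluster
y = 1.0 + 0.5 * x + u + rng.normal(size=n)

X = np.column_stack([np.ones(n), x])
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
e = y - X @ beta

# heteroskedasticity-robust meat (ignores within-cluster correlation)
meat_hc = (X * (e**2)[:, None]).T @ X
se_hc = np.sqrt(np.diag(XtX_inv @ meat_hc @ XtX_inv))

# cluster-robust meat: sum over clusters of (X_g' e_g)(X_g' e_g)'
meat_cl = np.zeros((2, 2))
for g in range(G):
    idx = cluster == g
    s = X[idx].T @ e[idx]                  # score vector for cluster g
    meat_cl += np.outer(s, s)
se_cl = np.sqrt(np.diag(XtX_inv @ meat_cl @ XtX_inv))
```

In this design the cluster-robust standard error on the slope is several times the heteroskedasticity-robust one, illustrating why clustering at the level of treatment assignment matters.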

A practical rule of thumb from Cameron, Gelbach, and Miller (2008): cluster-robust inference requires a reasonably large number of clusters (often cited as $$G \geq 50$$). With fewer clusters, wild cluster bootstrap methods provide more reliable inference. Bertrand, Duflo, and Mullainathan (2004) demonstrated that failing to cluster appropriately in difference-in-differences designs can produce false positive rates of 45 percent or more---a warning that reshaped applied practice in public economics.

**When in doubt about the right level of clustering, cluster at the level of treatment assignment.** If a policy is implemented at the state level, cluster at the state level, even if your data are at the individual level. Abadie, Athey, Imbens, and Wooldridge (2023) provide a more nuanced framework: clustering adjusts for two distinct issues---heterogeneity in treatment effects across clusters and sampling uncertainty when clusters are sampled from a larger population. When the researcher observes all clusters of interest (e.g., all 50 states), the rationale for clustering shifts from sampling to design-based uncertainty.

***

### G.2 Instrumental Variables and Two-Stage Least Squares

#### The Problem

When $$E\[\varepsilon\_i \mid X\_i] \neq 0$$---that is, when the regressor of interest is endogenous---OLS is biased and inconsistent. Endogeneity arises from omitted variables, simultaneity, or measurement error. **Instrumental variables (IV)** provide a solution by isolating variation in the endogenous regressor that is unrelated to the error term.

#### The IV Setup

Consider the structural equation:

$$
Y\_i = \beta\_0 + \beta\_1 X\_i + \varepsilon\_i
$$

where $$X\_i$$ is endogenous: $$\text{Cov}(X\_i, \varepsilon\_i) \neq 0$$. An **instrument** $$Z\_i$$ must satisfy two conditions:

1. **Relevance:** $$\text{Cov}(Z\_i, X\_i) \neq 0$$. The instrument must be correlated with the endogenous regressor.
2. **Exclusion restriction:** $$\text{Cov}(Z\_i, \varepsilon\_i) = 0$$. The instrument affects $$Y$$ only through $$X$$, not directly.

The relevance condition is testable. The exclusion restriction is not---it is an assumption that must be defended on substantive grounds. This asymmetry is the central challenge of IV estimation: the most important assumption is the one you cannot verify empirically.

#### Mechanics of Two-Stage Least Squares

The **Wald estimator** is the simplest IV estimator, applicable when the instrument is binary. If $$Z\_i \in \{0, 1\}$$:

$$
\hat{\beta}\_1^{IV} = \frac{E\[Y\_i \mid Z\_i = 1] - E\[Y\_i \mid Z\_i = 0]}{E\[X\_i \mid Z\_i = 1] - E\[X\_i \mid Z\_i = 0]} = \frac{\text{Reduced form}}{\text{First stage}}
$$

The numerator is the **reduced form**: the effect of the instrument on the outcome. The denominator is the **first stage**: the effect of the instrument on the endogenous regressor. The IV estimate is the ratio.

This representation yields a useful insight: **always examine the reduced form**. If the instrument has no statistically significant effect on the outcome (the reduced form is zero), the IV estimate is zero regardless of the first stage. Conversely, a strong reduced form combined with a strong first stage is the best evidence for a meaningful IV result. Reporting the reduced form alongside the 2SLS estimate should be standard practice.
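A simulation sketch of the Wald estimator (hypothetical DGP with a binary instrument, an unobserved confounder, and a homogeneous treatment effect of 2): the ratio of reduced form to first stage recovers the effect, while the naive treated-control comparison is biased upward by selection.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 20000
z = rng.integers(0, 2, size=n)             # binary instrument, randomly assigned
ability = rng.normal(size=n)               # unobserved confounder
# take-up depends on the instrument AND on the confounder (selection into treatment)
d = ((0.8 * z + ability + rng.normal(size=n)) > 0.5).astype(float)
y = 2.0 * d + 1.0 * ability + rng.normal(size=n)   # true effect of d is 2

reduced_form = y[z == 1].mean() - y[z == 0].mean()  # effect of Z on Y
first_stage = d[z == 1].mean() - d[z == 0].mean()   # effect of Z on D
wald = reduced_form / first_stage

naive = y[d == 1].mean() - y[d == 0].mean()         # biased by selection on ability
```

Because the instrument is independent of `ability`, the Wald ratio is close to 2, while the naive comparison mixes the treatment effect with the higher average ability of the treated.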

More generally, **two-stage least squares (2SLS)** proceeds in two steps:

**First stage:** Regress the endogenous variable on the instrument(s) and any exogenous controls:

$$
X\_i = \pi\_0 + \pi\_1 Z\_i + W\_i' \gamma + v\_i
$$

where $$W\_i$$ are exogenous covariates. Save the predicted values $$\hat{X}\_i$$.

**Second stage:** Regress the outcome on the predicted values and exogenous controls:

$$
Y\_i = \beta\_0 + \beta\_1 \hat{X}\_i + W\_i' \delta + \eta\_i
$$

The coefficient $$\hat{\beta}\_1$$ from the second stage is the 2SLS estimate of the causal effect. Note that standard errors must be computed from the structural equation using the actual (not predicted) residuals; most statistical software handles this automatically. A common pitfall is to run the two stages manually and report second-stage standard errors that condition on the predicted values as if they were data---this produces incorrect standard errors. Always use a dedicated IV/2SLS command.

In matrix notation, with $$Z$$ as the matrix of instruments and exogenous controls:

$$
\hat{\beta}\_{2SLS} = (X'P\_Z X)^{-1} X'P\_Z Y
$$

where $$P\_Z = Z(Z'Z)^{-1}Z'$$ is the projection matrix onto the column space of $$Z$$.
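The one-step matrix formula and the two-step procedure give numerically identical coefficients. A numpy sketch (simulated data; instrument `z`, exogenous control `w`, unobserved confounder, all made up) verifying this:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 5000
z = rng.normal(size=n)
w = rng.normal(size=n)                      # exogenous control
conf = rng.normal(size=n)                   # unobserved confounder
x = 0.7 * z + 0.3 * w + conf + rng.normal(size=n)    # endogenous regressor
y = 1.0 + 2.0 * x + 0.5 * w + conf + rng.normal(size=n)

ones = np.ones(n)
Z = np.column_stack([ones, z, w])           # instruments + exogenous controls
X = np.column_stack([ones, x, w])

# one step: beta = (X' P_Z X)^{-1} X' P_Z Y, using P_Z X = Z (Z'Z)^{-1} Z' X
PZ_X = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)
beta_2sls = np.linalg.solve(PZ_X.T @ X, PZ_X.T @ y)

# two steps: first-stage fitted values, then OLS of y on [1, x_hat, w]
xhat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ x)
X2 = np.column_stack([ones, xhat, w])
beta_two_step = np.linalg.solve(X2.T @ X2, X2.T @ y)
```

The coefficients on $$x$$ agree to machine precision, and both are close to the true effect of 2 despite the confounder. (As noted above, the second-step standard errors would still be wrong; only the point estimate coincides.)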

#### The LATE Interpretation

A foundational result by Imbens and Angrist (1994) shows that IV estimates have a specific causal interpretation under heterogeneous treatment effects. When the treatment $$X\_i$$ is binary and the instrument $$Z\_i$$ is binary, the population can be divided into four groups based on how their treatment status responds to the instrument:

* **Compliers:** individuals who take treatment when $$Z = 1$$ and do not when $$Z = 0$$
* **Always-takers:** individuals who take treatment regardless of $$Z$$
* **Never-takers:** individuals who never take treatment regardless of $$Z$$
* **Defiers:** individuals who take treatment when $$Z = 0$$ and do not when $$Z = 1$$

Under the **monotonicity** assumption (no defiers), the IV estimator identifies the **local average treatment effect (LATE)**:

$$
\hat{\beta}\_{IV} \xrightarrow{p} E\[Y\_i(1) - Y\_i(0) \mid \text{Complier}]
$$

This is the average causal effect *for compliers only*---the subpopulation whose behavior is actually changed by the instrument. The LATE need not equal the average treatment effect for the entire population. This matters for policy: if the government is considering scaling up a program, the effect on compliers (those induced to participate by marginal changes in access) may differ from the effect on always-takers (who would participate regardless) or never-takers (who would not participate regardless).

The LATE interpretation also implies that **different instruments for the same endogenous variable can produce different IV estimates**, and both can be "correct." Each instrument identifies the effect for a different subpopulation of compliers. For example, an IV estimate of the returns to schooling using the Vietnam draft lottery as an instrument identifies the effect for men whose education was disrupted by military service, while an estimate using college proximity identifies the effect for men whose education was facilitated by geographic access. These are different populations, and the true treatment effects may genuinely differ.

#### Weak Instruments

When the first stage is weak---the instrument is only weakly correlated with the endogenous regressor---2SLS suffers from severe problems. The estimator is biased toward the OLS estimate in finite samples, confidence intervals have poor coverage, and test statistics are unreliable.

The standard diagnostic is the **first-stage F-statistic** testing $$\pi\_1 = 0$$ in the first-stage regression. Staiger and Stock (1997) proposed the rule of thumb that $$F > 10$$ is necessary for reliable 2SLS inference. Stock and Yogo (2005) provided more formal critical values based on the degree of bias or size distortion the researcher is willing to tolerate.

With weak instruments, alternative estimators are available. The **limited information maximum likelihood (LIML)** estimator has better finite-sample properties than 2SLS when instruments are weak. The **Anderson-Rubin (AR) test** provides valid inference regardless of instrument strength, though it can be conservative. More recent approaches by Andrews, Stock, and Sun (2019) provide weak-instrument-robust confidence sets.

Lee, McCrary, Moreira, and Porter (2022) have argued that the traditional $$F > 10$$ threshold is insufficient and proposed a more demanding threshold (effectively $$F > 104.7$$ for a single endogenous regressor with one instrument when the researcher wants a 5 percent test with correct size). While this stricter threshold remains debated, it underscores the severity of the weak instruments problem and the importance of reporting first-stage diagnostics transparently.

**Report the first-stage F-statistic in every IV paper you write.** This is not optional.
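Computing the statistic is straightforward: with a single excluded instrument, the first-stage F is just the squared t-statistic on $$\pi\_1$$. A numpy sketch comparing a strong and a weak simulated first stage (coefficients made up):

```python
import numpy as np

def first_stage_F(z, x):
    """First-stage F-statistic (= t^2 with one excluded instrument)."""
    n = len(z)
    Z = np.column_stack([np.ones(n), z])
    pi = np.linalg.solve(Z.T @ Z, Z.T @ x)
    resid = x - Z @ pi
    sigma2 = resid @ resid / (n - 2)            # homoskedastic error variance
    var_pi1 = sigma2 * np.linalg.inv(Z.T @ Z)[1, 1]
    return pi[1] ** 2 / var_pi1

rng = np.random.default_rng(6)
n = 2000
z = rng.normal(size=n)
x_strong = 0.5 * z + rng.normal(size=n)         # strong first stage
x_weak = 0.02 * z + rng.normal(size=n)          # weak first stage
F_strong = first_stage_F(z, x_strong)
F_weak = first_stage_F(z, x_weak)
```

In applied work one would use the heteroskedasticity- or cluster-robust version of this F; the homoskedastic formula here is only for illustration.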

#### Overidentification and Multiple Instruments

When a researcher has more instruments than endogenous regressors (the **overidentified** case), additional diagnostic tools become available. The **Sargan-Hansen test** (also called the $$J$$-test) tests the joint validity of the instruments under the maintained assumption that at least one instrument is valid. The test statistic is:

$$
J = \frac{\hat{\varepsilon}\_{2SLS}' Z (Z'Z)^{-1} Z' \hat{\varepsilon}\_{2SLS}}{\hat{\varepsilon}\_{2SLS}' \hat{\varepsilon}\_{2SLS} / n} \sim \chi^2(m - k)
$$

where $$m$$ is the number of instruments and $$k$$ is the number of endogenous regressors. Rejection suggests that at least one instrument violates the exclusion restriction.

The test has an important limitation: it has no power against violations that are common to all instruments. If every instrument is invalid in the same direction, the $$J$$-test will not detect the problem. For this reason, it is best thought of as testing whether different instruments produce *consistent* estimates, not whether any individual instrument is valid.

In practice, researchers with multiple instruments should report the 2SLS estimate, the first-stage F-statistic, the $$J$$-test statistic, and---where feasible---separate IV estimates using each instrument individually. If different instruments yield very different estimates, this is informative about the plausibility of the exclusion restriction even if the $$J$$-test fails to reject.

#### Example: The Vietnam Draft Lottery and Earnings

Angrist (1990) used the Vietnam-era draft lottery to estimate the effect of military service on civilian earnings. The draft lottery randomly assigned priority numbers to men based on birth date, creating exogenous variation in the probability of military service.

The instrument (draft lottery number) satisfies relevance: men with low lottery numbers were substantially more likely to serve. The exclusion restriction requires that lottery numbers affect earnings only through military service, not through other channels. This is plausible because the numbers were randomly assigned, though one might worry about psychological effects of being draft-eligible even among non-servers.

Angrist estimated that military service reduced earnings by approximately 15 percent for white veterans---a LATE for compliers (men induced to serve by the draft who would not have volunteered). This is a different population than the average veteran, and the negative earnings effect contrasts with the positive OLS coefficient, which reflects selection (volunteers tend to have characteristics associated with higher earnings). The reversal from positive OLS to negative IV estimates is a textbook demonstration of how selection bias can completely mislead naive estimation.

The draft lottery example also illustrates the distinction between **intent-to-treat (ITT)** and **treatment on the treated (TOT)** effects. The reduced form---the effect of draft eligibility on earnings---is the ITT effect. The 2SLS estimate---the effect of actual military service, instrumented by draft eligibility---is the TOT (or LATE) for compliers. The ITT effect is smaller in magnitude because not everyone who was draft-eligible actually served, but it requires only the (highly credible) assumption that the lottery was truly random. The TOT requires the additional assumption that the lottery affects earnings only through service. In policy contexts, the ITT is often the more directly relevant parameter, since governments control eligibility rules, not individual compliance.

#### Example: Distance to College and Educational Attainment

Card (1993) used geographic proximity to a four-year college as an instrument for years of schooling in an earnings equation. Men who grew up near a college faced lower costs of attending and consequently obtained more schooling. If distance affects earnings only through education---the exclusion restriction---then proximity serves as a valid instrument.

Card estimated returns to schooling of roughly 13 percent per year, substantially above OLS estimates of 7-8 percent. This pattern---IV estimates exceeding OLS---is consistent with the LATE framework if the compliers (individuals whose schooling is affected by proximity) have above-average returns to education. Intuitively, these might be individuals from disadvantaged backgrounds for whom college access is most binding, and who stand to gain the most from additional schooling.

The Card (1993) study also illustrates the fragility of the exclusion restriction. One might worry that growing up near a college reflects living in a more economically developed area, which affects earnings through channels other than education. This concern cannot be fully resolved empirically and requires judgment about the plausibility of alternative channels. (See Chapter 15 for a broader discussion of the returns to education.)

***

### G.3 Difference-in-Differences

#### The Basic 2x2 Setup

Chapter 2 introduced the intuition behind difference-in-differences. Here we formalize it.

Consider two groups ($$g \in \{0, 1\}$$, where $$g = 1$$ is the treatment group) observed in two periods ($$t \in \{0, 1\}$$, where $$t = 1$$ is the post-treatment period). The potential outcome for unit $$i$$ in group $$g$$ at time $$t$$ in the absence of treatment is:

$$
Y\_{igt}(0) = \alpha\_g + \lambda\_t + \varepsilon\_{igt}
$$

This additive structure implies that time trends ($$\lambda\_t$$) are common across groups---the **parallel trends assumption**. The treatment effect is additive:

$$
Y\_{igt}(1) = Y\_{igt}(0) + \tau
$$

The observed outcome is:

$$
Y\_{igt} = \alpha\_g + \lambda\_t + \tau \cdot D\_{gt} + \varepsilon\_{igt}
$$

where $$D\_{gt} = 1$$ if $$g = 1$$ and $$t = 1$$ (the treated group in the post period). The DiD estimator is:

$$
\hat{\tau}\_{DiD} = (\bar{Y}\_{1,1} - \bar{Y}\_{1,0}) - (\bar{Y}\_{0,1} - \bar{Y}\_{0,0})
$$

Under parallel trends and no anticipation (the treatment group does not change behavior before treatment), $$\hat{\tau}\_{DiD}$$ is an unbiased estimator of the average treatment effect on the treated (ATT).

Note that DiD identifies the ATT, not the ATE. It tells us what happened to the treated group relative to what would have happened without treatment. The effect for the treated group might differ from what would happen if we treated the control group, because the two groups may respond differently. This distinction matters when scaling up policies.

In regression form:

$$
Y\_{igt} = \beta\_0 + \beta\_1 \cdot \text{Treat}\_g + \beta\_2 \cdot \text{Post}\_t + \beta\_3 \cdot (\text{Treat}\_g \times \text{Post}\_t) + \varepsilon\_{igt}
$$

The coefficient $$\beta\_3$$ is the DiD estimate. The regression framework easily accommodates covariates, multiple groups, multiple time periods, and fixed effects.
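Because the 2x2 regression is saturated, the interaction coefficient $$\beta\_3$$ equals the double difference of cell means exactly. A numpy sketch on simulated data (made-up true effect of 2):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 2000
treat = rng.integers(0, 2, size=n).astype(float)
post = rng.integers(0, 2, size=n).astype(float)
# hypothetical DGP: group effect 0.5, time effect 0.8, treatment effect 2.0
y = 1.0 + 0.5 * treat + 0.8 * post + 2.0 * treat * post + rng.normal(size=n)

X = np.column_stack([np.ones(n), treat, post, treat * post])
b = np.linalg.solve(X.T @ X, X.T @ y)       # b[3] is the DiD coefficient

# the same number as a double difference of the four cell means
cell = lambda g, t: y[(treat == g) & (post == t)].mean()
did = (cell(1, 1) - cell(1, 0)) - (cell(0, 1) - cell(0, 0))
```

The regression form matters in practice because, unlike raw cell means, it extends directly to covariates, many groups and periods, and fixed effects.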

#### The Parallel Trends Assumption

The parallel trends assumption is the identifying assumption of DiD. It requires that, in the absence of treatment, the treatment and control groups would have followed the same trajectory over time. Formally:

$$
E\[Y\_{igt}(0) \mid g = 1, t = 1] - E\[Y\_{igt}(0) \mid g = 1, t = 0] = E\[Y\_{igt}(0) \mid g = 0, t = 1] - E\[Y\_{igt}(0) \mid g = 0, t = 0]
$$

This is fundamentally untestable because we never observe $$Y\_{igt}(0)$$ for the treated group in the post-period---that is precisely the counterfactual we are trying to construct. However, researchers can assess its plausibility by examining **pre-treatment trends**. If the two groups moved in parallel before treatment, it is more credible (though not guaranteed) that they would have continued to do so.

Note that parallel trends is about *changes*, not *levels*. The treatment and control groups need not have the same average outcome---they can differ by a fixed amount (the group effect $$\alpha\_g$$). What matters is that the time path would have been the same. This is why DiD is more credible than simple cross-sectional comparisons: it allows for time-invariant differences between groups while eliminating common time trends.

#### Event Study and Dynamic DiD Specifications

The most common way to assess parallel trends and examine dynamic treatment effects is the **event study** specification. With multiple pre- and post-treatment periods, estimate:

$$
Y\_{igt} = \alpha\_i + \lambda\_t + \sum\_{k \neq -1} \gamma\_k \cdot D\_g \cdot \mathbf{1}\{t - t^\* = k\} + X\_{igt}' \delta + \varepsilon\_{igt}
$$

where $$\alpha\_i$$ are unit fixed effects, $$\lambda\_t$$ are time fixed effects, $$t^\*$$ is the treatment date, and $$k$$ indexes periods relative to treatment. The period $$k = -1$$ is normalized to zero as the reference category.

The pre-treatment coefficients ($$\gamma\_k$$ for $$k < -1$$) should be indistinguishable from zero if parallel trends holds. Statistically significant pre-trends suggest that the treatment and control groups were diverging before the policy change, undermining the DiD design. The post-treatment coefficients ($$\gamma\_k$$ for $$k \geq 0$$) trace out the dynamic treatment effect.

The event study plot---a graph of the $$\hat{\gamma}\_k$$ coefficients with confidence intervals against relative time $$k$$---has become the most important visual diagnostic in applied DiD work. A good event study plot shows: (1) flat pre-trends near zero, (2) a clear break at the treatment date, and (3) a pattern of post-treatment effects consistent with the policy mechanism.
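A deterministic numpy sketch of the event study regression (made-up numbers: one treated unit, one never-treated control, treatment at $$t^\* = 4$$, true dynamic effects of +1 and +2): with exact parallel trends, the estimated $$\gamma\_k$$ are zero before treatment and recover the dynamic effects afterward.

```python
import numpy as np

# One treated unit (treated at t* = 4) and one never-treated control, t = 1..5.
# Parallel trends by construction: control Y = t, treated Y(0) = 5 + t,
# with true dynamic effects +1 at t = 4 and +2 at t = 5.
T = 5
t = np.tile(np.arange(1, T + 1), 2)
treated = np.repeat([1.0, 0.0], T)                  # first 5 rows treated
Y = np.concatenate([[6, 7, 8, 10, 12], [1, 2, 3, 4, 5]]).astype(float)

tstar = 4
k = t - tstar                                       # event time
cols = [np.ones(2 * T), treated]                    # intercept, unit FE
cols += [(t == s).astype(float) for s in range(2, T + 1)]   # time FE (t=1 omitted)
ks = [-3, -2, 0, 1]                                 # k = -1 is the reference period
for kk in ks:
    cols.append(treated * (k == kk))                # event-time dummies
X = np.column_stack(cols)
gamma = np.linalg.lstsq(X, Y, rcond=None)[0][-len(ks):]
# gamma: flat pre-trend coefficients, then the dynamic effects
```

Plotting `gamma` (with k = -1 fixed at zero) against event time reproduces exactly the diagnostic described above: flat pre-trends, a break at the treatment date, and the dynamic path of effects.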

A word of caution: failing to reject the null of zero pre-trends is necessary but not sufficient. Pre-trends tests have limited statistical power, and the absence of evidence is not evidence of absence. Roth (2022) shows that studies that pass conventional pre-trends tests may still suffer from substantial bias. Rambachan and Roth (2023) develop methods for conducting sensitivity analysis under violations of parallel trends, allowing researchers to ask: how large would a violation need to be to overturn the results?

#### Staggered Treatment: Problems with Two-Way Fixed Effects

Much of applied DiD research uses a **two-way fixed effects (TWFE)** regression when treatment is adopted at different times across units:

$$
Y\_{it} = \alpha\_i + \lambda\_t + \beta \cdot D\_{it} + \varepsilon\_{it}
$$

where $$D\_{it} = 1$$ if unit $$i$$ has been treated by time $$t$$. The coefficient $$\beta$$ was traditionally interpreted as the average treatment effect.

A series of papers starting around 2020 demonstrated that this interpretation can be badly wrong when treatment effects are heterogeneous across units or over time.

**Goodman-Bacon (2021)** showed that the TWFE estimator in the staggered setting is a weighted average of all possible 2x2 DiD comparisons. Critically, some of these comparisons use *already-treated* units as controls for *later-treated* units. If treatment effects evolve over time, already-treated units are "contaminated" controls, and the resulting weights can be negative---meaning that some groups' treatment effects enter the overall estimate with the wrong sign. In the most pathological cases, the TWFE estimate can have the opposite sign of every group-specific treatment effect.

**de Chaisemartin and d'Haultfoeuille (2020)** formalized conditions under which TWFE fails and proposed a diagnostic: computing the weights on each group-period treatment effect. If many weights are negative, the TWFE estimate is unreliable.

**Sun and Abraham (2021)** showed that the TWFE event study coefficients are also contaminated: pre-trend tests may appear to pass even when parallel trends is violated, and post-treatment dynamics may be distorted.

These findings sent shockwaves through applied economics. Many published results using staggered DiD were potentially affected. The practical question became: how do we do DiD correctly in the staggered setting?

#### Modern Estimators for Staggered DiD

Several alternative estimators have been proposed that avoid the pitfalls of TWFE:

**Callaway and Sant'Anna (2021)** estimate group-time average treatment effects $$ATT(g, t)$$---the average effect for cohort $$g$$ (units first treated at time $$g$$) at time $$t$$. These building blocks are then aggregated into summary measures using appropriate weights. The approach permits flexible parallel trends assumptions (conditional on covariates or unconditional) and provides valid event study plots. Software: `did` in R, `csdid` in Stata.

**Sun and Abraham (2021)** propose an **interaction-weighted estimator** that estimates cohort-specific effects and then averages them. Their approach is closer to the traditional event study regression but corrects for contamination by saturating the model with cohort-specific relative-time indicators. Software: `eventstudyinteract` in Stata, `sunab` in R's `fixest` package.

**Borusyak, Jaravel, and Spiess (2024)** develop an **imputation estimator** that first estimates the counterfactual outcomes for treated observations using only untreated observations, then computes treatment effects as the difference between observed and imputed outcomes. This approach is computationally efficient, naturally handles unbalanced panels, and has an intuitive two-step structure. Software: `did_imputation` in Stata, `didimputation` in R.

**de Chaisemartin and d'Haultfoeuille (2020)** propose a simple estimator based on "switching" treatment status---comparing units that switch into treatment to those that remain untreated in each period. Software: `did_multiplegt` in Stata.

**Practical guidance.** For new work with staggered treatment adoption:

1. Start with the Callaway-Sant'Anna or Borusyak-Jaravel-Spiess estimator.
2. Report group-time effects and their aggregation.
3. Run TWFE as a comparison but do not rely on it as the primary specification.
4. Check for negative weights using the de Chaisemartin-d'Haultfoeuille diagnostic.
5. Conduct sensitivity analysis for parallel trends violations using Rambachan-Roth.

The good news: in many applications, the modern estimators produce results similar to TWFE. The staggered DiD literature identified a real problem, but it is not always a severe one. The cases where TWFE badly misleads tend to involve substantial heterogeneity in treatment effects across cohorts or over time. When effects are approximately homogeneous, TWFE remains a reasonable approximation.

#### Triple Differences

When parallel trends is questionable for a standard DiD, researchers sometimes add a third differencing dimension. **Triple differences (DDD)** uses a within-group comparison to purge common trends.

For example, to estimate the effect of a state-level policy on women's labor supply, one might compare: (women in treatment states vs. women in control states) minus (men in treatment states vs. men in control states). If men are unaffected by the policy, they serve as a within-state control for trends that differ across states. Formally:

$$
\hat{\tau}_{DDD} = [(\bar{Y}_{T,W,Post} - \bar{Y}_{T,W,Pre}) - (\bar{Y}_{C,W,Post} - \bar{Y}_{C,W,Pre})] - [(\bar{Y}_{T,M,Post} - \bar{Y}_{T,M,Pre}) - (\bar{Y}_{C,M,Post} - \bar{Y}_{C,M,Pre})]
$$

The DDD estimator requires a weaker assumption: that the *difference* in trends between the affected and unaffected subgroups is the same across treatment and control states. This is often more plausible than parallel trends alone, but the estimator is more demanding of data and harder to visualize.
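A small worked example shows the mechanics of the formula above. All cell means are invented: treatment states experience a common shock of +2 for everyone plus a true policy effect of +1.5 on women only, so the naive women-only DiD is contaminated while the triple difference recovers 1.5.

```python
# Eight cell means for (state group, gender group, period); values invented.
cells = {
    ("T", "W", "Pre"): 50.0, ("T", "W", "Post"): 53.5,  # +2 shock, +1.5 effect
    ("C", "W", "Pre"): 48.0, ("C", "W", "Post"): 48.0,
    ("T", "M", "Pre"): 60.0, ("T", "M", "Post"): 62.0,  # +2 shock only
    ("C", "M", "Pre"): 59.0, ("C", "M", "Post"): 59.0,
}

def change(state, group):
    return cells[(state, group, "Post")] - cells[(state, group, "Pre")]

dd_women = change("T", "W") - change("C", "W")  # 3.5: effect + state shock
dd_men = change("T", "M") - change("C", "M")    # 2.0: state shock alone
ddd = dd_women - dd_men                         # 1.5: the policy effect
```

The men's DiD measures exactly the treatment-state-specific trend, which the third difference removes.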

In regression form, DDD involves a full set of two-way and three-way interactions:

$$
Y_{igst} = \ldots + \beta_{DDD} \cdot (\text{Treat}_s \times \text{Post}_t \times \text{Affected}_g) + \ldots + \varepsilon_{igst}
$$

where $$\text{Affected}_g$$ is an indicator for the affected subgroup (e.g., women), $$\text{Treat}_s$$ indicates treatment states, and $$\text{Post}_t$$ indicates the post-treatment period. The triple-interaction coefficient $$\beta_{DDD}$$ is the estimand of interest.

#### Example: Minimum Wage and Employment

Card and Krueger (1994) used a DiD design to study the effect of New Jersey's 1992 minimum wage increase on fast-food employment, comparing New Jersey restaurants to similar restaurants in eastern Pennsylvania. Their finding of no employment decline challenged the standard competitive model's prediction and ignited one of the most productive debates in labor economics.

The parallel trends assumption here requires that fast-food employment would have evolved similarly in New Jersey and eastern Pennsylvania absent the minimum wage increase. The geographic proximity and similarity of the fast-food labor markets on both sides of the border make this more plausible than comparing geographically distant states.

Subsequent work by Dube, Lester, and Reich (2010) generalized this approach by comparing *all* contiguous county pairs straddling state borders, effectively estimating many local DiD designs and finding minimal employment effects of minimum wage increases. This "border discontinuity" approach addresses concerns about selecting control groups and has become a template for credible DiD designs. (See Chapter 17 on tax incidence for related discussion of how minimum wages affect the distribution of surplus.)

#### Example: EITC Expansion and Labor Force Participation

Eissa and Liebman (1996) used a DiD design to study how the 1986 expansion of the Earned Income Tax Credit affected labor force participation of single mothers. The treatment group was single mothers (who benefited from the expansion); the control group was single women without children (who did not).

The study found that the EITC expansion increased labor force participation among single mothers by approximately 2.8 percentage points---evidence that the EITC's phase-in region creates positive work incentives, consistent with economic theory. This study was influential in the subsequent expansions of the EITC in the 1990s and remains a canonical application of DiD in public economics. (See Chapter 18 on the income tax for a thorough treatment of the EITC's incentive structure.)

***

### G.4 Regression Discontinuity Design

#### Sharp RD

A **regression discontinuity (RD) design** exploits a threshold rule that determines treatment assignment. When a continuous variable (the **running variable** or **forcing variable**) $$X_i$$ crosses a known cutoff $$c$$, treatment status changes discontinuously:

$$
D_i = \mathbf{1}\{X_i \geq c\}
$$

The key assumption is that units just above and just below the cutoff are nearly identical in all respects except treatment status. If potential outcomes are continuous functions of $$X$$ at the cutoff, then any discontinuity in the observed outcome at $$c$$ can be attributed to the treatment:

$$
\tau_{RD} = \lim_{x \downarrow c} E[Y_i \mid X_i = x] - \lim_{x \uparrow c} E[Y_i \mid X_i = x]
$$

This is a **local** estimate: it identifies the treatment effect at the cutoff, for the marginal population. Whether it generalizes to units far from the cutoff is an external validity question.

The appeal of RD is that it approximates the logic of a randomized experiment in a narrow window around the cutoff. Lee and Lemieux (2010) argue that RD is "the closest thing to a real experiment" among observational methods, because---under the continuity assumption---units just above and below the threshold are "as good as randomly assigned." This makes RD designs highly credible when applicable, but the trade-off is that they identify effects only for the marginal population at the threshold.

#### Fuzzy RD

In many settings, crossing the threshold changes the *probability* of treatment but does not determine it perfectly. Students above a test score cutoff may be *eligible* for a scholarship, but not all take it up; patients above an age threshold may be *eligible* for Medicare, but some had private insurance already. This is a **fuzzy RD**.

In the fuzzy case, the treatment probability jumps at the cutoff but does not go from 0 to 1. The estimand becomes:

$$
\tau_{FRD} = \frac{\lim_{x \downarrow c} E[Y_i \mid X_i = x] - \lim_{x \uparrow c} E[Y_i \mid X_i = x]}{\lim_{x \downarrow c} E[D_i \mid X_i = x] - \lim_{x \uparrow c} E[D_i \mid X_i = x]}
$$

This is a ratio of the jump in the outcome to the jump in the treatment probability---analogous to the Wald estimator in IV, and with the same LATE interpretation. The fuzzy RD estimate identifies the effect for compliers at the cutoff: units whose treatment status is changed by crossing the threshold.
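In code, the fuzzy estimand is simply the ratio of two estimated jumps. A sketch with invented boundary estimates (in practice these would come from local linear fits on each side of the cutoff):

```python
# Invented boundary estimates: the outcome jumps by 0.9 at the cutoff,
# while the treatment probability jumps from 0.15 to 0.60.
y_above, y_below = 12.4, 11.5
d_above, d_below = 0.60, 0.15

# Wald-style ratio: effect on the outcome per unit change in treatment
# probability, interpreted as the LATE for compliers at the cutoff.
tau_frd = (y_above - y_below) / (d_above - d_below)   # 0.9 / 0.45 = 2.0
```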

#### Estimation: Local Polynomial Regression

RD estimation requires modeling the relationship between $$Y$$ and $$X$$ on either side of the cutoff. The standard approach uses **local polynomial regression**: fit separate polynomials above and below $$c$$, using observations within a bandwidth $$h$$ of the cutoff, weighted by a kernel function that gives more weight to observations closer to $$c$$.

The most common specification is a **local linear regression**:

$$
\min_{\alpha, \beta, \tau, \gamma} \sum_{i: |X_i - c| \leq h} K\left(\frac{X_i - c}{h}\right) \left[ Y_i - \alpha - \beta(X_i - c) - \tau \cdot D_i - \gamma \cdot D_i \cdot (X_i - c) \right]^2
$$

where $$K(\cdot)$$ is a kernel function (triangular is common) and $$h$$ is the bandwidth. The coefficient $$\tau$$ estimates the discontinuity.
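The minimization above translates directly into weighted least squares. A sketch on simulated data, with an assumed bandwidth, a triangular kernel, and a true jump of 2:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated sharp RD: cutoff c = 0, true discontinuity tau = 2, outcome
# linear in the running variable on both sides (assumed values throughout).
n, c, h = 2000, 0.0, 0.5
X = rng.uniform(-1, 1, n)
D = (X >= c).astype(float)
Y = 1.0 + 0.8 * (X - c) + 2.0 * D + rng.normal(0, 0.3, n)

# Triangular kernel weights within the bandwidth
u = np.abs(X - c) / h
w = np.where(u <= 1, 1 - u, 0.0)
keep = w > 0

# Regressors from the display: intercept, (X - c), D, and D*(X - c)
Z = np.column_stack([np.ones(n), X - c, D, D * (X - c)])[keep]
W, y = w[keep], Y[keep]
coef = np.linalg.solve(Z.T * W @ Z, Z.T * W @ y)   # weighted least squares
tau_hat = coef[2]                                  # estimated jump at c
```

With the data-generating process exactly linear on each side, `tau_hat` lands close to 2; in real applications one would use a data-driven bandwidth and bias-corrected inference, as discussed below.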

Why local linear and not a higher-order polynomial? Gelman and Imbens (2019) caution against using high-order global polynomials (e.g., fitting a cubic or quartic on each side of the cutoff using all the data). Such specifications can produce misleading results because they are sensitive to observations far from the cutoff, where the polynomial approximation may be poor. Local linear regression with a data-driven bandwidth avoids this problem by focusing estimation on observations close to the cutoff where the linear approximation is most reliable.

**Bandwidth selection** is critical. Too narrow a bandwidth yields imprecise estimates (high variance); too wide a bandwidth introduces bias from the polynomial approximation. Three approaches dominate:

1. **Imbens and Kalyanaraman (2012)** proposed an MSE-optimal bandwidth that minimizes the asymptotic mean squared error of the estimator.
2. **Calonico, Cattaneo, and Titiunik (2014)** developed **robust bias-corrected** inference. Their key insight is that conventional confidence intervals using the MSE-optimal bandwidth have incorrect coverage because the bias does not vanish fast enough. They propose a bias correction combined with a robust standard error that restores correct coverage. The accompanying `rdrobust` software package has become the standard tool.
3. **Coverage error-optimal** bandwidths (Calonico, Cattaneo, and Farrell 2020) choose the bandwidth to minimize coverage error of confidence intervals rather than MSE of the point estimate.

**Practical guidance.** Use `rdrobust` with the default bandwidth selector and report the robust bias-corrected confidence intervals. Show results across a range of bandwidths to demonstrate robustness. Always plot the raw data (a binned scatter plot of $$Y$$ against $$X$$) so readers can visually assess the discontinuity.

#### Validity Tests

RD designs have unusually testable implications, which is part of their appeal.

**Manipulation testing.** If units can precisely manipulate the running variable to land on a desired side of the cutoff, the near-random assignment logic breaks down. McCrary (2008) proposed a density test: if the distribution of the running variable is smooth at the cutoff, manipulation is unlikely. A discontinuity in the density suggests sorting. Cattaneo, Jansson, and Ma (2020) developed an improved test based on local polynomial density estimation (`rddensity` in R and Stata). For example, if students can retake an exam to qualify for a scholarship, bunching just above the threshold signals manipulation.

**Covariate balance.** Pre-determined covariates (characteristics determined before the running variable is realized) should not jump at the cutoff. Testing for discontinuities in covariates provides evidence on whether units just above and below the cutoff are comparable. If, say, family income jumps at a Medicaid eligibility threshold, the RD design is compromised. In practice, researchers run the RD estimation using each covariate as the outcome and verify that the estimated "effect" is close to zero.

**Sensitivity to polynomial order and bandwidth.** Results should be robust to reasonable variation in the polynomial degree and bandwidth. Fragile results that depend on a specific specification choice are less credible. A standard robustness table reports estimates using: (i) the MSE-optimal bandwidth, (ii) half and double the optimal bandwidth, and (iii) local linear versus local quadratic specifications.

**Placebo cutoffs.** Running the RD estimation at cutoff values where no policy change occurs should produce null results. A "significant" effect at a placebo cutoff suggests that the running variable has a nonlinear relationship with the outcome that is being misattributed to the treatment.

#### Example: Medicaid Eligibility and Healthcare Utilization

Medicaid eligibility in the United States is determined by income relative to the federal poverty level, creating sharp cutoffs. Card and Shore-Sheppard (2004) and subsequent studies exploit these thresholds to estimate the effect of Medicaid coverage on healthcare utilization.

The running variable is income relative to the eligibility threshold. The identifying assumption is that families just above and just below the threshold are comparable in all respects except Medicaid eligibility. This is a fuzzy RD because not all eligible families enroll (takeup is incomplete) and some ineligible families obtain coverage through other means.

Studies using this design find that Medicaid eligibility increases doctor visits, reduces emergency department use for non-urgent conditions, and improves access to preventive care---results that informed the ACA's Medicaid expansion. (See Chapter 14 for the full healthcare economics discussion.)

#### Example: Class Size Effects (Angrist and Lavy 1999)

Angrist and Lavy (1999) exploited **Maimonides' Rule**, an administrative rule in Israeli schools that caps class size at 40 students. When enrollment exceeds 40, a second class is created, causing average class size to drop discontinuously. Enrollment of 40 produces one class of 40; enrollment of 41 produces two classes of approximately 20-21.

The running variable is enrollment; the cutoffs are at multiples of 40. This generates a fuzzy RD because actual class sizes do not adjust with the mechanical precision of the rule. The authors found that smaller classes significantly improved reading and math scores in fourth and fifth grade, providing some of the most credible evidence on class size effects.

This study illustrates a useful feature of RD: the treatment effect can be estimated at multiple cutoffs (40, 80, 120), providing built-in replication and allowing researchers to check whether effects are consistent across thresholds.

***

### G.5 Bunching Estimation

#### Setup: Kink Points and Notch Points in Tax Schedules

**Bunching estimation** is a method developed specifically for public economics, exploiting the fact that tax and transfer schedules often have kinks or notches that create incentives to locate at specific points in the income distribution.

A **kink point** occurs where the marginal tax rate changes discontinuously. For example, at the point where the EITC phase-in ends and the plateau begins, the effective marginal tax rate on earned income drops from negative (a subsidy) to zero. Economic theory predicts that individuals with preferences near this kink will cluster---or "bunch"---at the kink point.

A **notch point** occurs where the *average* tax rate (or a benefit level) changes discontinuously. For instance, a property tax exemption that applies only below a price threshold creates a notch: a house priced just above the threshold pays significantly more tax than one just below. Notches create dominated regions where no optimizing agent should locate.

The distinction matters for identification. At a kink, individuals have an incentive to bunch at the kink point, but there is no "hole" in the distribution---individuals from just above the kink relocate to it. At a notch, theory predicts both bunching below the notch and a "missing mass" above it, because the notch creates a region of dominated choices. The missing mass provides additional information for identification and makes notches more powerful for estimating behavioral responses.

#### Identifying Behavioral Responses from Excess Mass

The key idea is to compare the *observed* distribution of income (or prices, or other choice variables) near the kink or notch to the *counterfactual* distribution that would prevail if the kink or notch did not exist. Excess mass at the kink point reveals behavioral responses to tax incentives.

For a convex kink (where the marginal tax rate increases), the amount of bunching is directly related to the compensated elasticity of taxable income. In the simple Saez (2010) framework, the excess mass $$B$$ at a kink where the net-of-tax rate changes from $$1 - t_0$$ to $$1 - t_1$$ is approximately:

$$
B \approx e \cdot z^* \cdot \frac{t_1 - t_0}{1 - t_1}
$$

where $$e$$ is the compensated elasticity of taxable income, $$z^*$$ is income at the kink point, and the final fraction is (approximately) the percentage change in the net-of-tax rate. This provides a direct mapping from observed bunching to the structural elasticity that governs behavioral responses to taxation.
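To fix magnitudes, a back-of-the-envelope sketch with invented numbers (not Saez's estimates). It also shows that the log-change version of the formula agrees with the fraction above only up to a small-kink approximation:

```python
import math

# Invented magnitudes: the marginal tax rate rises from t0 = 0 to
# t1 = 0.20 at a kink located at z* = 20,000, and we posit a
# compensated elasticity of e = 0.25.
t0, t1, z_star = 0.0, 0.20, 20_000.0
e = 0.25

# Predicted excess mass from B ≈ e · z* · (t1 - t0)/(1 - t1)
B = e * z_star * (t1 - t0) / (1 - t1)                   # 1250.0 income units

# Inverting with the log change in the net-of-tax rate recovers e only
# approximately, because (t1 - t0)/(1 - t1) ≈ Δlog(1 - t) for small kinks.
e_back = B / (z_star * math.log((1 - t0) / (1 - t1)))   # ≈ 0.28, not 0.25
```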

#### Estimation Procedure

The practical estimation follows Chetty, Friedman, Olsen, and Pistaferri (2011) and Kleven (2016):

1. **Fit a counterfactual distribution.** Estimate a flexible polynomial in income bin dummies, excluding bins near the kink point:

$$
C_j = \sum_{p=0}^{P} \beta_p (z_j)^p + \sum_{i=z_L}^{z_U} \gamma_i \cdot \mathbf{1}\{z_j = i\} + \varepsilon_j
$$

where $$C_j$$ is the count of taxpayers in bin $$j$$, $$z_j$$ is the income level for bin $$j$$, and the $$\gamma_i$$ terms capture the bunching region $$[z_L, z_U]$$.

2. **Compute the counterfactual.** The counterfactual distribution is the predicted values from the polynomial, omitting the bunching dummies. An important technical detail: the counterfactual must be adjusted so that the total number of individuals is preserved (integration constraint). The excess mass at the kink must come from *somewhere*---individuals who bunch were "shifted" from incomes above the kink. Failing to impose this constraint can bias the elasticity estimate.
3. **Measure excess mass.** The bunching estimate $$\hat{B}$$ is the difference between the observed and counterfactual counts in the bunching region, normalized by the counterfactual density at the kink:

$$
\hat{b} = \frac{\hat{B}}{h_0(z^*)}
$$

where $$h_0(z^*)$$ is the counterfactual density at the kink point.

4. **Back out the elasticity.** Given the tax rate change at the kink, invert the bunching formula to recover the elasticity:

$$
\hat{e} = \frac{\hat{b}}{\Delta \log(1-t) \cdot z^*}
$$

Standard errors are typically obtained via bootstrap, re-estimating the counterfactual distribution in each replication.

**Practical choices.** Results can be sensitive to the polynomial degree $$P$$ and the width of the excluded region $$[z_L, z_U]$$. Researchers should report estimates across a range of specifications. A polynomial order of $$P = 7$$ is common, but results should be checked for $$P = 5$$ through $$P = 10$$. The bunching window should be wide enough to capture all the excess mass but not so wide as to distort the counterfactual fit.
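The steps above can be sketched end-to-end on simulated bin counts. All magnitudes are invented, and a real application would add bootstrap standard errors and the integration constraint discussed in step 2:

```python
import numpy as np

# Simulated histogram: 100-unit income bins around a kink at z* = 20,000,
# where the marginal tax rate rises from 0% to 20% (magnitudes invented).
z = np.arange(15_000.0, 25_000.0, 100.0)       # bin midpoints
z_star, width = 20_000.0, 100.0
counts = 1_000.0 - 0.05 * (z - 15_000.0)       # smooth counterfactual counts
k = int(np.argmin(np.abs(z - z_star)))
counts[k - 1: k + 2] += np.array([100.0, 300.0, 100.0])   # excess mass

# Steps 1-2: polynomial fit excluding the bunching window, then predict
# the counterfactual everywhere (z rescaled for numerical stability).
window = np.arange(k - 1, k + 2)
mask = np.ones(z.size, dtype=bool)
mask[window] = False
zc = (z - z_star) / 1_000.0
cf = np.polyval(np.polyfit(zc[mask], counts[mask], deg=5), zc)

# Step 3: excess mass, normalized by the counterfactual density at the
# kink (counts per unit of income), so b_hat is in income units.
B_hat = (counts[window] - cf[window]).sum()    # ≈ 500 taxpayers
b_hat = B_hat / (cf[k] / width)                # ≈ 66.7 income units

# Step 4: invert the bunching formula to recover the implied elasticity.
e_hat = b_hat / (z_star * np.log((1 - 0.0) / (1 - 0.2)))
```

Because the simulated counterfactual is smooth, the polynomial recovers it almost exactly and the excess mass matches what was injected; with real data, the sensitivity checks described above become essential.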

#### Optimization Frictions and Their Implications

A striking finding in the bunching literature is that observed bunching is typically *much smaller* than standard models predict. Chetty (2012) argued that this reflects **optimization frictions**: adjustment costs, inattention, and institutional constraints that prevent individuals from locating precisely at kink points.

Chetty's key insight is that small amounts of bunching need not imply small elasticities. If frictions prevent individuals from responding to *small* tax rate changes, observed bunching underestimates the elasticity relevant for *large* tax reforms. The paper formalizes this with a model of optimization frictions and shows that the welfare implications of taxation depend on the "structural" elasticity (net of frictions) rather than the "observed" elasticity at kinks.

This has profound implications for tax policy. Small bunching at EITC kinks might lead one to conclude that low-income workers are inelastic, justifying high marginal tax rates. But if the small response reflects frictions rather than preferences, large reforms could trigger much larger responses. The distinction matters enormously for optimal tax design. (See Chapter 16 on optimal taxation and Chapter 18 on income taxation for further discussion.)

#### Example: EITC Kink and Self-Employment Income

Saez (2010) documented sharp bunching at the first EITC kink point---where the credit reaches its maximum---but almost entirely among self-employed taxpayers. Wage earners, whose income is reported by employers on W-2 forms, show minimal bunching.

This asymmetry reveals something important. Self-employed workers have more control over their reported income (whether through real labor supply adjustments or reporting behavior). The bunching among the self-employed reflects either genuine labor supply responses or income misreporting---the method alone cannot distinguish the two. Wage earners face frictions: their hours are set by employers, and their income is third-party reported.

The Saez (2010) finding has shaped how economists think about the elasticity of taxable income. The relevant elasticity depends on the enforcement environment: with third-party reporting (wages), responses are small; with self-reporting (self-employment), responses are larger, but much of the "response" may be avoidance or evasion rather than real economic activity.

#### Example: UK Stamp Duty Notch

Best and Kleven (2018) exploited notches in the UK property transaction tax (stamp duty), where the tax rate jumps discontinuously at specific price thresholds. For example, a house selling for 250,001 pounds might face a substantially higher tax bill than one selling for 250,000 pounds.

The authors documented sharp bunching just below the notch points and a "missing mass" just above---exactly the pattern predicted by optimization at a notch. They used the size of the bunching and missing mass to estimate the elasticity of house prices with respect to the transaction tax. Their estimates imply substantial real effects on housing market activity, with transactions falling sharply near notch points.

This application illustrates how bunching estimation extends beyond income taxation to any setting where a policy creates discontinuities in incentives. Other applications include speed limit enforcement thresholds (as in the case of Danish speeding penalties studied by Kleven 2016), retirement age notches, and corporate tax brackets.

***

### G.6 Synthetic Control Methods

#### The Setup

The **synthetic control method** (Abadie and Gardeazabal 2003; Abadie, Diamond, and Hainmueller 2010) addresses a common problem in comparative policy analysis: estimating the effect of an intervention when it affects a single aggregate unit (a state, country, or large organization) and there is no obvious control group.

Consider a treated unit (say, California) that adopts a policy at time $$T_0$$. We observe $$J$$ untreated units (the **donor pool**) over the same period. The goal is to construct a counterfactual: what would California's outcome have been absent the policy?

#### Constructing the Synthetic Control

The synthetic control is a **weighted average** of untreated units chosen to match the treated unit's pre-treatment characteristics and outcomes. Formally, the synthetic control weight vector $$W^* = (w_2^*, \ldots, w_{J+1}^*)$$ solves:

$$
W^* = \arg\min_W (X_1 - X_0 W)' V (X_1 - X_0 W)
$$

subject to $$w_j \geq 0$$ for all $$j$$ and $$\sum_j w_j = 1$$, where $$X_1$$ is a vector of pre-treatment characteristics for the treated unit, $$X_0$$ is the corresponding matrix for donor units, and $$V$$ is a positive definite matrix that weights the relative importance of different characteristics.

The non-negativity and summing-to-one constraints ensure that the synthetic control is a *convex combination* of real units, preventing extrapolation beyond the support of the data. This is a key advantage over regression-based approaches, which can assign negative weights (equivalent to extrapolation).

The choice of $$V$$ matters. Common approaches include: (i) using $$V$$ proportional to the inverse of the covariance matrix of the matching variables, (ii) choosing $$V$$ to minimize the pre-treatment prediction error (cross-validated), or (iii) setting $$V$$ diagonal with elements reflecting the predictive power of each variable. The results can be sensitive to this choice, and researchers should check robustness.

The estimated treatment effect at time $$t > T_0$$ is:

$$
\hat{\tau}_t = Y_{1t} - \sum_{j=2}^{J+1} w_j^* Y_{jt}
$$

If the synthetic control closely matches the treated unit's pre-treatment trajectory, deviations in the post-treatment period are plausibly attributable to the intervention.
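Under the constraints above, the weights solve a quadratic program over the simplex. A self-contained sketch on toy data, with $$V$$ set to the identity and a simple exponentiated-gradient loop standing in for a proper constrained solver:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy donor pool: 4 untreated units over 8 pre-treatment periods. The
# treated unit is built as a known convex combination of donors plus a
# little noise, so the solver should roughly recover those weights.
X0 = rng.normal(size=(8, 4))                   # donors: periods x units
true_w = np.array([0.6, 0.4, 0.0, 0.0])
X1 = X0 @ true_w + rng.normal(0.0, 0.01, 8)    # treated unit's pre-path

def loss(w):
    d = X1 - X0 @ w
    return d @ d                               # (X1 - X0 w)'V(X1 - X0 w), V = I

def synth_weights(X1, X0, steps=20_000, lr=0.01):
    """Minimize the loss over the simplex via exponentiated gradient:
    multiplicative updates keep weights positive, renormalizing keeps
    them summing to one."""
    w = np.full(X0.shape[1], 1.0 / X0.shape[1])
    for _ in range(steps):
        grad = -2.0 * X0.T @ (X1 - X0 @ w)
        w = w * np.exp(-lr * grad)
        w = w / w.sum()
    return w

w_star = synth_weights(X1, X0)                 # nonnegative, sums to one
```

The multiplicative update enforces the convex-combination constraints by construction, which is the feature of synthetic control that rules out extrapolation.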

#### Inference: Placebo Tests

Because the synthetic control method typically involves a single treated unit, standard statistical inference is inapplicable. Instead, inference proceeds through **placebo tests** (sometimes called permutation tests).

The idea is simple: apply the synthetic control method to *each* untreated unit in the donor pool, pretending it was treated. This generates a distribution of "placebo effects"---estimated effects for units that were not actually treated. If the estimated effect for the truly treated unit is large relative to the placebo distribution, the effect is considered statistically significant.

Formally, for each donor unit $$j$$, construct a synthetic control from the remaining units and compute the post-treatment gap. The **p-value** is the fraction of placebo gaps as large or larger than the treated unit's gap. With $$J$$ donor units, the smallest achievable p-value is $$1/(J + 1)$$, which places a practical lower bound on the number of donor units needed.

A refinement normalizes the post-treatment gaps by the pre-treatment fit (the root mean squared prediction error, RMSPE). This accounts for the fact that some placebo units may be poorly matched pre-treatment, inflating their post-treatment gaps even absent any real effect. The ratio of post/pre RMSPE provides a more informative test statistic.
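The permutation logic is easy to sketch with invented RMSPE ratios. With 19 placebos plus the treated unit, the smallest attainable p-value is 1/20:

```python
import numpy as np

# Post/pre RMSPE ratios: one treated unit and 19 donor-pool placebos
# (all numbers invented for illustration).
treated_ratio = 9.0
placebo_ratios = np.array([1.2, 0.8, 1.5, 2.1, 0.9, 1.1, 1.7, 0.7, 1.3, 2.5,
                           1.0, 0.6, 1.4, 1.9, 1.1, 0.8, 2.2, 1.6, 1.0])

# Permutation p-value: share of all units (treated unit included) whose
# ratio is at least as large as the treated unit's.
all_ratios = np.append(placebo_ratios, treated_ratio)
p_value = (all_ratios >= treated_ratio).mean()   # 1/20 = 0.05
```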

#### Example: California's Tobacco Control Program

Abadie, Diamond, and Hainmueller (2010) estimated the effect of California's Proposition 99, a 1988 tobacco control program that combined a 25-cent cigarette tax increase with anti-smoking media campaigns and clean indoor air legislation.

The synthetic control for California was constructed from a weighted average of other states that did not implement similar programs. The resulting synthetic California closely tracked actual California's per-capita cigarette consumption in the pre-treatment period (1970-1988). After 1988, actual consumption in California declined sharply relative to its synthetic control, with the gap widening over time to approximately 25 packs per capita by 2000.

Placebo tests confirmed the significance of this finding: California's estimated effect was far larger than the effects estimated for any donor state. This study established the synthetic control method as a viable tool for policy evaluation and has been cited thousands of times.

#### Extensions: Generalized Synthetic Control and Matrix Completion

The original synthetic control method works best with a single treated unit and a relatively long pre-treatment period. Several extensions address its limitations:

**Xu (2017)** proposed the **generalized synthetic control (GSC)** method, which combines interactive fixed effects models with the synthetic control idea. GSC can handle multiple treated units, does not require that the treated unit lie within the convex hull of the donor pool, and provides standard errors through parametric bootstrap. The method estimates a factor model from the untreated observations and uses it to impute counterfactual outcomes for treated units. Software: `gsynth` in R.

**Athey, Bayati, Doudchenko, Imbens, and Khosravi (2021)** proposed a **matrix completion** approach that frames the problem as filling in missing entries (treated unit's counterfactual outcomes) in a matrix of outcomes. The method uses nuclear norm minimization, borrowing techniques from the machine learning literature on recommender systems. It nests both DiD and synthetic control as special cases and can handle multiple treated units with staggered adoption.

**Synthetic DiD (Arkhangelsky, Athey, Hirshberg, Imbens, and Wager 2021)** combines the synthetic control and DiD approaches by reweighting both units *and* time periods. The method finds unit weights that balance pre-treatment outcomes (as in synthetic control) and time weights that balance pre-treatment periods (as in DiD), then estimates the treatment effect using the reweighted data. This approach inherits the strengths of both methods and has been shown to perform well in simulations.

These methods are particularly relevant for public economics applications involving state-level policy variation (Medicaid expansions, tax changes, criminal justice reforms) where the number of treated states is small relative to the total. (Chapter 23 on fiscal federalism discusses how state-level policy variation creates opportunities for evaluation.)

***

### G.7 Administrative Data in Public Economics

#### The Administrative Data Revolution

The past two decades have witnessed a transformation in the data available for public economics research. Where economists once relied primarily on surveys---the Current Population Survey, the Survey of Income and Program Participation, the Panel Study of Income Dynamics---they now increasingly use **administrative records** generated by government agencies as a byproduct of program administration.

This shift has expanded the frontier of what is empirically answerable. Questions that were once intractable due to small samples, measurement error, or short time horizons can now be studied with unprecedented precision.

#### Types of Administrative Data

**Tax records.** The Internal Revenue Service maintains detailed records on income, deductions, credits, and tax payments for every filing unit in the United States. Researchers access these data through the IRS Statistics of Income division, the Office of Tax Analysis at the Treasury Department, or formal data-sharing agreements. Tax data provide a near-census of the income distribution and can be linked across years to construct long panels.

**Program administration data.** Social Security Administration records contain lifetime earnings histories for virtually all American workers. The Centers for Medicare and Medicaid Services (CMS) maintain detailed claims data for Medicare and Medicaid beneficiaries. Unemployment insurance records provide quarterly earnings data at the state level. Each program's administrative system generates data as a byproduct of paying benefits or collecting contributions.

**Linked datasets.** The most powerful applications combine records across agencies. Chetty, Hendren, Kline, and Saez (2014) linked tax records across generations to study intergenerational mobility. Finkelstein, Hendren, and Luttmer (2019) linked health insurance records to mortality data. These linkages---typically performed through Social Security numbers or probabilistic matching---enable research designs that no single data source could support.
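Mechanically, a deterministic link is a keyed join. The sketch below uses hypothetical toy data and column names of our choosing; real linkages key on identifiers such as SSNs inside secure computing environments. It also shows why analysts track the match rate, since unmatched records are a source of selection.

```python
import pandas as pd

# Hypothetical files from two "agencies" sharing an identifier.
school = pd.DataFrame({"id": ["A1", "B2", "C3"],
                       "teacher_va": [0.2, -0.1, 0.0]})
tax = pd.DataFrame({"id": ["A1", "C3", "D4"],
                    "adult_earnings": [52_000, 48_000, 61_000]})

# A left join keeps every school record and flags which ones matched,
# so the match rate can be reported alongside any estimate.
linked = school.merge(tax, on="id", how="left", indicator=True)
match_rate = (linked["_merge"] == "both").mean()  # 2 of 3 match
```

Probabilistic matching generalizes this pattern by scoring candidate pairs on fields like names and dates of birth when no exact shared key exists.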

**Education records.** State longitudinal data systems track students from kindergarten through college and into the labor market. The National Student Clearinghouse provides nationwide college enrollment and completion data. These records enable studies of educational interventions with long-run outcome measurement.

#### Advantages

**Universe coverage.** Administrative data often cover the entire population rather than a sample. This eliminates sampling error and enables analysis of rare events, small subgroups, and geographic variation that would be impossible with surveys. Chetty et al. (2014) could estimate intergenerational mobility for every commuting zone in America only because tax records cover nearly all families.

**No recall bias.** Survey respondents misremember, approximate, and sometimes fabricate. Administrative records capture transactions as they occur. Income reported to the IRS is anchored by third-party reporting (W-2s, 1099s), making it far more accurate than self-reported survey income, particularly in the tails of the distribution.

**Precise measurement.** Tax records measure income to the dollar. Claims data record exact dates of service, diagnoses, and charges. This precision enables methods like bunching estimation that require detecting subtle features of distributions.

**Longitudinal tracking.** Administrative records naturally follow individuals over time as they continue to interact with government programs. The Social Security earnings file covers individuals' entire careers, enabling studies of life-cycle dynamics that no prospective survey could match.

**Large samples for subgroup analysis.** With millions or billions of observations, researchers can estimate effects for narrowly defined subgroups---specific birth cohorts, small geographic areas, or particular demographic intersections---with statistical precision.

#### Challenges

**Access restrictions.** Administrative data are not publicly available. Accessing IRS records, for instance, requires either affiliation with a government agency (Treasury, Joint Committee on Taxation) or a formal data-sharing agreement that can take years to negotiate. This creates an uneven playing field where researchers with government affiliations or established relationships have significant advantages. The Census Bureau's Federal Statistical Research Data Centers provide secure access to some linked administrative-survey data, but slots are limited and the approval process is lengthy.

**Privacy protections.** Administrative records contain sensitive personal information. Access agreements impose strict confidentiality requirements, including approval of all output for disclosure risk. Researchers cannot share individual-level data and must ensure that published results cannot be used to identify specific individuals. Differential privacy and synthetic data approaches are emerging solutions, but they introduce noise that can affect research findings.
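The Laplace mechanism illustrates the core idea behind differential privacy: calibrate noise to a statistic's *sensitivity*, the most any single individual's record can change it. This sketch and its parameter choices are illustrative, not any agency's actual disclosure procedure.

```python
import numpy as np

def laplace_release(true_value, sensitivity, epsilon, rng):
    """Release a statistic with epsilon-differential privacy by adding
    Laplace noise with scale sensitivity / epsilon."""
    return true_value + rng.laplace(scale=sensitivity / epsilon)

rng = np.random.default_rng(0)
# A count query has sensitivity 1: adding or removing one person
# changes the count by at most 1. Smaller epsilon means stronger
# privacy and a noisier released value.
noisy_count = laplace_release(10_000, sensitivity=1, epsilon=0.1, rng=rng)
```

The noise is mean-zero, so large aggregates remain approximately unbiased, but small cells become unreliable at low epsilon, which is precisely the tension for research use noted above.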

**Lack of survey variables.** Administrative data measure what government systems record, which may not include the variables researchers most want. Tax records do not capture wealth (recent proposals for expanded reporting notwithstanding), preferences, subjective well-being, health status, or detailed demographics. Program data may lack information about non-participants. This limitation means that administrative data often complement rather than replace surveys.

**Variable definitions tied to program rules.** "Income" in tax data means taxable income as defined by the tax code, which differs from economic income. Changes in tax law can alter the definition of measured variables over time, creating artificial trends. Medicaid claims data reflect utilization conditional on enrollment, not underlying health needs. Researchers must understand the institutional details that generate the data---a point that underscores the importance of domain knowledge in public economics research.

**Selection into the data.** People who do not file tax returns are missing from IRS data. Non-citizens may be absent from Social Security records. Program data cover only participants. The populations that are hardest to study---the very poor, the undocumented, the disconnected---are often the populations with the weakest administrative data footprint. Bollinger, Hirsch, Hokayem, and Ziliak (2019) show that CPS non-response is correlated with poverty and program participation, meaning that even linking surveys to administrative data does not fully solve the problem.


#### Example: Teacher Value-Added and Long-Run Outcomes

Chetty, Friedman, and Rockoff (2014) linked school district records (identifying students' teachers) to IRS tax records (measuring students' adult outcomes) to estimate the long-run effects of teacher quality. They found that students randomly assigned to high-value-added teachers in elementary school earned significantly more as adults, were more likely to attend college, and were less likely to have teenage pregnancies.

The key methodological contribution was using administrative data to measure outcomes decades after the intervention. No survey could have tracked hundreds of thousands of students from third grade through their late twenties with the precision of tax records. The study also demonstrated the power of linking data across agencies---school district records identified the treatment (teacher assignment), while IRS records measured the outcome (adult earnings).

The findings had direct policy implications: they implied that replacing a teacher in the bottom 5 percent of the value-added distribution with an average teacher would increase students' lifetime earnings by approximately $250,000 per classroom. This calculation illustrates how administrative data can transform abstract research findings into concrete policy-relevant magnitudes. (See Chapter 15 for extended discussion of education production functions and teacher effectiveness.)
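The arithmetic behind that magnitude can be reproduced with assumed round numbers. The inputs below are our illustrative choices, not the paper's exact figures.

```python
# Back-of-envelope (assumed round numbers): a present-value lifetime
# earnings gain of about $9,000 per student from the teacher
# replacement, applied to a class of 28 students.
per_student_gain = 9_000       # assumed PV gain per student, dollars
class_size = 28                # assumed students per classroom
classroom_gain = per_student_gain * class_size
print(f"${classroom_gain:,}")  # $252,000, i.e., roughly $250,000
```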

#### Example: The Oregon Health Insurance Experiment with Medicaid Claims

Finkelstein et al. (2012) combined the Oregon Health Insurance Experiment's randomized design with Medicaid claims data and hospital discharge records to measure the effects of insurance coverage on healthcare utilization.

The administrative data provided several advantages over the accompanying survey. Claims data captured *all* interactions with the healthcare system, not just those respondents remembered. They provided exact dates, diagnoses (via ICD codes), and charges. Hospital discharge records covered emergency department visits regardless of insurance status, enabling measurement of a key outcome (uncompensated care) that surveys would miss.

The study found that Medicaid coverage increased emergency department visits by about 40 percent---a finding that surprised many observers who expected insurance to shift care from emergency departments to primary care settings. The precision of the administrative data was essential for detecting this effect and ruling out the hypothesis that insurance *reduces* emergency department use. (See Chapter 14 for the full implications of the Oregon experiment for healthcare policy.)

#### The Democratization Challenge

The administrative data revolution has been remarkably productive, but it raises equity concerns within the profession. A small number of research teams---often based at elite universities with established government partnerships---have produced a disproportionate share of the most influential administrative data studies. Researchers at teaching-focused institutions, early-career scholars, and those outside the United States face significant barriers to accessing comparable data.

Several initiatives aim to address this imbalance. The Census Bureau's Research Data Centers provide broader access, though with constraints. The IRS's Statistics of Income public-use files offer limited microdata. Chetty's Opportunity Insights group has released aggregated statistics from their administrative data analyses. International data-sharing arrangements---particularly in the Scandinavian countries, where population registers are extensive and access is more institutionalized---offer alternative models. Yet the fundamental tension between data access and privacy protection remains unresolved, and it shapes who gets to answer which questions in public economics.

***

### Summary

The methods surveyed in this appendix share a common logic: each addresses the fundamental problem of causal inference---the impossibility of observing the same unit in both treated and untreated states---through a different strategy for constructing credible counterfactuals. Randomized experiments solve the problem by design. Instrumental variables isolate exogenous variation in treatment. Difference-in-differences exploits parallel trends between treatment and control groups. Regression discontinuity leverages threshold rules that create near-random assignment. Bunching estimation infers behavioral responses from the shape of distributions at policy-relevant kinks. Synthetic control constructs counterfactuals from weighted combinations of untreated units. Administrative data enable all of these methods to operate at unprecedented scale and precision.

No method is universally superior. Each requires assumptions---some testable, some not---and each answers a slightly different question. The Angrist and Pischke (2009) mantra of "what is your identification strategy?" remains the right starting point for any empirical analysis. But identification is necessary, not sufficient: measurement matters, inference must be appropriate, and external validity requires judgment that goes beyond any single study.

The credibility revolution in economics has raised the evidentiary bar for policy claims, and rightly so. But it has also created a temptation to value identification over importance---to favor well-identified estimates of small effects over suggestive evidence on large questions. The best public economics research combines rigorous methods with substantive ambition. The tools in this appendix are means, not ends.

***

**KEY CONCEPTS**

* **Population regression function:** The conditional expectation $$E[Y \mid X]$$ that OLS estimates target.
* **Omitted variable bias:** Bias arising from excluding a relevant variable; equals the omitted variable's effect on the outcome $$\times$$ the coefficient from an auxiliary regression of the omitted variable on the included one.
* **Cluster-robust standard errors:** Standard errors that account for within-group correlation in the error term.
* **Instrumental variable:** A variable correlated with the endogenous regressor but uncorrelated with the error term.
* **Local average treatment effect (LATE):** The causal effect for "compliers"---units whose treatment status is changed by the instrument.
* **Exclusion restriction:** The assumption that an instrument affects the outcome only through the endogenous regressor.
* **Parallel trends assumption:** The identifying assumption in DiD that treatment and control groups would have followed the same trajectory absent treatment.
* **Staggered adoption:** Treatment rollout where different units adopt at different times; requires modern DiD estimators for valid inference.
* **Regression discontinuity:** A design exploiting threshold rules for treatment assignment; identifies local effects at the cutoff.
* **Bunching estimation:** A method that infers behavioral responses from excess mass at kink or notch points in policy schedules.
* **Optimization frictions:** Adjustment costs and inattention that attenuate observed behavioral responses relative to structural elasticities.
* **Synthetic control:** A weighted average of untreated units constructed to match the treated unit's pre-treatment trajectory.
* **Administrative data:** Government records generated as a byproduct of program administration; provide large-scale, precise, longitudinal data.

***

### Discussion Questions

1. **Comprehension.** The omitted variable bias formula is $$\text{Bias} = \beta_2 \cdot \delta_{21}$$. In a regression of health outcomes on Medicaid enrollment, what is the likely sign of the omitted variable bias from income? What does this imply about the direction of bias in the naive OLS estimate of Medicaid's effect?
2. **Application.** A researcher uses the number of siblings as an instrument for years of schooling in an earnings regression. Evaluate the relevance and exclusion restriction for this instrument. Under what conditions might the exclusion restriction fail?
3. **Methodology.** A state adopts a new job training program in 2015. Using data from 2010--2020, you estimate a DiD comparing participants to non-participants. A colleague suggests this design is flawed and recommends comparing the adopting state to non-adopting states instead. Who is right, and why?
4. **Evaluation.** A bunching study finds substantial excess mass at an EITC kink point among self-employed workers but none among wage earners. Two interpretations are possible: (a) self-employed workers have more elastic labor supply, or (b) self-employed workers manipulate reported income. What evidence would help distinguish these interpretations?
5. **Research design.** You want to estimate the effect of a recent Medicaid expansion on infant mortality. Sketch two feasible research designs (using methods from this appendix), state the identifying assumptions for each, and discuss which assumption you find more credible.

***

### Further Reading

**ACCESSIBLE**

* Angrist, Joshua D., and Jörn-Steffen Pischke. *Mastering 'Metrics: The Path from Cause to Effect*. Princeton University Press, 2015. A readable introduction to the five core methods of causal inference, with numerous examples from public policy.
* Cunningham, Scott. *Causal Inference: The Mixtape*. Yale University Press, 2021. An accessible treatment with code examples in Stata and R, freely available online; particularly strong on DiD and synthetic control.

**INTERMEDIATE**

* Angrist, Joshua D., and Jörn-Steffen Pischke. *Mostly Harmless Econometrics: An Empiricist's Companion*. Princeton University Press, 2009. The standard reference for applied microeconometrics; assumes familiarity with regression and matrix algebra.
* Huntington-Klein, Nick. *The Effect: An Introduction to Research Design and Causality*. Chapman and Hall/CRC, 2021. A modern treatment with excellent DAG-based visualizations and code in R, Python, and Stata.
* Cattaneo, Matias D., Nicolás Idrobo, and Rocío Titiunik. *A Practical Introduction to Regression Discontinuity Designs*. Cambridge University Press, 2020. The definitive practical guide to RD estimation, written by the developers of `rdrobust`.

**ADVANCED**

* Abadie, Alberto, and Matias D. Cattaneo. "Econometric Methods for Program Evaluation." *Annual Review of Economics* 10 (2018): 465--503. A comprehensive survey of modern program evaluation methods, covering IV, DiD, RD, and synthetic control.
* de Chaisemartin, Clément, and Xavier D'Haultfœuille. "Two-Way Fixed Effects and Differences-in-Differences with Heterogeneous Treatment Effects: A Survey." *Econometrics Journal* 26, no. 3 (2023): C1--C30. The definitive survey of the staggered DiD literature with practical recommendations.
* Kleven, Henrik J. "Bunching." *Annual Review of Economics* 8 (2016): 435--464. The standard reference on bunching estimation methods in public economics.
* Card, David, Stefano DellaVigna, Patricia Funk, and Nagore Iriberri. "Are Referees and Editors in Economics Gender Neutral?" *Quarterly Journal of Economics* 135, no. 1 (2020): 269--327. Not directly about methods, but an exemplary use of administrative data (submission records) with RD and DiD methods applied to questions about the economics profession itself.
