Chapter 12: Instrumental Variables
Opening Question
When you cannot randomly assign a treatment and observable confounders do not explain all the selection, can you still learn something about causal effects?
Chapter Overview
The methods we have encountered so far require either controlling the treatment assignment (experiments) or observing all relevant confounders (selection on observables). But what if neither condition holds? What if people select into treatment based on factors we cannot see or measure?
Instrumental variables (IV) offer an answer: find an external source of variation that affects treatment but has no direct effect on outcomes. This is the logic behind draft lotteries, compulsory schooling laws, and weather shocks. When such variation exists, we can use it to identify causal effects even in the presence of unobserved confounding.
This chapter develops the theory and practice of IV estimation. We begin with the core logic, move through estimation and inference, confront the complications of weak instruments and heterogeneous effects, and end with practical guidance on when IV is credible. The returns to education serve as our primary running example---a question that has driven methodological innovation for three decades.
What you will learn:
The logic of instrumental variables as a source of exogenous variation
How 2SLS estimation works and when it recovers causal effects
What LATE means and why it matters for interpretation
How to detect and handle weak instruments
When IV estimates are (and are not) credible
Prerequisites: Chapter 9 (The Causal Framework), Chapter 3 (Statistical Foundations)
12.1 The Logic of Instruments
Why We Need Instruments
Consider the classic question: what is the causal effect of an additional year of education on earnings?
If we simply regress log wages on years of schooling, we get a correlation. But this correlation likely overstates the causal effect. People who obtain more education may differ systematically from those who do not---in motivation, ability, family background, and countless other ways that also affect earnings. These unobserved factors confound the relationship between education and wages.
Chapter 11 addressed this problem by assuming we could observe and control for all relevant confounders. But what if we cannot measure ability? What if family background is only partially captured by parental education and income?
Instrumental variables offer an alternative path forward.
The Core Idea
The logic of IV is simple in principle:
Find a variable Z that affects the treatment D
Verify that Z affects the outcome Y only through its effect on D
Use the variation in D induced by Z to estimate the causal effect
The variable Z is called an instrument. It must satisfy two conditions:
Condition 1: Relevance. The instrument must affect the treatment: Cov(Z,D) ≠ 0
Condition 2: Exclusion. The instrument must affect the outcome only through treatment: Cov(Z,ε) = 0, where ε represents all factors affecting Y other than D
The relevance condition is testable. The exclusion restriction is not---it is an assumption about the world that requires substantive justification.
Example: Vietnam-Era Draft Lottery
Perhaps the most famous IV for education is the Vietnam-era draft lottery. In the early 1970s, the U.S. military drafted men based on randomly assigned lottery numbers. Men with low lottery numbers faced high probability of military service; men with high numbers faced almost none.
How does this help with returns to education? Men who received low lottery numbers often sought draft deferments---and one reliable deferment was college enrollment. So the draft lottery affected education. But the lottery number itself was randomly assigned, meaning it should not be correlated with ability, family background, or other confounders.
The logic of the IV strategy:
Z = Draft lottery number (or indicator for "high draft risk")
D = Years of education
Y = Log earnings
The draft lottery provides exogenous variation in education. By comparing outcomes for men with different lottery numbers, we can estimate how education causally affects earnings---without needing to observe ability or other confounders.
Running Example: Returns to Education
The effect of education on earnings is our primary running example for this chapter. We will see the draft lottery IV (Angrist 1990), the compulsory schooling IV (Angrist & Krueger 1991), and the geographic proximity IV (Card 1995). Each illustrates different aspects of IV methodology. By chapter's end, you will understand both the power and the limitations of IV for answering this fundamental question.

12.2 Formal Framework
The Structural Equations
Consider the standard setup with a single endogenous regressor:
Yi=β0+β1Di+εi
where:
Yi is the outcome
Di is the treatment (endogenous: Cov(D,ε) ≠ 0)
β1 is the causal effect we want to estimate
εi captures unobserved factors affecting Y
The problem is that D is correlated with ε. People with high ε (high ability, say) tend to have high D (more education). OLS conflates the causal effect β1 with the correlation between D and ε.
Now introduce an instrument Z satisfying:
Assumption 12.1 (Relevance): Cov(Z,D) ≠ 0
Assumption 12.2 (Exclusion): Cov(Z,ε)=0
The Wald Estimator
Under these assumptions, we can derive the IV estimator. The key insight is that:
Cov(Z,Y)=Cov(Z,β0+β1D+ε)=β1⋅Cov(Z,D)+Cov(Z,ε)
If the exclusion restriction holds (Cov(Z,ε)=0), then:
Cov(Z,Y)=β1⋅Cov(Z,D)
Solving for β1:
β1 = Cov(Z,Y) / Cov(Z,D)
This is the Wald estimator when Z is binary. It equals the ratio of:
The reduced form: how Z affects Y
The first stage: how Z affects D
Interpretation: The effect of Z on Y comes entirely through D. Divide by how much Z moves D to get how much D moves Y.
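A minimal simulation sketch can make this concrete. The setup below is illustrative only (the variable names and coefficients are invented, not taken from any dataset in the chapter): an unobserved confounder raises both schooling and wages, so OLS is biased, while the covariance ratio recovers the true effect.

```r
set.seed(12)
n <- 10000
u <- rnorm(n)                            # unobserved ability (confounder)
z <- rbinom(n, 1, 0.5)                   # randomly assigned, lottery-style instrument
d <- 12 + 2 * z + 1.5 * u + rnorm(n)     # first stage: Z shifts schooling D
y <- 1 + 0.10 * d + 0.8 * u + rnorm(n)   # true effect of D on Y is 0.10

cov(z, y) / cov(z, d)        # IV (Wald) estimate: close to 0.10
coef(lm(y ~ d))["d"]         # OLS estimate: biased upward by the confounder
```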
Two-Stage Least Squares (2SLS)
The Wald estimator extends naturally to multiple instruments and continuous instruments via two-stage least squares:
First stage: Regress D on Z (and any exogenous controls X): Di=π0+π1Zi+π2′Xi+νi
Second stage: Regress Y on the fitted values D^ (and controls X): Yi=β0+β1D^i+β2′Xi+ui
The coefficient β^1 from the second stage is the 2SLS estimator.
Intuition: The first stage isolates the variation in D that comes from Z. The second stage uses only this "clean" variation to estimate the effect on Y. Variation in D coming from ε is stripped away.

What Could Go Wrong?
The two assumptions---relevance and exclusion---may fail:
Weak instruments: If Z only weakly predicts D, the first stage is weak. This creates several problems:
Large sampling variance
Bias toward OLS
Unreliable standard errors
We address weak instruments in Section 12.4.
Exclusion violation: If Z directly affects Y---or affects Y through some channel other than D---the IV estimate is biased. Unlike weak instruments, exclusion violations cannot be detected from the data. They require substantive argument.
12.3 Estimation and Inference
Implementing 2SLS
Standard software makes 2SLS easy. In Stata, the built-in command is ivregress 2sls. In R, the AER package provides ivreg:
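A minimal sketch of the R call, assuming a hypothetical data frame wages with variables log_wage, educ, exper, and an instrument lottery (these names are placeholders, not a specific dataset from the chapter):

```r
library(AER)  # provides ivreg(); install.packages("AER") if needed

# Formula: outcome ~ endogenous + exogenous controls | instruments + exogenous controls
fit <- ivreg(log_wage ~ educ + exper | lottery + exper, data = wages)

# diagnostics = TRUE adds first-stage ("Weak instruments"), Wu-Hausman, and Sargan tests
summary(fit, diagnostics = TRUE)
```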
Important: Always use the built-in IV commands. Do not manually run two regressions---this produces incorrect standard errors.
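To see why, the sketch below (simulated, illustrative data) runs the two stages by hand: the point estimate matches ivreg exactly, but the manual second-stage standard error does not, because lm() treats the fitted values as data rather than as estimates.

```r
library(AER)
set.seed(7)
n <- 5000
u <- rnorm(n)                  # unobserved confounder
z <- rnorm(n)                  # instrument
d <- 0.5 * z + u + rnorm(n)    # endogenous treatment
y <- 0.10 * d + u + rnorm(n)   # true effect is 0.10

d_hat  <- fitted(lm(d ~ z))    # manual first stage
manual <- lm(y ~ d_hat)        # manual second stage
proper <- ivreg(y ~ d | z)     # built-in 2SLS

coef(manual)["d_hat"]; coef(proper)["d"]   # identical point estimates
sqrt(vcov(manual)["d_hat", "d_hat"])       # manual standard error: wrong
sqrt(vcov(proper)["d", "d"])               # 2SLS standard error: correct
```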
First Stage Diagnostics
Before trusting IV estimates, examine the first stage:
F-statistic on excluded instruments: A rule of thumb is F > 10 (Stock, Wright & Yogo 2002). Modern practice often demands F > 100 for robust inference.
First stage coefficient: Is the sign correct? Is the magnitude plausible?
Visual inspection: Plot D against Z. Is there a clear relationship?
Practical Box: First Stage Checklist
Report the first-stage regression itself, not only the 2SLS estimate
Report the F-statistic on the excluded instruments, not just its p-value
Check that the sign and magnitude of the first-stage coefficient are plausible
Plot D against Z where feasible
Cluster the first-stage standard errors the same way as the second stage
If F < 20, plan to report weak-instrument-robust inference (Section 12.4)
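A sketch of these checks in R, reusing the hypothetical wages data frame and lottery instrument from above:

```r
first_stage <- lm(educ ~ lottery + exper, data = wages)
summary(first_stage)               # sign and magnitude of the instrument's coefficient

# F-statistic on the excluded instrument: compare the first stage with and without it
restricted <- lm(educ ~ exper, data = wages)
anova(restricted, first_stage)     # the F reported here is the first-stage F on lottery

plot(wages$lottery, wages$educ)    # visual check of the Z-D relationship
```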
Standard Errors
With a valid instrument and large samples, the 2SLS standard errors are consistent. However:
Cluster your standard errors if the instrument varies at a group level (e.g., state policy, lottery cohort)
Use robust standard errors by default to handle heteroskedasticity
With weak instruments, standard errors may be misleading---see Section 12.4
The Overidentification Test (J-Test)
With more instruments than endogenous variables (overidentification), we can partially test instrument validity. The Sargan-Hansen J-test checks whether all instruments give the same answer.
The idea: If all instruments are valid, they should all point to the same β. The J-test checks whether the instruments disagree more than sampling variation would explain.
Implementation:
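A minimal sketch in R: with two instruments for one endogenous regressor (here the hypothetical lottery and near_college variables), there is one overidentifying restriction, and summary(..., diagnostics = TRUE) reports the Sargan statistic.

```r
library(AER)
fit_over <- ivreg(log_wage ~ educ + exper | lottery + near_college + exper,
                  data = wages)
summary(fit_over, diagnostics = TRUE)  # the "Sargan" row is the J-test
```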
Interpretation: Under the null (all instruments valid), the J-statistic is χ2 with degrees of freedom equal to the number of overidentifying restrictions (number of instruments minus number of endogenous variables). Rejection suggests at least one instrument is invalid.
Box: Why the J-Test Has Limited Value
The overidentification test sounds appealing—a way to test the untestable exclusion restriction. But it has severe limitations:
1. It tests consistency, not validity If all instruments are invalid in the same direction, they will "agree" and the J-test will pass. Example: If both father's and mother's education affect child's earnings directly (not just through child's education), but both biases are upward, the J-test won't detect the problem.
2. Rejection is ambiguous If the J-test rejects, you know something is wrong—but not which instrument. With three instruments, one, two, or all three could be invalid.
3. Non-rejection proves nothing Passing the J-test does not mean your instruments are valid. It only means they're consistent with each other.
4. Power is often low The test may fail to reject even when instruments are moderately invalid, especially with weak instruments.
Practical guidance: Report the J-test if you have multiple instruments, but don't treat non-rejection as validation. The J-test is necessary but far from sufficient for credibility. The real work is defending exclusion substantively.
12.4 Weak Instruments
The Problem
An instrument is "weak" if it explains little of the variation in D. Formally, the first-stage F-statistic is small.
Weak instruments cause three problems:
Bias: The IV estimator is biased toward OLS in finite samples
Variance: Standard errors become large and unstable
Inference: t-tests and confidence intervals may be misleading
The severity depends on how weak the instrument is. Stock, Wright & Yogo (2002) showed that F = 10 is approximately the threshold below which conventional inference becomes unreliable.

Detection
Test for weak instruments using the first-stage F-statistic:
F > 100: strong instrument
20 < F < 100: probably adequate
10 < F < 20: borderline; consider robust inference
F < 10: weak; do not trust standard 2SLS
Robust Inference
When instruments may be weak, use weak-instrument-robust methods:
Anderson-Rubin (AR) test: Tests the null H0: β1 = b for any hypothesized value b. Invert to get confidence intervals. Valid regardless of instrument strength.
Conditional likelihood ratio (CLR): More efficient than AR with multiple instruments.
tF adjustment: Lee et al. (2022) propose adjusting critical values based on first-stage F.
In Stata, these tests are available through community-contributed commands. A minimal sketch of the Anderson-Rubin approach in R follows.
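This sketch hand-rolls the AR test for the just-identified, no-controls case on simulated data (illustrative only): for each hypothesized value b, regress y − b·d on the instrument and keep the values of b that are not rejected.

```r
set.seed(42)
n <- 2000
u <- rnorm(n)
z <- rnorm(n)
d <- 0.2 * z + u + rnorm(n)    # deliberately modest first stage
y <- 0.10 * d + u + rnorm(n)   # true effect is 0.10

ar_pvalue <- function(b) {
  # Under H0: beta1 = b, the adjusted outcome y - b*d should be unrelated to z
  resid0 <- y - b * d
  summary(lm(resid0 ~ z))$coefficients["z", "Pr(>|t|)"]
}

grid <- seq(-0.5, 0.7, by = 0.005)
keep <- sapply(grid, ar_pvalue) > 0.05
range(grid[keep])   # 95% AR confidence set (may be wide, unbounded, or disconnected)
```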
Common Pitfall: The "Just Significant" First Stage
Some researchers accept any statistically significant first stage as adequate. This is wrong. An instrument can be significant but still weak. Focus on the F-statistic, not the p-value.
How to avoid: Report F-statistics. If F < 20, seriously consider weak-instrument-robust methods.
Alternative Estimators: LIML and Fuller
With weak instruments, 2SLS is biased toward OLS. Alternative estimators can reduce this bias:
Limited Information Maximum Likelihood (LIML)
LIML is an alternative to 2SLS that is median-unbiased—its median equals the true parameter even with weak instruments. The bias of 2SLS is proportional to the number of instruments; LIML's bias does not depend on the number of instruments.
β^LIML = [X′Z(Z′Z)^(−1) Z′X − κ^ X′M_Z X]^(−1) [X′Z(Z′Z)^(−1) Z′Y − κ^ X′M_Z Y]
where κ^ is the smallest root of a generalized eigenvalue problem, M_Z = I − Z(Z′Z)^(−1)Z′ is the annihilator matrix for Z, and X collects the right-hand-side regressors.
Intuition: LIML can be understood as 2SLS applied to a transformed model where the first-stage residual variance is scaled appropriately. This scaling corrects the finite-sample bias.
Fuller's Modified LIML
Fuller (1977) proposed a modification that reduces LIML's variance at the cost of slightly more bias. The Fuller estimator replaces κ^ with κ^ − c/(n − K), where n is the sample size; c = 1 and c = 4 are common choices.
Fuller(1): Less biased than 2SLS, lower variance than LIML
Fuller(4): Even lower variance, slightly more bias
2SLS: bias toward OLS with weak instruments; lowest variance (if F > 100); best with strong instruments only
LIML: median-unbiased; variance higher than 2SLS; best with weak instruments when inference is the focus
Fuller(1): bias less than 2SLS; variance between 2SLS and LIML; best with weak instruments when MSE is the focus
Practical guidance: When the first-stage F-statistic is between 10 and 50, report both 2SLS and LIML. If they differ substantially, the instruments are likely too weak for reliable inference under any method.
12.5 LATE: What Are We Actually Estimating?
Heterogeneous Treatment Effects
So far we assumed a constant effect β1. But treatment effects may vary across people. Some people gain a lot from education; others gain little.
When effects are heterogeneous, what does IV estimate?
Compliers, Always-Takers, Never-Takers
Consider a binary instrument Z and binary treatment D. Define four groups:
Writing Di(z) for individual i's treatment when the instrument is set to z:
Compliers: Di(0) = 0, Di(1) = 1 --- take treatment only when induced by Z
Always-takers: Di(0) = 1, Di(1) = 1 --- always take treatment
Never-takers: Di(0) = 0, Di(1) = 0 --- never take treatment
Defiers: Di(0) = 1, Di(1) = 0 --- do the opposite of Z
The draft lottery example:
Compliers: Men who enrolled in college because of draft risk
Always-takers: Men who would have attended college regardless
Never-takers: Men who would not have attended college regardless
Defiers: Men who dropped out because of draft risk (assumed rare)

Box: Why Monotonicity Matters—The Problem of Defiers
The monotonicity assumption (Di(1)≥Di(0) for all i) rules out defiers. This assumption often receives less attention than exclusion, but violations can be equally devastating.
What goes wrong with defiers: The Wald estimator divides the reduced-form effect by the first stage: β^IV = (E[Y∣Z=1] − E[Y∣Z=0]) / (E[D∣Z=1] − E[D∣Z=0])
With defiers present, the denominator includes offsetting effects:
Compliers: D goes from 0 to 1 when Z=1 (positive contribution)
Defiers: D goes from 1 to 0 when Z=1 (negative contribution)
The first stage is now the net effect. If compliers and defiers have different treatment effects, IV estimates a weighted average where defiers receive negative weight. The result can fall outside the range of any individual's treatment effect.
Example: Suppose a job training instrument encourages most workers to enroll (compliers), but makes a few workers suspicious and refuse (defiers). If defiers are high-ability workers who would have benefited most from training, the IV estimate will be biased downward—it subtracts the defiers' large effects from the compliers' smaller effects.
When to worry:
Instruments that induce both positive and negative responses in subgroups
Policies with both "nudge" and "reactance" effects
Price changes where some consumers increase and others decrease consumption
What to do:
Argue substantively that defiers are implausible or rare
Look for subgroups where defiers might exist and test for sign reversals
Consider partial identification approaches that allow for defiers (Huber and Mellace 2015)
The Local Average Treatment Effect
Theorem 12.1 (LATE; Imbens & Angrist 1994)
Under the assumptions of relevance, exclusion, and monotonicity (no defiers), IV identifies:
βIV=E[Y1−Y0∣Compliers]
The local average treatment effect---the average effect for those whose treatment status is changed by the instrument.
Intuition: IV uses variation induced by Z. This variation only affects compliers. Always-takers and never-takers contribute no variation. So IV estimates the effect for compliers only.
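A small simulation sketch (all values invented for illustration) shows the theorem at work: with always-takers, never-takers, and compliers who have different treatment effects, the Wald ratio lands on the complier average rather than the population average.

```r
set.seed(123)
n    <- 100000
type <- sample(c("complier", "always", "never"), n, replace = TRUE,
               prob = c(0.3, 0.3, 0.4))
z    <- rbinom(n, 1, 0.5)                                      # randomized instrument
d    <- ifelse(type == "always", 1, ifelse(type == "never", 0, z))
tau  <- ifelse(type == "complier", 2, ifelse(type == "always", 5, 1))  # heterogeneous effects
y    <- 1 + tau * d + rnorm(n)

wald <- (mean(y[z == 1]) - mean(y[z == 0])) /
        (mean(d[z == 1]) - mean(d[z == 0]))
c(wald = wald, complier_average = 2, population_average = mean(tau))  # wald ~ 2, not ~ 2.5
```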
Implications
LATE has profound implications:
External validity: IV estimates apply to compliers, who may not be representative of the population. Draft lottery compliers were probably marginal college-goers. Their returns may differ from average returns.
Different instruments, different estimates: Two valid instruments can give different estimates if they identify different groups of compliers. This is not a contradiction---it is a feature.
Policy relevance: LATE may or may not be policy-relevant, depending on whether the policy affects the same people as the instrument.

Running Example: Who Are the Compliers?
In the draft lottery study, compliers were men whose college enrollment depended on draft risk. These were likely:
Men from families where college was possible but not certain
Men for whom military service was particularly unappealing
Their returns to education may exceed the population average if marginal college-goers benefit more from the credential signal.
12.6 Dose-Response with IV
Beyond Binary Treatment
Many treatments are continuous: years of education, dosage of medication, hours of training. Can IV handle continuous treatments?
Yes, with additional assumptions. The standard approach:
Assume a linear relationship between D and Y
Use IV to estimate the slope
This gives the effect of a one-unit increase in treatment, averaged across whatever shifts the instrument induces.
Control Function Approach
An alternative is the control function approach:
First stage: D=π0+π1Z+ν
Outcome equation: Y=β0+β1D+ρν^+u
The residual ν^ captures the endogenous variation in D. Including it as a control removes the bias. The coefficient β1 on D gives the causal effect (a minimal sketch appears after the list of advantages below).
Advantages:
Clearer about what endogeneity you are correcting
Easier to test for endogeneity (ρ=0?)
More flexible for nonlinear first stages
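A sketch of the two-step control function on simulated data (illustrative only). In this linear case the point estimate on D coincides with 2SLS; note that the second-step standard errors ignore the fact that ν^ is estimated, so in practice they should be bootstrapped.

```r
set.seed(99)
n <- 5000
u <- rnorm(n)
z <- rnorm(n)
d <- 1 + 0.6 * z + u + rnorm(n)
y <- 2 + 0.10 * d + u + rnorm(n)   # true effect is 0.10

v_hat <- resid(lm(d ~ z))     # first-stage residual
cf    <- lm(y ~ d + v_hat)    # control-function outcome equation

coef(cf)["d"]        # estimate of the causal effect beta1 (matches 2SLS here)
coef(cf)["v_hat"]    # estimate of rho; testing rho = 0 is a test of endogeneity
```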
12.7 Shift-Share (Bartik) Instruments
Shift-share instruments have become one of the most widely used identification strategies in applied economics, particularly in labor, trade, and migration research. Understanding their structure and the ongoing debates about their validity is essential for modern empirical work.
The Basic Structure
A shift-share instrument combines:
Shares: Pre-determined exposure weights (e.g., initial industry employment shares in a region)
Shocks: Aggregate changes (e.g., national industry growth rates)
The instrument for region r at time t is:
Z_rt = Σ_k s_rk,t0 × g_kt
where:
s_rk,t0 = region r's share of industry k at baseline t0
g_kt = national (leave-one-out) growth in industry k at time t
The Classic Application: Bartik (1991)
Timothy Bartik studied how local labor demand affects wages and employment. The challenge: local labor demand is endogenous to local wages.
Solution: Construct predicted labor demand growth by interacting:
Each region's initial industry composition (shares)
National industry growth rates (shocks)
Regions with more manufacturing are predicted to grow faster when national manufacturing booms—not because of anything special about that region, but because of its pre-determined exposure to national trends.
Two Views on Identification
A major debate concerns the source of identifying variation:
1. Exogenous Shares (Goldsmith-Pinkham, Sorkin & Swift 2020)
Identification comes from the shares being uncorrelated with unobserved regional characteristics. The estimator is equivalent to a GMM estimator using each industry share as a separate instrument.
Assumption: Baseline shares are as good as randomly assigned
Test: Check balance of shares against pre-trends and observables
Best suited for: Settings where initial specialization patterns are plausibly exogenous
2. Exogenous Shocks (Borusyak, Hull & Jaravel 2022)
Identification comes from the shocks being exogenous—the national industry trends are independent of region-specific factors.
Assumption: Shocks are as good as randomly assigned across industries
Test: Check balance of shocks; clustering at shock level
Best suited for: Settings with many quasi-random shocks (e.g., trade shocks from policy changes)
Practical Implementation
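A minimal construction sketch in R (all objects simulated and hypothetical): emp0 and emp1 are region-by-industry employment matrices at baseline and endline, and the instrument for each region sums baseline shares times leave-one-out national industry growth.

```r
set.seed(1)
R <- 50; K <- 20
emp0 <- matrix(rexp(R * K), nrow = R, ncol = K)             # baseline employment (regions x industries)
emp1 <- emp0 * matrix(exp(rnorm(R * K, 0.02, 0.10)), R, K)  # endline employment

shares <- emp0 / rowSums(emp0)                              # baseline shares s_rk

# Leave-one-out national growth of industry k, excluding the focal region r
loo_growth <- matrix(NA_real_, R, K)
for (r in 1:R) {
  loo_growth[r, ] <- (colSums(emp1[-r, ]) - colSums(emp0[-r, ])) / colSums(emp0[-r, ])
}

bartik <- rowSums(shares * loo_growth)                      # Z_r = sum_k s_rk * g_k(-r)
head(bartik)
```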
Key Considerations
Inference: Standard errors must reflect the structure:
If relying on shock exogeneity: cluster at the shock level (industry)
If relying on share exogeneity: cluster at the unit level (region)
Exposure-robust standard errors (Adão, Kolesár & Morales 2019) are often appropriate
Leave-one-out: Always construct national shocks excluding the focal region to avoid mechanical correlation.
Many weak shocks: With many small shocks, aggregation helps. With few dominant shocks, the instrument may be weak.
Examples in the Literature
Autor, Dorn & Hanson (2013): shares = initial industry shares; shocks = Chinese import growth; finding: the China trade shock reduced manufacturing employment
Card (2001): shares = initial immigrant shares; shocks = immigrant inflows by origin; finding: immigration affects local wages
Nakamura & Steinsson (2014): shares = regional military spending shares; shocks = national defense spending; use: fiscal multiplier estimation
When to Use Shift-Share
Clear shock source (policy, trade, technology): favors shock-based identification
Pre-period shares are quasi-random: favors share-based identification
Many small industries/shocks: aggregation provides power
Few dominant shocks: may have weak-instrument problems
Shares and shocks both questionable: consider alternative strategies
Warning: Shift-share instruments are not a "free lunch." The identifying assumptions—whether on shares or shocks—must be defended. The popularity of this approach has led to applications where neither shares nor shocks are plausibly exogenous.
Practical Guidance
When to Use IV
Strong, clearly exogenous instrument available: yes --- the ideal case
Instrument's exclusion is debatable: maybe --- depends on the quality of the argument
First stage F < 10: caution --- use robust methods, report wide bounds
Multiple weak instruments available: maybe --- consider LIML or regularization
No plausible instrument exists: no --- do not invent one; consider bounds
Common Pitfalls
Pitfall 1: Assuming Any Correlation Is Causal
The fact that Z predicts D does not make Z a valid instrument. Many predictors are themselves endogenous.
How to avoid: Articulate why Z is exogenous. What is the source of randomness?
Pitfall 2: Ignoring the Exclusion Restriction
The exclusion restriction cannot be tested. Researchers sometimes treat it as automatically satisfied because Z is "random."
How to avoid: Think hard about channels. Could Z affect Y directly? Through other variables?
Pitfall 3: Overinterpreting LATE
LATE applies to compliers, not the population. Policy effects may differ.
How to avoid: Characterize compliers when possible. Discuss external validity.
Implementation Checklist
State the instrument and the source of its exogenous variation
Defend the exclusion restriction channel by channel
Report the first stage: coefficients and the F-statistic on the excluded instruments
Use built-in IV routines with robust (and, where appropriate, clustered) standard errors
If overidentified, report the J-test without treating non-rejection as validation
If the first stage is weak, report Anderson-Rubin or CLR inference
Characterize the compliers and discuss what the LATE does and does not cover
Qualitative Bridge
How Qualitative Methods Complement IV
IV provides a point estimate for compliers---but leaves much unknown:
Who are the compliers? IV identifies a causal effect for an unknown subgroup. Qualitative research can characterize this group through interviews or case studies.
Is the exclusion restriction plausible? The exclusion restriction is an assumption about mechanisms. Qualitative investigation of how the instrument operates can strengthen or weaken the case.
What mechanisms drive the effect? IV tells us that D causes Y, not how. Process tracing and qualitative case studies can illuminate mechanisms.
Example: Understanding Draft Lottery Compliers
Card and Lemieux (2001) complemented the draft lottery IV with detailed investigation of who was affected by draft risk. They examined:
Enrollment patterns by socioeconomic status
Timing of enrollment decisions
Subsequent labor market behavior
This qualitative work helped interpret what the IV estimate meant and for whom it was relevant.
Integration Note
Connections to Other Methods
Selection on observables (Ch. 11): IV handles unobserved confounding; SOO requires observed confounders
RDD, fuzzy (Ch. 14): fuzzy RDD is IV at a threshold
Time series IV (Ch. 16): external instruments in SVARs
Bounds (Ch. 17): when IV fails, bounds may still apply
Triangulation Strategies
IV estimates should ideally be compared with:
Other instruments: Do different sources of variation give similar answers?
SOO with sensitivity: How much selection would be needed to explain the IV result?
Experiments: When available, do experiments confirm the IV magnitude?
The returns to education literature illustrates this well: draft lottery, compulsory schooling, and twins studies all give broadly similar estimates, strengthening confidence in the finding.
Summary
Key takeaways:
IV uses exogenous variation in an instrument to identify causal effects when confounding is unobserved
Validity requires relevance (testable) and exclusion (assumed)
Weak instruments bias IV toward OLS and distort inference
With heterogeneous effects, IV estimates LATE---the effect for compliers
Different instruments can give different answers because they identify different compliers
Returning to the opening question: Yes, we can learn about causal effects even with unobserved confounding---if we can find an external source of variation that shifts treatment without directly affecting outcomes. The challenge is finding such instruments and defending their validity.
Further Reading
Essential
Angrist, Imbens & Rubin (1996). "Identification of Causal Effects Using Instrumental Variables." JASA.
Imbens (2014). "Instrumental Variables: An Econometrician's Perspective."
For Deeper Understanding
Stock, Wright & Yogo (2002). "A Survey of Weak Instruments." Journal of Business & Economic Statistics.
Angrist & Krueger (2001). "Instrumental Variables and the Search for Identification." JEP.
Advanced/Specialized
Andrews, Stock & Sun (2019). "Weak Instruments in IV Regression: Theory and Practice."
Mogstad, Torgovitsky & Walters (2021). "The Causal Interpretation of Two-Stage Least Squares with Multiple Instruments."
Applications
Angrist (1990). "Lifetime Earnings and the Vietnam Era Draft Lottery." AER.
Card (1995). "Using Geographic Variation in College Proximity to Estimate the Return to Schooling."
Angrist & Krueger (1991). "Does Compulsory School Attendance Affect Schooling and Earnings?" QJE.
Exercises
Conceptual
Explain in your own words why the exclusion restriction is necessary for IV to identify causal effects. Give an example where the exclusion restriction is likely violated.
Two researchers use different instruments to estimate the effect of education on earnings. Researcher A gets β^=0.10; Researcher B gets β^=0.05. Both instruments appear valid. Can both estimates be correct? Explain.
Applied
Using data from [Angrist data archive], replicate the draft lottery IV estimate. Report:
First stage results and F-statistic
2SLS estimate and standard error
Interpretation: who are the compliers?
Conduct a weak instrument sensitivity analysis. What happens to your estimates as you (artificially) weaken the first stage?
Discussion
A researcher proposes using "distance to nearest casino" as an instrument for gambling behavior when studying gambling's effect on household finances. Evaluate this instrument: Is it relevant? Is the exclusion restriction plausible? What concerns would you raise?
Technical Appendix: Derivations
A.1 Derivation of the Wald Estimator
Starting from the structural equation Y=β0+β1D+ε, take the covariance with Z:
Cov(Z,Y)=Cov(Z,β0)+β1Cov(Z,D)+Cov(Z,ε)
The first term is zero (covariance with constant). Assume Cov(Z,ε)=0 (exclusion). Then:
Cov(Z,Y)=β1Cov(Z,D)
Solving:
β1 = Cov(Z,Y) / Cov(Z,D)
A.2 Asymptotic Distribution
Under standard regularity conditions, the 2SLS estimator is asymptotically normal:
√n (β^2SLS − β) →d N(0, V)
where V = σ² [plim n^(−1) X′Z(Z′Z)^(−1) Z′X]^(−1), X is the matrix of regressors (endogenous and exogenous), and Z is the matrix of instruments.
Draft version. Comments welcome.