Chapter 1: The Empirical Enterprise

Opening Question

What can we actually learn from data about the social world—and what must remain assumption?


Chapter Overview

Empirical research promises to settle debates with evidence. Yet economics, political science, and sociology are littered with controversies that data alone have not resolved. Does the minimum wage reduce employment? Do cash transfers help people escape poverty traps? Does democracy cause growth? Despite mountains of data, decades of research and thousands of studies, confident answers to these and many similar questions remain elusive.

This is not a failure of empirical methods—it is a reflection of what empirical methods can and cannot do. This chapter develops a framework for understanding the scope and limits of empirical inquiry. We examine what types of questions data can answer, what role theory plays, how the "credibility revolution" reshaped applied research, and why research design matters more than statistical technique.

What a close reader of this book will learn:

  • The distinction between description, causation, prediction, and mechanism.

  • Why identification requires assumptions that cannot be tested with data alone.

  • How the "credibility revolution" shifted emphasis away from statistical technique and toward research design.

  • The so-called "structural versus reduced-form" debate—and why it is often overstated.

  • How to choose methods based on research questions, not disciplinary fashion.


Figure 1.1: The Empirical Research Cycle. Empirical research is an iterative process: questions lead to theory, theory guides data collection, data enables analysis, analysis generates interpretations, and interpretations raise new questions.


1.1 What Empirical Research Can Tell Us

Four Types of Questions

Empirical research addresses different types of questions, each with distinct requirements:

Description: What is happening?

  • What is the unemployment rate?

  • How has income inequality changed over time?

  • What fraction of households have internet access?

Description seems straightforward but it is not. It requires defining concepts (what counts as "unemployed"?), measuring them reliably (and repeatably), and aggregating across heterogeneous populations. Good description is the foundation of all empirical work.

Prediction: What will happen?

  • Which loan applicants will default?

  • What will inflation be next quarter?

  • Which patients will respond to treatment?

Prediction asks about future or unobserved outcomes. It does not require understanding why something happens—only that the pattern holds. Machine learning optimizes predictive accuracy without demanding causal knowledge.

Warning: The Policy-Prediction Fallacy

A predictive model can forecast outcomes accurately without identifying causes. This creates a dangerous temptation: using predictive relationships to guide interventions.

Example: A model predicts loan default using credit score, income, and zip code. The model works well—zip codes with more defaults in the training data reliably predict defaults in new data. A bank executive concludes: "If we could just improve zip codes with high default rates, we'd reduce defaults."

This is the policy-prediction fallacy. Zip code predicts default but doesn't cause it. Intervening on zip code (say, by encouraging people to move) won't reduce defaults. The correlation exists because zip code proxies for unobserved factors (wealth, job stability, local economic conditions) that actually cause default.

The general pattern: Variables that predict an outcome often do so because they correlate with true causes, not because they are causes. Prediction requires correlation; policy requires causation.

When it matters: Any time you move from "X predicts Y" to "changing X will change Y." Risk scores, targeting algorithms, and forecasting models are tools for prediction, not guides for intervention—unless you have separate causal evidence that the predictors are also causes.
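The fallacy is easy to reproduce in a toy simulation. The sketch below—with entirely hypothetical variables and numbers—generates defaults from an unobserved cause (job stability) for which a "zip-code risk" score is only a proxy: the proxy predicts default well, yet intervening on it while the true cause stays fixed leaves the default rate unchanged.

```python
# A minimal sketch of the policy-prediction fallacy (all names and numbers hypothetical).
# "zip_risk" proxies an unobserved cause (job stability), so it predicts default
# without causing it; intervening on the proxy does nothing to the default rate.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

job_stability = rng.normal(size=n)                            # unobserved true cause
zip_risk = -0.8 * job_stability + 0.6 * rng.normal(size=n)    # proxy observed by the bank
default = ((-1.5 * job_stability + rng.normal(size=n)) > 1.0).astype(float)

# 1) Prediction: the proxy is strongly associated with default.
corr = np.corrcoef(zip_risk, default)[0, 1]
print(f"corr(zip_risk, default) = {corr:.2f}")

# 2) Intervention: change zip_risk (e.g., relocate borrowers) while holding the
# true cause fixed. Defaults are still generated by job stability, so the rate
# is unchanged.
default_after = ((-1.5 * job_stability + rng.normal(size=n)) > 1.0).astype(float)
print(f"default rate before: {default.mean():.3f}, "
      f"after intervening on zip_risk: {default_after.mean():.3f}")
```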

Causation: What is the effect of X on Y?

  • Does education increase earnings?

  • Does immigration reduce native wages?

  • Does monetary policy affect output?

Causal questions ask about interventions. They require knowing what would happen if we changed X—a counterfactual we never directly observe. This is why causation is hard: the fundamental problem of causal inference is that we cannot see the road not taken.

Mechanism: How and why does X affect Y?

  • Through what channels does education raise earnings?

  • Why do some development programs work and others fail?

  • What explains the transmission of monetary policy?

  • How does smoking cause cancer?

Mechanism questions ask about the process linking cause to effect. Answering them typically requires both causal evidence and theoretical structure. A treatment effect tells us that something works; mechanism tells us why.

Figure 1.2: Four Types of Empirical Questions. Description, prediction, causation, and mechanism each have different data requirements and methodological approaches. The key is matching your method to your question.

The Hierarchy Is Misleading

These categories are often presented as a hierarchy, with causation at the top and description at the bottom. This is wrong. The value of evidence depends on the question.

For a policymaker deciding where to allocate resources, accurate prediction may matter more than causal understanding. For a scientist developing theory, mechanism is essential even if prediction is poor. For a journalist informing the public, careful description may be the highest contribution.

Key Insight: Match the method to the question. The best analysis is the one that answers the question actually being asked—not the one that demonstrates the most sophisticated technique.

What Data Cannot Tell Us

Some questions appear empirical but resist empirical resolution:

Value questions: Should we prioritize equality or efficiency? Data can inform tradeoffs but cannot resolve fundamental value disagreements.

Scope conditions: Does a finding apply to a new context? Evidence from one setting cannot definitively answer questions about another, though theory and triangulation help.

Questions without variation: If everyone received the treatment, we cannot estimate what would have happened without it—no matter how much data we have.

Deep parameters: Some quantities (like risk aversion or time preference) are defined within theoretical frameworks. Their "true" values depend on assumptions that data cannot adjudicate.


1.2 The Fundamental Problem of Causal Inference

Counterfactuals and Potential Outcomes

The modern framework for causal inference, developed by Donald Rubin and extended by many others, formalizes causation in terms of potential outcomes.

For each individual $i$ and treatment $D_i$:

  • $Y_i(1)$ = outcome if treated

  • $Y_i(0)$ = outcome if not treated

  • The causal effect for individual $i$ is $Y_i(1) - Y_i(0)$

The fundamental problem: we observe either $Y_i(1)$ or $Y_i(0)$, never both. The individual causal effect is inherently unobservable.

The Fundamental Problem of Causal Inference

We cannot observe the same unit in both treatment and control states. The counterfactual—what would have happened under the alternative—is missing by definition.

This is not a data problem that more observations can solve. It is a logical feature of causal questions about specific individuals.

How We Make Progress Anyway

Given the fundamental problem, how do we learn about causal effects?

Randomization: If we randomly assign treatment, treated and control groups are comparable in expectation. The average difference in outcomes estimates the average treatment effect.

Credible comparison groups: Without randomization, we seek comparison groups that approximate what would have happened to the treated in the absence of treatment. This requires assumptions—about selection, parallel trends, exclusion restrictions—that cannot be fully verified.

Assumptions fill the gap: Every identification strategy rests on assumptions about the data-generating process. Some are testable implications (balance in experiments, pre-trends in DiD). Most are maintained assumptions that must be defended substantively.
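A minimal simulation, with purely hypothetical numbers, makes the randomization logic concrete: every unit has two potential outcomes but reveals only one, and under random assignment the simple difference in means approximates the average treatment effect.

```python
# Potential outcomes and randomization: a minimal sketch with hypothetical numbers.
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

# Both potential outcomes exist for every unit, but only one is ever observed.
y0 = rng.normal(loc=10, scale=2, size=n)      # outcome without treatment
tau = rng.normal(loc=1.5, scale=1, size=n)    # heterogeneous individual effects
y1 = y0 + tau                                 # outcome with treatment
true_ate = tau.mean()

# Random assignment makes treated and control comparable in expectation.
d = rng.binomial(1, 0.5, size=n)
y_obs = np.where(d == 1, y1, y0)              # the fundamental problem: we only see this

diff_in_means = y_obs[d == 1].mean() - y_obs[d == 0].mean()
print(f"true ATE = {true_ate:.3f}, difference in means = {diff_in_means:.3f}")
```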

The Role of Assumptions

This is the central insight: empirical evidence alone cannot establish causation without assumptions about the world.

Consider three approaches to estimating returns to education:

  1. OLS: Assumes all confounders are observed and controlled

  2. IV: Assumes the instrument affects outcomes only through education

  3. Randomized experiment: Assumes no spillovers, no selective attrition, compliance with assignment

Each rests on different assumptions. Each can be "wrong" if its assumptions fail. The question is not which method is assumption-free (none are) but which assumptions are most defensible in a given context.
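To make the comparison concrete, here is a hedged sketch under a hypothetical data-generating process in which unobserved ability raises both schooling and earnings. OLS then overstates the return, while an instrument satisfying the exclusion restriction recovers it; the instrument, coefficients, and sample sizes are invented for illustration.

```python
# Returns to schooling under confounding: OLS vs. IV on a hypothetical DGP.
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
true_return = 0.10                    # assumed true causal effect of a year of schooling

ability = rng.normal(size=n)          # unobserved confounder
z = rng.binomial(1, 0.5, size=n)      # instrument: shifts schooling, excluded from wages
school = 12 + 2.0 * ability + 1.0 * z + rng.normal(size=n)
log_wage = 1.0 + true_return * school + 0.3 * ability + rng.normal(scale=0.5, size=n)

def slope(x, y):
    """Bivariate OLS slope of y on x."""
    return np.cov(x, y, bias=True)[0, 1] / np.var(x)

# OLS is biased upward because ability is omitted.
print(f"OLS estimate: {slope(school, log_wage):.3f}")

# IV with one instrument (Wald ratio): cov(z, y) / cov(z, school).
iv = np.cov(z, log_wage, bias=True)[0, 1] / np.cov(z, school, bias=True)[0, 1]
print(f"IV estimate:  {iv:.3f}   (true value {true_return:.2f})")
```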

Every Method Has Assumptions

Method | Key Assumption
--- | ---
OLS | $E[\varepsilon \mid X] = 0$ (no omitted confounders)
IV | Exclusion restriction (instrument affects Y only through D)
DiD | Parallel trends (groups would evolve similarly absent treatment)
RDD | No manipulation at threshold
RCT | SUTVA (no interference between units)


1.3 The Role of Theory

Theory as Guide

Theory helps in multiple ways:

Identifying relevant questions: What variables matter? What relationships should we investigate?

Providing functional forms: Should the relationship be linear, logarithmic, or something else?

Restricting the hypothesis space: Good theory rules out possibilities, making identification easier.

Generating testable predictions: Theories that cannot be distinguished empirically are not useful for applied work.

Informing external validity: Theory helps us understand when and why findings might generalize.

Theory as Straitjacket

But theory can also mislead:

Over-constraining: Imposing theoretical structure on data may obscure important patterns. If the theory is wrong, theory-driven estimates are biased.

Motivated reasoning: It is easy to construct post-hoc theoretical justifications for any finding. Theory should discipline analysis, not rationalize it.

False precision: Structural models produce point estimates even when the underlying assumptions are uncertain. The appearance of precision may hide deep uncertainty.

When to Let Data Speak

There is no formula for how much theory is just right. But some guidelines help:

Explore before imposing structure: Descriptive analysis and exploratory work should precede structural modeling. Let the data reveal possibilities before constraining them.

Use theory for identification, not convenience: Theoretical restrictions should be economically meaningful, not chosen because they simplify estimation.

Test overidentifying restrictions: When theory generates testable implications, test them. Rejections are informative.

Be explicit about what is assumed versus estimated: Tell the reader which parameters are estimated from the data and which are calibrated or assumed.


Economics as a Natural Science

Don Ross (2014), in his Philosophy of Economics, argues against the idea that economics is a distinct domain with its own unique rules of evidence. Instead, economics is part of the broader scientific enterprise, where the goal is to identify "real patterns" (following Daniel Dennett). Econometrics is the tool for detecting these stable patterns in noisy data.

But—crucially for our pluralist argument—Ross warns against "reductive" econometrics that ignores the cognitive and institutional structures that generate the data. A "scientific" economics requires coordinating evidence from multiple levels: the individual (psychology/neuroscience), the market (econometrics), and the institution (sociology/history). This supports the view that no single method is sufficient; to find "real patterns," we need to triangulate across these levels.

1.4 The Credibility Revolution

From Specification Searches to Research Design

In 1983, Edward Leamer published a provocative critique of econometric practice. He showed that researchers could obtain almost any result they wanted by choosing among plausible specifications—selecting control variables, functional forms, and sample definitions to achieve desired conclusions. Since the "specification search" was typically invisible to readers, reported confidence levels were meaningless.

Leamer called for researchers to report results across many specifications, showing sensitivity to choices. But his deeper message was more profound: statistical technique cannot rescue a bad research design.
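In that spirit, the sketch below—built on a hypothetical data-generating process with made-up control names—runs the same regression across every subset of candidate controls and reports the full range of estimates rather than a single preferred specification.

```python
# A minimal "report every specification" exercise in the spirit of Leamer's advice.
# The data-generating process and control variables are hypothetical.
import itertools
import numpy as np

rng = np.random.default_rng(3)
n = 20_000

controls = {name: rng.normal(size=n) for name in ["c1", "c2", "c3"]}
x = 0.5 * controls["c1"] + rng.normal(size=n)                   # regressor of interest
y = 1.0 * x + 0.8 * controls["c1"] - 0.5 * controls["c2"] + rng.normal(size=n)

names = list(controls)
for k in range(len(names) + 1):
    for subset in itertools.combinations(names, k):
        # OLS of y on a constant, x, and the chosen subset of controls
        X = np.column_stack([np.ones(n), x] + [controls[c] for c in subset])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        print(f"controls {subset or ('none',)}: estimated effect of x = {beta[1]:.3f}")
```

The spread of estimates across rows—not any single row—is the honest summary of what the data say under different control choices.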

The "credibility revolution" that followed, associated with Angrist, Krueger, Card, Imbens, and others, took this lesson to heart. The new emphasis:

Research design over statistical technique: A clever estimator cannot fix fundamental identification problems. Start with a design that would convince a skeptic.

Transparency about assumptions: Make identifying assumptions explicit. Defend them substantively, not just statistically.

Natural experiments and quasi-experiments: Exploit situations where nature, policy, or circumstance creates variation that approximates random assignment.

Graphical evidence: Show the variation driving identification. Plots of event studies, regression discontinuities, and first stages communicate more than tables of coefficients.

The Credibility Revolution Toolkit

The revolution produced a standard toolkit for applied microeconomic work on causal questions:

  1. Randomized experiments (Chapter 10): The "gold standard" when feasible

  2. Instrumental variables (Chapter 12): Exploit exogenous variation to handle unobserved confounding

  3. Difference-in-differences (Chapter 13): Use within-unit variation over time

  4. Regression discontinuity (Chapter 14): Exploit threshold rules for identification

  5. Synthetic control (Chapter 15): Construct comparison groups from donors

Each method has clear identifying assumptions. The credibility of research derives from the credibility of these assumptions, not from the sophistication of the statistics.
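As an illustration of how little machinery the basic designs need, here is a minimal two-group, two-period difference-in-differences sketch in which parallel trends hold by construction; all numbers are hypothetical, and the realistic complications are taken up in Chapter 13.

```python
# A minimal 2x2 difference-in-differences, with parallel trends built into the DGP.
import numpy as np

rng = np.random.default_rng(4)
n = 10_000
true_effect = 2.0                                    # hypothetical treatment effect

group = rng.binomial(1, 0.5, size=n)                 # 1 = eventually treated
y_pre = 5 + 3 * group + rng.normal(size=n)           # level differences are allowed
common_trend = 1.5                                   # both groups share this trend
y_post = y_pre + common_trend + true_effect * group + rng.normal(size=n)

did = (y_post[group == 1].mean() - y_pre[group == 1].mean()) \
    - (y_post[group == 0].mean() - y_pre[group == 0].mean())
print(f"DiD estimate = {did:.3f} (true effect {true_effect})")
```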

Figure 1.3: The Credibility Spectrum. Identification strategies arranged by the strength of assumptions required. Methods on the left require fewer assumptions (higher internal validity) but may have lower external validity. All methods require some assumptions—the question is which are most defensible in your context.

What the Revolution Achieved

The credibility revolution improved empirical practice substantially:

Higher standards for causal claims: Referees and readers now demand clear identification strategies. "I controlled for everything" no longer convinces.

Replicable analysis: Pre-registration, replication packages, and transparency norms make it harder to hide specification searches.

Cumulative knowledge: Well-identified findings build on each other. We know more about returns to education, effects of minimum wages, and impacts of cash transfers than we did thirty years ago.

What the Revolution Did Not Solve

But the revolution has limits:

External validity: Clean identification often comes from specific contexts (a particular policy, a particular time, a particular complier group). How findings generalize remains uncertain.

Mechanism and theory: Reduced-form estimates tell us that something works, not why. Policy design often requires understanding mechanisms.

Many of the most important questions resist clean identification: Aggregate effects of trade, long-run impacts of institutions, macroeconomic policy—these questions may not admit credibility-revolution-style answers.

Publication bias persists: The incentive to find significant effects has not disappeared. It may have shifted to finding clean natural experiments with significant results.

What Explains China's Growth?

To see the limits of the credibility revolution—and the continued need for methodological pluralism—consider arguably the most important economic question of the last half-century:

Running Example: China's Post-1978 Economic Growth

Between 1978 and 2020, China's GDP per capita grew from approximately $300 to over $10,000 (in constant dollars)—the fastest sustained growth in human history. Hundreds of millions of people escaped poverty. The global economy was transformed.

What caused this? Candidate explanations include (but are not limited to):

  • Economic reforms (market liberalization, property rights, dual-track pricing)

  • Integration into global trade (WTO accession, export orientation)

  • Investment in human and physical capital

  • Demographic dividend (large working-age population)

  • Institutional innovation (Special Economic Zones, township-village enterprises)

  • Geography and initial conditions

This question matters enormously—for understanding development, for policy in other countries, for geopolitics. Yet it resists the credibility revolution's toolkit.

Why clean identification is elusive:

  1. n = 1 at the country level: There is only one China, one Deng Xiaoping. We cannot randomly assign reform packages to Chinas and observe outcomes.

  2. Simultaneity: Reforms, trade, investment, and growth all evolved together. What caused what? Everything happened at once.

  3. General equilibrium: China's growth changed the world economy. Partial equilibrium thinking fails when the "treatment" reshapes the entire system.

  4. Multiple interacting causes: Growth likely resulted from combinations of factors, not a single identifiable treatment. Interaction effects dominate.

What we can do instead:

The question is not unanswerable—but it requires combining methods to piece together the story and weigh the importance of different factors:

  • Description and growth accounting (Chapter 6): Document the facts. How much came from capital accumulation versus productivity? Where did growth occur—coastal cities, rural areas, state or private sector? (A back-of-the-envelope accounting sketch follows this list.)

  • Regional quasi-experiments (Chapter 13): Within China, policies varied across regions and time. Special Economic Zones provide something like a staggered difference-in-differences design. This identifies components of the story, not the whole.

  • Time series analysis (Chapters 7, 16): Identify structural breaks, leading and lagging relationships, and dynamic effects.

  • Comparative case studies (Chapters 1, 23): How does China compare to other fast-growing economies? To similar countries that did not grow? To its own counterfactual trajectory?

  • Theory and mechanism (Chapter 19): Why would these factors cause growth? What do economic models predict? Can we test specific mechanisms?

  • Triangulation (Chapter 23): When multiple approaches point in the same direction, confidence increases—even without a single definitive identification strategy.
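As a flavor of the growth-accounting approach mentioned above, the sketch below applies the standard Cobb-Douglas decomposition $g_Y = \alpha g_K + (1-\alpha) g_L + g_A$. The growth rates and capital share are illustrative placeholders, not estimates for China.

```python
# Growth accounting under a Cobb-Douglas assumption: g_Y = alpha*g_K + (1-alpha)*g_L + g_A.
# All numbers below are illustrative placeholders, not estimates for China.
alpha = 0.4    # assumed capital share
g_y = 0.09     # output growth (hypothetical)
g_k = 0.10     # capital stock growth (hypothetical)
g_l = 0.01     # labor input growth (hypothetical)

g_a = g_y - alpha * g_k - (1 - alpha) * g_l    # Solow residual (TFP growth)
for source, value in {"capital": alpha * g_k,
                      "labor": (1 - alpha) * g_l,
                      "TFP (residual)": g_a}.items():
    print(f"{source:>15}: {value:.3f}  ({value / g_y:.0%} of output growth)")
```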

The lesson for this book:

China's growth illustrates why this book covers all these methods—not just the credibility revolution's greatest hits. Some questions are too important to ignore simply because they are "not identified". The goal is appropriate rigor: combining evidence, acknowledging uncertainty, and building cumulative knowledge through multiple approaches.

We return to China's growth throughout this book—not because we will definitively answer what caused it, but because the question reveals what different methods can and cannot contribute.


1.5 Structural and Reduced-Form Approaches

The Debate

Perhaps no methodological debate has generated more heat than "structural versus reduced-form." At its worst, this debate is tribal: labor economists dismiss structural IO as assumption-laden speculation; IO economists dismiss reduced-form labor as atheoretical dust-bowl empiricism.

At its best, the debate clarifies genuine tradeoffs.

The Case for Structural Models

Structural approaches estimate parameters with behavioral interpretation: elasticities, preferences, costs, strategic responses. Fans argue:

Counterfactual analysis: Structural models simulate policies never observed. What would happen if we changed the tax code in a way no country has tried? Only a model can answer.

Welfare evaluation: To compare policies requires measuring welfare, which requires knowing preferences. Reduced-form effects do not directly translate to welfare.

General equilibrium: Partial equilibrium estimates may miss important feedback effects. Structural models can incorporate equilibrium responses.

External validity through structure: If we estimate fundamental parameters (preferences, technology), they may apply across contexts—even where clean identification is unavailable.

The Case for Reduced-Form Methods

Reduced-form approaches prioritize clean identification over structural interpretation. Proponents argue:

Identification first: Know what you are estimating before adding structure. A well-identified reduced-form effect is more credible than a poorly-identified structural parameter.

Robustness: Reduced-form estimates require fewer assumptions. They are less sensitive to functional form choices.

Transparency: Assumptions are clear and often testable. Readers can assess credibility.

Policy-relevant effects: Average effects are often what policymakers need. Knowing education raises earnings by 10% may be sufficient; we need not know the full production function.

Beyond the Dichotomy

The structural/reduced-form distinction is often overstated. In practice:

Most work lies on a spectrum: Very few studies are purely atheoretical. Very few structural models ignore identification concerns.

The approaches are complements: Use reduced-form evidence to validate structural models. Use structure to interpret reduced-form findings.

The question should drive the choice: If the research question requires counterfactual simulation or welfare analysis, some structure is necessary. If the question asks about a specific treatment effect, reduced-form may suffice.

Design is a Model: Recent work in "design-based econometrics" (e.g., by Peter Hull and coauthors) clarifies that "reduced-form" methods are not model-free. Instead, they shift the modeling burden from the outcome equation (how the economy works) to the assignment mechanism (how the treatment was distributed). Explicitly modeling the design—writing down the probability limits of estimators based on the assignment process—allows researchers to treat identification with the same rigor as structural modeling. This perspective suggests the dichotomy is false: we are always modeling something, whether it is the economic behavior (structure) or the experiment (design).

When to Use Each Approach

Situation | Structural | Reduced-Form
--- | --- | ---
Clean natural experiment available | Support | Primary
Need to simulate counterfactual policy | Primary | Support
Welfare analysis required | Necessary | Insufficient
Mechanism is the question | Often needed | Insufficient alone
External validity concerns paramount | May help | Limited

The best work often combines both: establish causal effects with clean reduced-form designs, then use structure to interpret and extrapolate.


1.6 Internal and External Validity

Definitions

Internal validity: Is the causal effect correctly estimated for the study population? Does the research design successfully isolate the causal effect of interest?

External validity: Does the effect generalize to other populations, settings, or time periods? Will the same intervention have similar effects elsewhere?

The Tradeoff

The credibility revolution prioritized internal validity—for good reason. An externally valid estimate of the wrong thing is useless. But the emphasis on clean identification creates a tension:

The settings that enable clean identification may be unusual: Draft lotteries, compulsory schooling laws, and regression discontinuities occur in specific contexts. Compliers in these natural experiments may not be representative.

Local average treatment effects are local: IV estimates LATE—the effect for compliers whose treatment status is changed by the instrument. This may differ from ATE for the population.

Lab versus field: Laboratory experiments maximize internal validity but may not reflect real-world behavior.
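A small simulation—with invented compliance shares and effects—shows why the LATE point matters: when effects differ across compliance types, the IV (Wald) estimate recovers the complier average effect, which can be far from the population ATE.

```python
# LATE vs. ATE when treatment effects differ across compliance types (hypothetical numbers).
import numpy as np

rng = np.random.default_rng(5)
n = 500_000

# Compliance types: always-takers, never-takers, compliers (no defiers).
types = rng.choice(["always", "never", "complier"], size=n, p=[0.2, 0.3, 0.5])
z = rng.binomial(1, 0.5, size=n)                               # randomized instrument
d = np.where(types == "always", 1, np.where(types == "never", 0, z))

# Effects differ by type: compliers benefit least in this example.
tau = np.where(types == "always", 3.0, np.where(types == "never", 2.0, 1.0))
y = rng.normal(size=n) + tau * d

ate = tau.mean()
wald = (y[z == 1].mean() - y[z == 0].mean()) / (d[z == 1].mean() - d[z == 0].mean())
print(f"population ATE = {ate:.2f}, IV (Wald) estimate = {wald:.2f}, complier effect = 1.00")
```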

Dealing with External Validity

We must acknowledge that the external validity problem is profound and, in a strict sense, unresolvable. As Bates and Glennerster (2017) argue in "The Generalizability Puzzle," a result from one context may not apply to another not because the method failed, but because the "supporting cast" of factors—institutions, market structures, social norms—differs.

No single study can solve this. We cannot simply transport a result from Kenya to India without a theory of how these underlying mechanisms interact with local conditions. Instead, we need to accumulate evidence across many studies to see the mechanisms at work. Progress requires:

Replication with variation: If the same intervention is studied in multiple settings with different underlying conditions, we begin to map the function linking context to effect.

Theoretical priors: If theory explains why an effect occurs (mechanism), we can predict where it should (and should not) apply.

Structured speculation: We must use theory and local knowledge to speculate on how differences in context might alter results, rather than assuming constant effects.

Characterize the study population: Describe who is studied, how they differ from target populations, and why differences might matter.

Bound extrapolations: Rather than point predictions for new contexts, provide ranges acknowledging uncertainty.

Accumulate evidence: Single studies rarely settle questions. Meta-analysis and systematic review aggregate findings across studies.


1.7 From Question to Design

The Research Question as Organizing Principle

Good empirical work starts with a clear question. The question determines:

  • What data are needed

  • What variation identifies the answer

  • What methods are appropriate

  • What assumptions are required

Too often, researchers start with data or methods and go looking for questions. This is backwards. A question-first orientation asks: What do I want to know? What would convince a skeptic? What is the minimum required to answer the question?

Matching Questions to Methods

Different questions require different approaches:

Question Type | Primary Methods | Key Challenge
--- | --- | ---
What is the prevalence of X? | Surveys, measurement | Sampling, measurement error
How has X changed over time? | Time series, repeated cross-sections | Comparability, definition changes
What predicts Y? | Regression, machine learning | Overfitting, spurious correlation
What is the effect of X on Y? | Experiments, quasi-experiments | Identification, confounding
Why does X affect Y? | Mediation analysis, structural models | Mechanism identification
Where will the effect hold? | Meta-analysis, replication, theory | External validity

The Honest Researcher's Checklist

Before conducting an empirical analysis, ask:

  1. What is my research question? State it precisely.

  2. What type of question is it? Description, prediction, causation, or mechanism?

  3. What variation identifies the answer? Where does the informative variation come from?

  4. What assumptions am I making? List them explicitly.

  5. Are my assumptions defensible? Can I convince a skeptic?

  6. What would falsify my hypothesis? What pattern would make me update my beliefs?

  7. How general are my findings? Who do they apply to?


Practical Guidance

Choosing a Research Design

Situation | Design Options | Key Considerations
--- | --- | ---
Can randomize | Experiment | Power, ethics, external validity
Policy creates discontinuity | RDD | Manipulation, bandwidth, local effect
Policy varies across units/time | DiD | Parallel trends, staggered timing
Exogenous shock available | IV | Exclusion, relevance, complier interpretation
None of the above | Selection on observables with sensitivity analysis | Confounding, proxy quality

Common Pitfalls

Pitfall 1: Method-Driven Research

Choosing a fancy method and then looking for a question it can answer. Methods should serve questions, not the reverse.

How to avoid: Start with the question. What would convince a skeptic? Then choose the design.

Pitfall 2: Underestimating Assumptions

Treating identification assumptions as technicalities. They are not. They are the foundation on which causal claims rest.

How to avoid: State assumptions explicitly. Defend them substantively. Report sensitivity to violations.

Pitfall 3: Overstating Precision

Reporting narrow confidence intervals that ignore model uncertainty, specification uncertainty, and publication bias.

How to avoid: Report results across specifications. Acknowledge uncertainty honestly. Distinguish statistical from substantive significance.

Pitfall 4: Ignoring External Validity

Achieving clean identification in a narrow context and then generalizing without justification.

How to avoid: Characterize your study population. Discuss generalizability. Triangulate with other evidence.


The Limits of Quantification

Quantitative methods dominate this book, but they do not have a monopoly on evidence. As Ross (2014) notes, economic patterns are often "institutionally specific." While physics laws are universal, economic regularities depend on the rules of the game—laws, norms, and culture.

Qualitative methods—case studies, interviews, ethnography, process tracing—are not just "storytelling"; they are rigorous tools for mapping the "boundary conditions" of these patterns. They tell us where the regression coefficients apply. Without this qualitative map of the institutional terrain, econometric estimates are just correlations floating in a void, liable to fail as soon as the context shifts.

Qualitative approaches offer distinct advantages:

Depth over breadth: Qualitative work can explore mechanisms, context, and meaning that surveys miss.

Discovery: Qualitative research excels at generating hypotheses and identifying patterns worth testing quantitatively.

External validity: Detailed case studies can illuminate whether and why findings from one context transfer to another.

Interpretation: Numbers require interpretation. Qualitative knowledge informs what quantities mean.

When to Combine Methods

Mixed-methods research combines quantitative and qualitative approaches. Common designs:

Sequential exploratory: Qualitative work generates hypotheses; quantitative work tests them.

Sequential explanatory: Quantitative work establishes effects; qualitative work explains mechanisms.

Concurrent: Both approaches investigate the same question simultaneously, with findings compared and integrated.

This book focuses on quantitative methods, but Chapter 23 returns to integration strategies in detail.


Summary

Key takeaways:

  1. Empirical research answers different types of questions—description, prediction, causation, mechanism—each with distinct requirements.

  2. Causal inference faces a fundamental problem: counterfactuals are unobservable. Progress requires assumptions that data cannot fully verify.

  3. The credibility revolution prioritized research design over statistical technique, raising standards for causal claims while highlighting tradeoffs with external validity.

  4. The structural/reduced-form debate reflects genuine tradeoffs but is often overstated. The approaches are complements, not substitutes.

  5. Good empirical work starts with questions, not methods. Match the design to the question; make assumptions explicit; acknowledge uncertainty honestly.

Returning to the opening question: We can learn a great deal from data about the social world—patterns, relationships, effects of interventions. But every causal claim rests on assumptions about the data-generating process. The empirical enterprise is not about eliminating assumptions but about making them transparent, defending them rigorously, and understanding what follows if they fail.


Reproducibility and the Replication Crisis

The credibility of empirical research depends not only on good research design but on whether findings hold up when examined closely. Since 2010, a "replication crisis" has shaken several fields, revealing that many published findings do not replicate.

What happened: High-profile replication projects found troubling patterns:

  • The Open Science Collaboration (2015) replicated 100 psychology studies; only 36% showed statistically significant effects in the same direction as the original

  • Camerer et al. (2016, 2018) replicated laboratory experiments in economics and social-science experiments published in Nature and Science; replication rates were better (61-62%) but still concerning

  • Chang and Li (2022) found that only 38% of 67 papers from top economics journals could be successfully reproduced computationally

Why it matters: If findings do not replicate, the knowledge base is unreliable. Policy decisions, scientific theories, and research priorities may all be built on foundations that do not exist.

What causes replication failures:

  1. Publication bias: Journals favor significant results, so studies finding null effects go unpublished. The literature overrepresents false positives.

  2. P-hacking and specification search: With many possible analyses, researchers (consciously or unconsciously) report the one that "works."

  3. Small samples and low power: Underpowered studies that happen to find significant effects are likely overestimating true effect sizes.

  4. Hidden moderators: Effects may genuinely vary across contexts in ways not understood when the original study was published.

What is being done:

  • Pre-registration: Researchers commit to hypotheses and analysis plans before seeing data, limiting p-hacking

  • Replication packages: Leading journals now require code and data sufficient to reproduce published results

  • Registered reports: Some journals review and accept papers based on research design before results are known

  • Multi-site studies: Coordinated replications across many sites estimate effect heterogeneity directly

For Students and Early-Career Researchers

The replication crisis offers both a warning and an opportunity. The warning: do not trust any single study, no matter how prestigious the journal or author. The opportunity: reproducible, transparent research is increasingly valued. Building good habits now—documenting your code, pre-registering analyses, sharing replication materials—will serve you well.

This book emphasizes transparent research practices throughout. Chapter 4 covers project organization and version control, while Chapter 26 addresses documentation and replication packages in detail.


Further Reading

Essential

  • Angrist & Pischke (2010). "The Credibility Revolution in Empirical Economics." JEP. The manifesto of the credibility revolution.

  • Freedman (1991). "Statistical Models and Shoe Leather." Sociological Methodology. A bracing reminder that technique cannot substitute for substance.

For Deeper Understanding

  • Leamer (1983). "Let's Take the Con Out of Econometrics." AER. The critique that sparked reform.

  • Heckman (2000). "Causal Parameters and Policy Analysis in Economics." QJE. A thoughtful perspective on identification and economic structure.

  • Deaton (2010). "Instruments, Randomization, and Learning about Development." JEL. A critical assessment of the credibility revolution.

  • Bates & Glennerster (2017). "The Generalizability Puzzle." Stanford Social Innovation Review. Argues that external validity requires a theory of mechanisms and extensive replication.

On the Structural/Reduced-Form Debate

  • Angrist & Pischke (2010). "The Credibility Revolution in Empirical Economics." Defense of reduced-form.

  • Reiss & Wolak (2007). "Structural Econometric Modeling." Handbook of Econometrics. Defense of structural.

  • Nevo & Whinston (2010). "Taking the Dogma Out of Econometrics." JEP. A balanced perspective.

  • Hull (2025). "Design-Based Econometrics." Lecture Notes. A modern synthesis arguing that design-based inference is a form of modeling.

Philosophy of Science

  • Ross (2014). Philosophy of Economics. Springer. A naturalistic defense of economics as a science of "real patterns" that requires coordinating evidence across multiple levels.

  • Cartwright (2007). Hunting Causes and Using Them. Philosophical foundations of causal inference.

  • Morgan & Winship (2015). Counterfactuals and Causal Inference. Comprehensive treatment of the potential outcomes framework.

Replication and Reproducibility

  • Open Science Collaboration (2015). "Estimating the Reproducibility of Psychological Science." Science. The landmark replication project that catalyzed reform.

  • Camerer et al. (2016). "Evaluating Replicability of Laboratory Experiments in Economics." Science. Replication of experimental economics studies.

  • Chang & Li (2022). "A Pre-analysis Plan to Replicate Sixty Economics Research Papers That Worked Half of the Time." AER. Systematic replication in economics.

  • Christensen & Miguel (2018). "Transparency, Reproducibility, and the Credibility of Economics Research." JEL. Comprehensive review of reproducibility practices.


Exercises

Conceptual

  1. A researcher claims that because her instrument is "as good as random," no assumptions are required for valid inference. Explain what is wrong with this claim. What assumptions does IV require even with a random instrument?

  2. Consider the question "Does social media use cause depression among teenagers?"

    • What makes this a causal question?

    • What is the counterfactual of interest?

    • What are three ways you might try to answer it, and what assumptions would each require?

Applied

  1. Find a recent empirical paper in a top journal in your field. Identify:

    • The main research question (is it descriptive, predictive, causal, or about mechanism?)

    • The identification strategy

    • The key identifying assumptions

    • What evidence (if any) supports those assumptions

    • What would happen to the conclusions if the assumptions failed

  2. The returns to education literature includes IV studies using (a) draft lottery, (b) compulsory schooling laws, (c) college proximity, and (d) twins. For each, describe the instrument and explain why different instruments might give different estimates even if all are valid.

Discussion

  1. Deaton (2010) argues that randomized experiments have been oversold as a solution to identification problems. Summarize his critique and evaluate it. Where do you agree? Where do you disagree? How should empirical researchers respond?

