Chapter 24: Evidence Synthesis
Opening Question
With dozens of studies on the same question often reaching different conclusions, how should we combine evidence to draw reliable inferences?
Chapter Overview
A mature research literature contains not one study but many, each with different samples, methods, and results. The minimum wage literature includes hundreds of studies. The returns to education literature spans decades and dozens of countries. The microfinance literature now includes multiple RCTs across different contexts. How should we combine this evidence?
This chapter examines formal methods for synthesizing evidence across studies. Meta-analysis provides tools for pooling estimates while accounting for heterogeneity. Systematic review offers structured approaches for comprehensively evaluating literature. And modern tools---specification curves, multiverse analysis, pre-registration---help address the replication crisis by making research more transparent and robust.
The core insight is that naive synthesis (counting studies, simple averaging) can mislead. Publication bias distorts what gets published. Study heterogeneity means not all estimates address the same question. Quality differences mean not all studies deserve equal weight. Careful synthesis methods address these challenges.
What you will learn:
How to conduct and interpret meta-analyses
How to detect and correct for publication bias
How to design and execute systematic reviews
How specification curves and multiverse analysis reveal researcher degrees of freedom
Prerequisites: Familiarity with regression methods (Chapter 3), identification strategies (Chapters 9-17)
Historical Context: The Rise of Evidence Synthesis
Meta-analysis---the statistical combination of results from multiple studies---was pioneered by Gene Glass in psychology in the 1970s. Glass coined the term in 1976, defining meta-analysis as "the statistical analysis of a large collection of analysis results from individual studies for the purpose of integrating the findings."
In medicine, the Cochrane Collaboration (founded 1993) institutionalized systematic review, developing rigorous protocols that became gold standards for evidence-based medicine. The Campbell Collaboration (founded 2000) extended these methods to social science.
Economics came to meta-analysis relatively late. Stanley and Jarrell's 1989 "Meta-Regression Analysis" introduced economists to the approach, but widespread adoption came in the 2000s. The credibility revolution initially emphasized single-study identification over accumulation, but recent years have seen renewed interest in synthesis---partly driven by concerns about replication failures and partly by the accumulation of multiple well-identified studies on important questions.
The replication crisis that emerged in psychology in the 2010s (Open Science Collaboration 2015) spurred new tools: pre-registration, registered reports, specification curves, and multiverse analysis. Economics has adapted these tools while debating their applicability to economics' different research context.
24.1 Meta-Analysis Basics
The Logic of Pooling
Meta-analysis combines estimates from multiple studies to produce a summary effect size. The basic logic is simple: averaging reduces sampling error. If each study provides a noisy estimate of a true effect, combining them should yield a more precise estimate.
Definition 24.1: Meta-Analytic Estimate. Given $k$ studies with effect estimates $\hat{\theta}_i$ and standard errors $\sigma_i$, a weighted average estimator is: $\hat{\theta}_{MA} = \frac{\sum_{i=1}^{k} w_i \hat{\theta}_i}{\sum_{i=1}^{k} w_i}$, where the weights $w_i$ are typically inverse-variance weights: $w_i = 1/\sigma_i^2$.
Intuition: More precise studies get more weight. A study with a standard error of 0.05 gets four times the weight of a study with standard error 0.10.
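A minimal sketch of these weights in Python, using the two hypothetical standard errors from the intuition above:

```python
import numpy as np

# Inverse-variance weights: a study with SE 0.05 gets four times
# the weight of a study with SE 0.10
ses = np.array([0.05, 0.10])
weights = 1 / ses**2

print(weights)                  # [400. 100.]
print(weights[0] / weights[1])  # 4.0
```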
Fixed Effects vs. Random Effects
The choice of meta-analytic model depends on assumptions about heterogeneity across studies.
Fixed Effects Model. Assumes all studies estimate the same true effect $\theta$. Differences in estimates arise only from sampling error: $\hat{\theta}_i = \theta + \epsilon_i$, with $\epsilon_i \sim N(0, \sigma_i^2)$.
The fixed effects pooled estimate is: $\hat{\theta}_{FE} = \frac{\sum w_i \hat{\theta}_i}{\sum w_i}$, with $\mathrm{Var}(\hat{\theta}_{FE}) = \frac{1}{\sum w_i}$.
Random Effects Model. Allows the true effect to vary across studies: $\hat{\theta}_i = \theta_i + \epsilon_i$, with $\theta_i \sim N(\mu, \tau^2)$,
where $\mu$ is the average effect and $\tau^2$ is the between-study variance. The random effects pooled estimate is: $\hat{\theta}_{RE} = \frac{\sum w_i^* \hat{\theta}_i}{\sum w_i^*}$, with $w_i^* = \frac{1}{\sigma_i^2 + \tau^2}$.
Worked Example: Minimum Wage Meta-Analysis
Consider three minimum wage studies:
Study A: Elasticity = -0.05, SE = 0.03
Study B: Elasticity = -0.15, SE = 0.06
Study C: Elasticity = -0.08, SE = 0.04
Fixed Effects Calculation: $w_A = 1/0.03^2 = 1111$, $w_B = 1/0.06^2 = 278$, $w_C = 1/0.04^2 = 625$.
$\hat{\theta}_{FE} = \frac{1111(-0.05) + 278(-0.15) + 625(-0.08)}{1111 + 278 + 625} = \frac{-147.2}{2014} = -0.073$
$SE_{FE} = \sqrt{1/2014} = 0.022$
The pooled estimate is -0.073 with SE 0.022.
If between-study heterogeneity is substantial ($\tau^2 > 0$), random effects weights would give relatively more weight to smaller studies and yield a wider confidence interval.
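A short sketch reproducing the fixed effects calculation above, using the study values from the worked example:

```python
import numpy as np

estimates = np.array([-0.05, -0.15, -0.08])     # Studies A, B, C
ses = np.array([0.03, 0.06, 0.04])

w = 1 / ses**2                                  # inverse-variance weights (about 1111, 278, 625)
theta_fe = np.sum(w * estimates) / np.sum(w)    # fixed effects pooled estimate
se_fe = np.sqrt(1 / np.sum(w))

print(f"FE estimate: {theta_fe:.3f} (SE {se_fe:.3f})")   # about -0.073 (SE 0.022)
```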
Quantifying Heterogeneity
The $I^2$ statistic measures what fraction of the observed variance is due to true heterogeneity rather than sampling error:
$I^2 = \frac{Q - df}{Q} \times 100\%$
where $Q$ is Cochran's Q statistic, $Q = \sum w_i (\hat{\theta}_i - \hat{\theta}_{FE})^2$, and $df = k - 1$.
Interpretation guidelines (Higgins et al. 2003):
$I^2 < 25\%$: Low heterogeneity
$I^2$ between $25\%$ and $75\%$: Moderate heterogeneity
$I^2 > 75\%$: High heterogeneity
When heterogeneity is high, a single pooled estimate may be less meaningful than understanding what drives variation.
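A sketch computing Cochran's Q and $I^2$ for the three hypothetical studies from the worked example:

```python
import numpy as np

estimates = np.array([-0.05, -0.15, -0.08])
ses = np.array([0.03, 0.06, 0.04])

w = 1 / ses**2
theta_fe = np.sum(w * estimates) / np.sum(w)

Q = np.sum(w * (estimates - theta_fe)**2)     # Cochran's Q
df = len(estimates) - 1
I2 = max(0.0, (Q - df) / Q) * 100             # percent of variation due to true heterogeneity

print(f"Q = {Q:.2f}, I^2 = {I2:.0f}%")
```

With only three studies, Q and $I^2$ are themselves very imprecisely estimated, so this calculation is purely illustrative.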
Box: Critiques of Random Effects Meta-Analysis
Random effects models are standard practice, but face serious critiques:
1. The normality assumption is arbitrary
Random effects assumes true effects follow $\theta_i \sim N(\mu, \tau^2)$. But why should nature produce normally distributed treatment effects? The assumption is convenient, not justified. With few studies, this distributional choice strongly affects results.
2. "Average effect" may not exist
If studies target different populations, use different interventions, or measure different outcomes, what does the "average" effect mean? Pooling apples and oranges produces fruit salad, not insight.
3. Weights can be perverse
Random effects gives more weight to smaller, noisier studies than fixed effects does. If small studies are systematically different (e.g., due to publication bias), this amplifies bias.
4. Few-studies problem
Estimating $\tau^2$ requires multiple studies. With fewer than roughly 10 studies, $\tau^2$ is poorly estimated, and random effects confidence intervals can be badly miscalibrated.
When to use anyway: Despite these critiques, random effects remains useful when (a) you believe effects genuinely vary, (b) you have enough studies to estimate heterogeneity, and (c) you're honest that "the average effect" is a modeling construct.
The Shrinkage Formula
Random effects estimation shrinks each study toward the pooled mean. The amount of shrinkage depends on study precision and total heterogeneity:
$\hat{\theta}_i^{\text{shrunk}} = \lambda_i \hat{\theta}_i + (1 - \lambda_i)\hat{\theta}_{RE}$
where the shrinkage factor is:
$\lambda_i = \frac{\tau^2}{\tau^2 + \sigma_i^2}$
Interpretation:
When $\sigma_i^2$ is large (imprecise study): $\lambda_i$ is small → heavy shrinkage toward the pooled mean
When $\tau^2$ is large (high heterogeneity): $\lambda_i$ is large → less shrinkage, trust individual studies
When $\tau^2 = 0$ (no heterogeneity): $\lambda_i = 0$ → complete shrinkage to the pooled (fixed effects) estimate
Example: Study A has $\hat{\theta}_A = 0.30$ with $\sigma_A^2 = 0.04$. The pooled estimate is $\hat{\theta}_{RE} = 0.15$ with $\tau^2 = 0.02$.
$\lambda_A = \frac{0.02}{0.02 + 0.04} = 0.33$
$\hat{\theta}_A^{\text{shrunk}} = 0.33(0.30) + 0.67(0.15) = 0.20$
The shrunk estimate (0.20) is pulled toward the mean, reflecting skepticism about extreme values from noisy studies.
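A small sketch of the same shrinkage calculation:

```python
# Shrinkage of Study A toward the pooled random effects mean (values from the example)
theta_a, var_a = 0.30, 0.04     # Study A estimate and its sampling variance
theta_re, tau2 = 0.15, 0.02     # pooled mean and between-study variance

lam = tau2 / (tau2 + var_a)                           # shrinkage factor, about 0.33
theta_shrunk = lam * theta_a + (1 - lam) * theta_re   # about 0.20

print(f"lambda = {lam:.2f}, shrunk estimate = {theta_shrunk:.2f}")
```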

Bayesian Hierarchical Models
A powerful alternative to frequentist random effects is Bayesian hierarchical modeling, which treats study-specific effects as draws from a common distribution:
$\theta_i \sim N(\mu, \tau^2), \qquad \hat{\theta}_i \mid \theta_i \sim N(\theta_i, \sigma_i^2)$
This framework offers several advantages:
Shrinkage: Study estimates are pulled toward the overall mean, with more shrinkage for less precise studies
Uncertainty quantification: Posterior distributions capture uncertainty about both individual study effects and the overall mean
Prediction: Can predict effects in new contexts, directly addressing external validity
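A minimal sketch of such a hierarchical model in PyMC, assuming PyMC is installed. The study estimates, standard errors, and priors below are illustrative choices, not the specification used by Meager (2019):

```python
import numpy as np
import pymc as pm

# Hypothetical effect estimates and standard errors from seven studies
est = np.array([0.02, 0.10, -0.01, 0.05, 0.08, 0.03, 0.06])
se = np.array([0.04, 0.06, 0.03, 0.05, 0.07, 0.04, 0.05])

with pm.Model() as hierarchical_meta:
    mu = pm.Normal("mu", mu=0.0, sigma=1.0)        # average effect across contexts
    tau = pm.HalfNormal("tau", sigma=0.5)          # between-study standard deviation
    theta = pm.Normal("theta", mu=mu, sigma=tau, shape=len(est))   # study-specific true effects
    pm.Normal("obs", mu=theta, sigma=se, observed=est)             # sampling-error layer

    # Effect in a hypothetical new site: the external-validity prediction
    theta_new = pm.Normal("theta_new", mu=mu, sigma=tau)

    idata = pm.sample(2000, tune=2000, target_accept=0.9, random_seed=1)

print(float(idata.posterior["mu"].mean()), float(idata.posterior["tau"].mean()))
```

The posterior for theta shows how noisy studies are shrunk toward mu, and the posterior for theta_new is the predictive distribution for a new context.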
Worked Example: Bayesian Meta-Analysis of Microcredit RCTs (Meager 2019)
Meager (2019) applied Bayesian hierarchical models to seven randomized microcredit evaluations---the same studies discussed in Chapter 10. Her analysis yielded several important insights:
Decomposing heterogeneity: The seven studies appeared to show substantial variation in effects. But how much was true heterogeneity versus sampling error? Meager's hierarchical model estimated that roughly 60% of observed cross-study variation was sampling error. The underlying effects were more similar than they appeared.
Pooling information: The Bayesian framework pooled information across studies while allowing for genuine heterogeneity. Studies with smaller samples were shrunk more toward the overall mean, appropriately discounting imprecise estimates.
External validity: The model generated a predictive distribution for effects in a hypothetical new site. This directly addresses the policy question: "What effect should we expect if we expand microcredit to a new country?" The answer incorporated both the estimated average effect and the estimated variability across contexts.
Joint estimation: Rather than analyzing each outcome separately, Meager jointly modeled multiple outcomes (household expenditure, business profits, consumption), capturing correlations across outcomes and studies.
The analysis concluded that microcredit expansions have modest positive effects on average, with household expenditure increasing by about 5%, but that effects are unlikely to be transformative in any setting. This nuanced conclusion---neither "microcredit works" nor "microcredit doesn't work"---exemplifies what good meta-analysis can deliver.
Meta-Regression
Meta-regression extends meta-analysis to model heterogeneity:
$\hat{\theta}_i = \beta_0 + \beta_1 X_{1i} + \dots + \beta_p X_{pi} + u_i + \epsilon_i$
where the $X_{ki}$ are study-level characteristics (sample size, method, time period, country) and $u_i$ captures residual between-study variance.
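A sketch of a basic meta-regression as weighted least squares in statsmodels, with hypothetical study-level data and inverse-variance weights (a full random effects meta-regression would also add the residual between-study variance to the weights):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical study-level data: effects, standard errors, and two moderators
df = pd.DataFrame({
    "effect": [-0.05, -0.15, -0.08, -0.02, -0.11, -0.04],
    "se":     [0.03, 0.06, 0.04, 0.05, 0.07, 0.03],
    "teen":   [1, 1, 0, 0, 1, 0],   # 1 = teen employment outcome
    "us":     [1, 0, 1, 1, 0, 1],   # 1 = US study
})

# Inverse-variance weighted regression of effect sizes on study characteristics
meta_reg = smf.wls("effect ~ teen + us", data=df, weights=1 / df["se"]**2).fit()
print(meta_reg.params)
```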
Example: Why Do Minimum Wage Studies Differ?
Doucouliagos and Stanley (2009) use meta-regression to explain heterogeneity in minimum wage studies:
Study characteristics as moderators:
Teen vs. total employment
US vs. international
Before vs. after Card-Krueger
Publication status (journal vs. working paper)
Methodological approach (time series vs. panel vs. quasi-experimental)
They find that once publication selection and methodological differences are accounted for, the corrected employment effect is small and close to zero, rather than the large negative effects reported in parts of the literature.
24.2 Publication Bias
The Problem
Publication bias occurs when studies with statistically significant or theoretically expected results are more likely to be published than studies with null or unexpected results. This systematically distorts the published literature.
Definition 24.2: Publication Bias. A systematic tendency for the selection of studies into the published literature to depend on their results, leading to a non-representative sample of all studies conducted.
Mechanisms:
File drawer problem: Researchers don't submit null results
Editorial preferences: Journals prefer significant findings
Reader interest: Significant findings get more citations
P-hacking: Researchers select specifications that achieve significance
Detection Methods
Funnel Plot. Plot the effect size against precision (1/SE). Under no publication bias, the plot should be symmetric around the true effect. Asymmetry suggests publication bias.
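A sketch of a funnel plot in matplotlib, using simulated studies in which significant negative results are preferentially "published" to illustrate the resulting asymmetry:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
true_effect = -0.05
ses = rng.uniform(0.02, 0.12, size=300)       # study standard errors
est = true_effect + rng.normal(0, ses)        # unbiased estimates around the true effect

# Crude publication rule: keep significant negative results, plus a random 20% of the rest
published = (est / ses < -1.96) | (rng.uniform(size=300) < 0.2)

plt.scatter(est[published], 1 / ses[published], s=12)
plt.axvline(true_effect, linestyle="--")
plt.xlabel("Effect estimate")
plt.ylabel("Precision (1/SE)")
plt.title("Funnel plot under selective publication")
plt.show()
```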

Egger's Test. Regress the standardized effect size on precision: $\frac{\hat{\theta}_i}{\sigma_i} = \beta_0 + \beta_1 \frac{1}{\sigma_i} + \epsilon_i$
If $\beta_0 \neq 0$, the funnel plot is asymmetric, suggesting publication bias.
FAT-PET-PEESE. Stanley and Doucouliagos's approach:
FAT (Funnel Asymmetry Test): $\hat{\theta}_i = \beta_0 + \beta_1 \sigma_i + \epsilon_i$
If $\beta_1 \neq 0$, publication bias is present. The PET (Precision-Effect Test) examines the intercept: $\beta_0$ provides a bias-corrected effect under the assumption that bias is proportional to the standard error.
PEESE (Precision-Effect Estimate with Standard Error): $\hat{\theta}_i = \beta_0 + \beta_1 \sigma_i^2 + \epsilon_i$
Replaces the standard error with its square; typically used when the PET indicates a nonzero underlying effect.
Worked Example: Testing for Publication Bias
Suppose we have 20 minimum wage studies. Regressing estimates on standard errors yields: $\hat{\theta}_i = -0.02 + 0.8 \times \sigma_i$
The coefficient 0.8 on SE is significant (p = 0.02), indicating publication bias. The intercept -0.02 represents the bias-corrected effect---small negative but close to zero, rather than the -0.10 average across published studies.
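A sketch of the FAT/PET regression in statsmodels on hypothetical study data (an Egger-style test is the same regression rescaled):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical study-level effects and standard errors
df = pd.DataFrame({
    "effect": [-0.02, -0.05, -0.12, -0.08, -0.15, -0.03, -0.10, -0.06],
    "se":     [0.02, 0.03, 0.06, 0.05, 0.08, 0.02, 0.06, 0.04],
})

# FAT/PET: the slope on SE tests for funnel asymmetry; the intercept is the bias-corrected effect
fat_pet = smf.wls("effect ~ se", data=df, weights=1 / df["se"]**2).fit()
print(fat_pet.params, fat_pet.pvalues)

# PEESE variant: replace SE with SE squared
df["se2"] = df["se"]**2
peese = smf.wls("effect ~ se2", data=df, weights=1 / df["se"]**2).fit()
print(peese.params)
```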
Correction Approaches
Trim-and-Fill. Imputes "missing" studies to make the funnel plot symmetric, then re-estimates the pooled effect.
Selection Models. Explicitly model the selection process. Hedges and Vevea (1996) and Andrews and Kasy (2019) develop selection models for meta-analysis.
p-Curve Analysis. Examines the distribution of statistically significant p-values. If effects are real, the distribution should be right-skewed (most p-values close to zero). If results are driven by p-hacking, p-values cluster just below 0.05.
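A small sketch of the p-curve idea: among significant results, check whether p-values pile up near zero (evidential value) or just below 0.05 (consistent with p-hacking). The p-values below are hypothetical:

```python
import numpy as np

# Hypothetical p-values from published, statistically significant results
pvals = np.array([0.001, 0.004, 0.010, 0.021, 0.032, 0.041, 0.044, 0.048])

sig = pvals[pvals < 0.05]
counts, _ = np.histogram(sig, bins=[0, 0.01, 0.02, 0.03, 0.04, 0.05])
print(dict(zip(["<.01", ".01-.02", ".02-.03", ".03-.04", ".04-.05"], counts)))

# Simple right-skew check: share of significant p-values below 0.025
print(f"Share below 0.025: {np.mean(sig < 0.025):.2f}")
```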
Limitations
All publication bias methods rely on untestable assumptions:
Funnel plots assume bias operates through imprecision
Selection models assume specific functional forms for selection
Funnel-asymmetry regressions (FAT/PET) assume publication bias is the only source of asymmetry
Using these methods responsibly requires honesty about this uncertainty.
24.3 Systematic Review
What Makes a Review "Systematic"?
A systematic review differs from a narrative literature review in being:
Comprehensive: Attempts to find all relevant studies
Transparent: Documents search and selection criteria
Reproducible: Another researcher could replicate the search
Structured: Uses predetermined protocols
The PRISMA Framework
PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) provides reporting standards:
Key elements:
Protocol: Pre-specified search strategy, eligibility criteria, analysis plan
Search: Multiple databases, grey literature, reference lists
Selection: Documented screening process with inclusion/exclusion counts
Assessment: Quality/risk of bias evaluation for included studies
Synthesis: Quantitative (meta-analysis) or qualitative summary

Quality Assessment
Not all studies deserve equal weight. Systematic reviews assess study quality/risk of bias:
For RCTs (Cochrane Risk of Bias tool):
Random sequence generation
Allocation concealment
Blinding
Incomplete outcome data
Selective reporting
For observational studies (Newcastle-Ottawa Scale):
Selection of study groups
Comparability
Outcome ascertainment
For quasi-experiments in economics:
Identification strategy credibility
Data quality
Pre-trends/placebo tests
Robustness across specifications
Example: Systematic Review of Microfinance Impacts
Duvendack et al. (2011) conducted an influential systematic review of microfinance impacts:
Search strategy:
15 databases searched
Grey literature from development organizations
Reference lists of included studies
Authors contacted for unpublished work
Eligibility criteria:
Quantitative impact studies
Microfinance (credit) as intervention
Outcome: poverty, income, consumption, or welfare
Control or comparison group
Results:
35,000 records screened
170 full texts assessed
58 studies included
Quality generally poor (few controlled studies, selection bias common)
Conclusion: Evidence on microfinance impact was surprisingly weak before the recent RCTs, despite enthusiastic claims in policy circles.
24.4 Replication and Robustness
The Replication Crisis
The Open Science Collaboration (2015) attempted to replicate 100 psychology studies. Only 36% replicated successfully. Similar concerns arose in economics:
Camerer et al. (2016): 11 of 18 experimental economics studies replicated
Chang and Li (2022): 29% of economics papers fully reproducible
Brodeur et al. (2016): Bunching of test statistics just above significance thresholds
This motivated new approaches to robustness and transparency.
Robustness Tools for Primary Research
Several tools help individual researchers make their analysis more robust:
Specification curves display how results vary across all defensible analytical choices
Multiverse analysis extends this to data processing choices
Pre-registration commits to an analysis plan before seeing results
These tools improve primary research quality, which in turn improves evidence synthesis. See Chapter 25 (Section 25.2) for detailed treatment of specification curves, multiverse analysis, and pre-registration in the context of research practice.
Pre-Registration and Publication Bias
Pre-registration commits researchers to an analysis plan before seeing results:
Standard pre-registration:
Research question
Hypotheses
Sample and data
Variables and measures
Analysis plan (primary specifications)
Registered reports:
Peer review before data collection
Publication commitment regardless of results
Eliminates publication bias at source
Pre-Registration in Economics
AEA RCT Registry (2013-present) provides registration for experiments. OSF and EGAP offer broader registration services.
Benefits:
Distinguishes confirmatory from exploratory analysis
Reduces p-hacking and HARKing (Hypothesizing After Results are Known)
Improves reproducibility
Limitations and debates:
Economics is often observational (can't pre-register before data exist)
Exploration is valuable and shouldn't be discouraged
Pre-registration doesn't eliminate all researcher degrees of freedom
Balanced approach:
Pre-register primary specifications where feasible
Clearly distinguish confirmatory from exploratory
Report specification curves for robustness
Focus on effect sizes and confidence intervals, not just significance
Practical Guidance
When to Conduct Meta-Analysis
| Situation | Meta-analysis appropriate? | Notes |
| --- | --- | --- |
| Multiple studies on the same question | Yes | Core application |
| Studies use the same or comparable outcome measures | Yes | Effect sizes are comparable |
| High heterogeneity across studies | Maybe | May be more useful to explain heterogeneity than to pool |
| Studies have fundamental design differences | Caution | May be combining apples and oranges |
| Severe publication bias | Caution | Pooling biased estimates yields a biased pool |
Common Pitfalls
Pitfall 1: Garbage In, Garbage Out. Meta-analysis cannot correct for problems in the underlying studies. Pooling biased studies yields a more precise but still biased estimate.
How to avoid: Assess study quality. Consider sensitivity analyses excluding low-quality studies.
Pitfall 2: Comparing Incomparable Estimates. Studies may estimate different parameters (LATE vs. ATE, short-run vs. long-run). Pooling them conflates different quantities.
How to avoid: Carefully define what each study estimates. Use meta-regression to model differences.
Pitfall 3: Ignoring Heterogeneity. When $I^2$ is high, a single pooled estimate may be misleading.
How to avoid: Report and explain heterogeneity. Consider whether pooled estimate is meaningful.
Pitfall 4: Over-Correcting for Publication Bias. Publication bias corrections rely on strong assumptions. Aggressive correction can introduce new biases.
How to avoid: Report multiple methods. Treat corrections as sensitivity analyses.
Implementation Checklist
For meta-analysis:
Define the question and eligibility criteria before searching
Search comprehensively and document the search (PRISMA)
Extract comparable effect sizes and standard errors
Assess study quality and risk of bias
Report fixed and random effects estimates, with heterogeneity statistics ($I^2$, $\tau^2$)
Test for publication bias using more than one method
Use meta-regression to explore study-level moderators
For robustness analysis:
Pre-register primary specifications where feasible
Clearly distinguish confirmatory from exploratory analysis
Report specification curves across defensible analytical choices
Focus on effect sizes and confidence intervals, not just significance
Qualitative Bridge
Qualitative Synthesis Methods
While this chapter focuses on quantitative synthesis, qualitative systematic reviews also exist:
Qualitative evidence synthesis:
Thematic synthesis
Meta-ethnography
Framework synthesis
These methods systematically combine findings from qualitative studies, looking for common themes, contradictions, and explanatory patterns.
Mixed-Methods Synthesis
Increasingly, systematic reviews combine quantitative and qualitative evidence:
EPPI-Centre approach:
Quantitative studies → meta-analysis of effects
Qualitative studies → synthesis of implementation, mechanisms
Integration → what works, for whom, and why
This mirrors the triangulation discussed in Chapter 23, applied to synthesis rather than primary research.
Integration Note
Connections to Other Methods
| Method | Connection | Chapter |
| --- | --- | --- |
| Triangulation | Meta-analysis as a formal triangulation method | Ch. 23 |
| Heterogeneity analysis | Meta-regression parallels HTE analysis | Ch. 20 |
| Bayesian methods | Bayesian hierarchical models for pooling | Ch. 3 |
| Machine learning | ML for study selection and coding | Ch. 21 |
From Single Studies to Cumulative Knowledge
Evidence synthesis represents the culmination of empirical research. Individual studies provide pieces; synthesis assembles the puzzle. But synthesis is only as good as the underlying studies, emphasizing why careful research practice (Chapter 25) matters throughout the research process.
Summary
Key takeaways:
Meta-analysis provides formal methods for combining estimates across studies, with random effects models accounting for heterogeneity and meta-regression explaining variation.
Publication bias systematically distorts the literature. Detection methods (funnel plots, Egger's test, FAT-PET-PEESE) and corrections exist but rely on strong assumptions.
Specification curves and multiverse analysis reveal researcher degrees of freedom and test robustness across analytical choices. Pre-registration commits researchers to analysis plans before seeing results.
Returning to the opening question: Combining evidence across studies requires more than simple averaging. Meta-analysis provides precision-weighted pooling. Publication bias assessment reveals what may be missing. Systematic review ensures comprehensive coverage. And robustness tools show how conclusions depend on analytical choices. Used together, these methods allow us to draw more reliable inferences than any single study---or any casual literature review---can provide.
Further Reading
Essential
Borenstein, M., L. Hedges, J. Higgins, and H. Rothstein (2009). "Introduction to Meta-Analysis." Wiley.
Stanley, T.D. and H. Doucouliagos (2012). "Meta-Regression Analysis in Economics and Business." Routledge.
For Deeper Understanding
Higgins, J. and S. Green, eds. (2011). "Cochrane Handbook for Systematic Reviews of Interventions." Cochrane Collaboration.
Andrews, I. and M. Kasy (2019). "Identification of and Correction for Publication Bias." American Economic Review.
Advanced/Specialized
Meager, R. (2019). "Understanding the Average Impact of Microcredit Expansions." AEJ: Applied. [Bayesian hierarchical approach]
Simonsohn, U., J. Simmons, and L. Nelson (2020). "Specification Curve Analysis." Nature Human Behaviour.
Applications
Doucouliagos, H. and T.D. Stanley (2009). "Publication Selection Bias in Minimum-Wage Research?" British Journal of Industrial Relations.
Meager, R. (2019). "Understanding the Average Impact of Microcredit Expansions: A Bayesian Hierarchical Analysis of Seven Randomized Experiments." AEJ: Applied. Exemplary Bayesian hierarchical meta-analysis demonstrating joint estimation, heterogeneity decomposition, and external validity prediction.
Brodeur, A., M. Le, M. Sangnier, and Y. Zylberberg (2016). "Star Wars: The Empirics Strike Back." AEJ: Applied.
Exercises
Conceptual
Explain why random effects meta-analysis gives relatively more weight to smaller studies compared to fixed effects. When would this be desirable, and when might it be problematic?
A meta-analysis finds a pooled effect of 0.3 with a narrow confidence interval, but $I^2 = 85\%$. How should you interpret this result?
Applied
You are conducting a meta-analysis of studies on the effect of class size on student achievement. List five study-level characteristics you would include in a meta-regression to explain heterogeneity. For each, explain what pattern you would expect and why.
Create a specification curve for a simple analysis. Using any publicly available dataset, identify at least 4 binary analytical choices (e.g., include/exclude outliers, log vs. level, with/without controls). Run all 16 combinations and plot the results. What do you conclude about robustness?
Discussion
Pre-registration has been controversial in economics. What are the strongest arguments for and against requiring pre-registration for observational studies using administrative data?
Appendix 24A: Meta-Analysis Formulas
Fixed Effects Variance
$\mathrm{Var}(\hat{\theta}_{FE}) = \frac{1}{\sum_{i=1}^{k} w_i} = \frac{1}{\sum_{i=1}^{k} 1/\sigma_i^2}$
Estimating Between-Study Variance (τ2)
DerSimonian-Laird estimator: $\hat{\tau}^2 = \max\left(0,\ \frac{Q - (k - 1)}{c}\right)$
where: $c = \sum w_i - \frac{\sum w_i^2}{\sum w_i}$
Random Effects Variance
$\mathrm{Var}(\hat{\theta}_{RE}) = \frac{1}{\sum_{i=1}^{k} w_i^*} = \frac{1}{\sum_{i=1}^{k} 1/(\sigma_i^2 + \tau^2)}$
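A sketch implementing the DerSimonian-Laird estimator and the resulting random effects pooled estimate, using the three studies from the worked example in Section 24.1:

```python
import numpy as np

estimates = np.array([-0.05, -0.15, -0.08])
ses = np.array([0.03, 0.06, 0.04])

w = 1 / ses**2
theta_fe = np.sum(w * estimates) / np.sum(w)
Q = np.sum(w * (estimates - theta_fe)**2)
k = len(estimates)
c = np.sum(w) - np.sum(w**2) / np.sum(w)

tau2 = max(0.0, (Q - (k - 1)) / c)              # DerSimonian-Laird estimate of tau^2

w_star = 1 / (ses**2 + tau2)                    # random effects weights
theta_re = np.sum(w_star * estimates) / np.sum(w_star)
se_re = np.sqrt(1 / np.sum(w_star))

print(f"tau^2 = {tau2:.5f}, RE estimate = {theta_re:.3f} (SE {se_re:.3f})")
```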
Egger's Test Statistic
Under the null of no asymmetry, $t = \frac{\hat{\beta}_0}{SE(\hat{\beta}_0)}$
follows a t-distribution with $k - 2$ degrees of freedom.