Solutions to Exercises

This document provides solutions and guidance for exercises in Empirical Methods for the Social Sciences.

Note to Instructors: This document is intended for instructor use. Consider providing selected solutions to students while reserving others for assessment.


Part I: Foundations

Chapter 1: The Empirical Enterprise

1.1 (Conceptual) The distinction between description and causation lies in counterfactual reasoning. Descriptive claims summarize observed patterns; causal claims assert what would happen under intervention. The same data can support strong descriptive claims while leaving causal questions unresolved.

1.2 (Conceptual) Internal validity concerns whether the causal claim holds for the study population. External validity concerns whether it generalizes to other populations. A lab experiment with random assignment has high internal but potentially low external validity. Observational studies of representative samples may have higher external but lower internal validity.

1.3 (Applied) Answers will vary by paper. Key evaluation criteria: Is the identification strategy explicit? Are assumptions testable? What threats remain?

1.4 (Applied) Growth accounting: capital deepening, labor force expansion, TFP growth. Quasi-experiments: SEZ establishment (DiD), one-child policy (RD elements). Case studies: township-village enterprises, dual-track pricing. Each method illuminates different aspects; none alone provides complete explanation.

1.5 (Discussion) The structural vs. reduced-form distinction is about goals, not quality. Structural models make theory explicit and allow counterfactual policy analysis beyond the data. Reduced-form methods make identification explicit and are robust to theory misspecification. Best practice often combines both: reduced-form for credible identification, structural for policy extrapolation.


Chapter 2: Data

2.1 (Conceptual) Validity: Does GDP measure economic welfare? Excludes home production, leisure, environmental quality. Reliability: National accounts may have measurement error, especially for informal sectors. Different measurement approaches (production, expenditure, income) should converge but often don't match exactly.

2.2 (Conceptual) Panel data: same units over time (advantages: control for time-invariant unobservables). Repeated cross-sections: different samples over time (advantages: no attrition, potentially larger samples). Key tradeoff: panel enables within-unit analysis but suffers attrition; RCS avoids attrition but can't track individuals.

2.3 (Applied) Answers depend on paper chosen. Look for: data source documentation, sample construction, variable definitions, handling of missing data, potential measurement issues acknowledged.

2.4 (Applied) Administrative data: high coverage, low response burden, limited variables, access restrictions. Survey data: richer measures, potential nonresponse bias, respondent burden. Linking allows validation and enrichment; challenges include matching errors and privacy.

2.5 (Applied) Key cleaning steps: handle missing values (document patterns), check for outliers (substantive judgment needed), verify coding (especially for categorical variables), create derived variables with clear documentation.

2.6 (Discussion) Tensions are real: privacy (individual harm from disclosure) vs. transparency (scientific verification) vs. access (equity in who can use data). Potential solutions: synthetic data, secure access rooms, tiered access, pre-registration of analyses.


Chapter 3: Statistical Foundations

3.1 (Conceptual) A parameter is a fixed (unknown) characteristic of the population. A statistic is a function of the sample data. The sample mean is a statistic because it's computed from data; the population mean is a parameter we're trying to estimate.

3.2 (Conceptual) P-values give P(data at least this extreme | null), not P(null | data). Neither p=0.08 nor p=0.001 tells us the probability the effect is real. The p=0.001 result provides stronger evidence against the null, but effect size and prior plausibility matter too.

3.3 (Conceptual) CLT: Regardless of population distribution, sample means become approximately normal as n grows (given finite variance). Implication: We can use normal-theory inference even for non-normal populations when samples are large enough.
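A minimal base R sketch (exponential draws chosen only as one skewed example) shows the distribution of the sample mean becoming more symmetric as n grows:

    # Simulate sample means from a skewed (exponential) population at several n
    set.seed(42)
    reps <- 5000
    for (n in c(5, 30, 200)) {
      means <- replicate(reps, mean(rexp(n, rate = 1)))
      skew <- mean((means - mean(means))^3) / sd(means)^3   # rough skewness of the sampling distribution
      cat(sprintf("n = %3d: skewness of sample means = %.2f\n", n, skew))
    }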

3.4 (Applied) Simulation should show ~95% coverage. Code example:
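A minimal sketch in base R, using normal draws with a known mean (the data-generating process is illustrative only):

    # Fraction of 95% t-intervals that cover the true mean
    set.seed(1)
    reps <- 10000; n <- 50; mu <- 2
    covered <- replicate(reps, {
      x <- rnorm(n, mean = mu, sd = 3)
      ci <- t.test(x)$conf.int          # default 95% confidence interval
      ci[1] <= mu && mu <= ci[2]
    })
    mean(covered)                       # should be close to 0.95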

3.5 (Applied) Robust SEs typically larger than classical SEs when heteroskedasticity present. Conclusions rarely change dramatically but inference is more honest.
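A hedged sketch of the comparison, assuming the sandwich and lmtest packages are installed (simulated data with error variance increasing in x):

    library(sandwich)
    library(lmtest)

    set.seed(2)
    n <- 500
    x <- runif(n)
    y <- 1 + 2 * x + rnorm(n, sd = 0.5 + 2 * x)      # heteroskedastic errors
    fit <- lm(y ~ x)

    coeftest(fit)                                    # classical SEs
    coeftest(fit, vcov = vcovHC(fit, type = "HC1"))  # robust (HC1) SEs, typically larger here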

3.6 (Discussion) Both sides have merit. Priors are explicit assumptions; classical methods embed implicit assumptions (e.g., flat priors in some interpretations). Key question: Is explicit specification of prior beliefs more or less honest than implicit assumptions?

3.7 (Discussion) "97% probability drug works" conflates p-value with posterior probability. Newspaper claim overstates certainty. Better: "The study found a statistically significant reduction of 2 mmHg (95% CI: 0.2 to 3.8). Whether this is clinically meaningful depends on context."

3.8 (Discussion) Ranking depends on prior plausibility. Study B (birth month/personality) has low prior plausibility despite small p-value—likely false positive. Study A (physics with theory) has high prior support. Study C (replication) updates existing evidence. Smaller p-values don't always indicate stronger evidence when priors differ dramatically.


Chapter 4: Programming Foundations

4.1 (Conceptual) Modifying raw data destroys the ability to verify the analysis chain. If errors are discovered later, there's no way to recover. Best practice: raw data is read-only; all transformations are documented in code.

4.2 (Conceptual) Version-in-filename is a poor substitute for version control. Recommend: Git repository with meaningful commits. The "final_ACTUAL" pattern indicates that the researcher has lost track of the true current version.

4.3 (Applied) Project should have: data/raw/, data/processed/, code/, output/, .gitignore, README.md. Git initialized with initial commit. Sample workflow documented.

4.4 (Applied) Reorganization should document: what was moved, what was renamed, what was deleted as redundant. README should explain project purpose and how to run.

4.5 (Discussion) Balance considerations: burden on researchers (especially those with limited resources) vs. benefits for verification and cumulative science. Possible middle ground: code availability required, data availability when legally possible with clear documentation of restrictions.


Part II: Description

Chapter 5: Survey Methods

5.1 (Conceptual) Simple random: equal probability for all units. Stratified: divide population into groups, sample within each. Cluster: sample groups, then all/sample within groups. Stratified improves precision for heterogeneous populations; cluster reduces costs when populations are dispersed.

5.2 (Conceptual) Satisficing: minimal cognitive effort responses. Acquiescence: tendency to agree. Social desirability: presenting favorably. Solutions: careful question wording, randomized response techniques, implicit measures, validation against behavior.

5.3 (Applied) AAPOR codes: Complete interviews, Partial, Refusal, Non-contact, Other. Response rate varies by definition; key is transparency about which definition used and comparison with benchmarks.

5.4 (Applied) Question problems might include: double-barreled questions, leading wording, response scale issues, unclear reference periods. Revisions should address specific identified problems.

5.5 (Applied) Mode effects may include: social desirability differences (more candid online), satisficing patterns, break-off rates. Analysis should document and explore these differences.

5.6 (Discussion) Declining response raises concerns about representativeness. However, coverage (who is in the frame) may matter more than response rates. Weight adjustments help but rely on assumptions about the missing-at-random mechanism.


Chapter 6: Describing Patterns

6.1 (Conceptual) Dimensions for data visualization: position (most accurate), length, angle, area, color intensity, color hue. Cleveland hierarchy ranks these by perceptual accuracy. Pie charts use angle (less accurate than length), which is why bar charts are often preferred.

6.2 (Conceptual) Dimension reduction trades interpretability for parsimony. PCA maximizes variance explained; factors may not be substantively meaningful. K-means assigns observations to clusters; cluster validity should be assessed.

6.3 (Applied) EDA should explore distributions (univariate), relationships (bivariate), and patterns (multivariate). Document surprises and data quality issues discovered.

6.4 (Applied) Topic modeling requires preprocessing decisions (stopwords, stemming, n-grams) that affect results. Validate by examining top words per topic and representative documents.

6.5 (Applied) Spatial autocorrelation (Moran's I) quantifies clustering. Mapping should show both the pattern and the uncertainty.

6.6 (Discussion) Visualization can clarify patterns but also mislead through poor design choices (truncated axes, inappropriate scales, cherry-picked views). Best practice: show the data fairly, including uncertainty.


Chapter 7: Time Series Foundations

7.1 (Conceptual) Stationarity means statistical properties don't change over time. Non-stationary series can produce spurious regressions (high R² between unrelated trends). First-differencing often induces stationarity.
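A base R sketch of the spurious-regression point, using two independent random walks:

    set.seed(3)
    T <- 200
    x <- cumsum(rnorm(T))                       # independent random walks
    y <- cumsum(rnorm(T))
    summary(lm(y ~ x))$r.squared                # often sizable despite no true relationship
    summary(lm(diff(y) ~ diff(x)))$r.squared    # near zero once both series are differenced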

7.2 (Conceptual) Autocorrelation: correlation of a series with its own lags. Partial autocorrelation: correlation with a given lag after removing the effects of intermediate lags. For an AR(p) process the ACF decays gradually and the PACF cuts off after lag p; for an MA(q) process the pattern is reversed.

7.3 (Applied) Decomposition should separate trend, seasonal, and irregular components. Interpretation should discuss each component's substantive meaning.

7.4 (Applied) ADF tests the null of unit root; KPSS tests the null of stationarity. Results may conflict near the boundary. Report both with appropriate caveats.

7.5 (Applied) Forecast evaluation: compute RMSE, MAE on holdout sample. Compare models using information criteria and forecast accuracy.
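A base R sketch of holdout evaluation, comparing an AR(1) forecast to a training-mean benchmark on simulated data:

    set.seed(4)
    y <- arima.sim(list(ar = 0.7), n = 120) + 10
    train <- y[1:100]; test <- y[101:120]

    fit_ar  <- arima(train, order = c(1, 0, 0))
    fc_ar   <- predict(fit_ar, n.ahead = 20)$pred   # AR(1) forecasts for the holdout
    fc_mean <- rep(mean(train), 20)                 # naive benchmark: training mean

    rmse <- function(e) sqrt(mean(e^2)); mae <- function(e) mean(abs(e))
    c(ar = rmse(test - fc_ar), naive = rmse(test - fc_mean))   # RMSE
    c(ar = mae(test - fc_ar),  naive = mae(test - fc_mean))    # MAE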

7.6 (Discussion) Granger causality is about predictive content, not causation. "X Granger-causes Y" means X helps predict Y beyond Y's own history. Does not imply X causes Y in the interventionist sense.


Part III: Causation

Chapter 9: The Causal Framework

9.1 (Conceptual) Random assignment ensures treatment is independent of potential outcomes: $D \perp (Y(1), Y(0))$. This eliminates selection bias. Still need SUTVA (no interference, no hidden versions of treatment).

9.2 (Conceptual) DAG should show: Class size ← Budget → Achievement, Class size ← Parental involvement → Ability → Achievement. Backdoor paths go through Budget and Parental involvement. Control for these. Ability is a mediator (not a confounder) if we're interested in total effect.

9.3 (Conceptual) Controlling for post-training job quality is bad control—it's a mediator/outcome of training. This blocks part of the causal effect and potentially introduces selection bias.

9.4 (Applied) Y(1): earnings if attend elite university. Y(0): earnings otherwise. Selection problem: students who attend differ in ability, motivation, family background. Strategies: (1) RD on admission threshold (identifies LATE at cutoff), (2) sibling comparisons (identify an ATT-type effect for families whose siblings attended different types of institution). Different complier populations yield different estimands.

9.5 (Applied) Analysis will depend on paper chosen. Look for: explicit statement of identifying assumptions, testable implications examined, threats to validity discussed.

9.6 (Discussion) Valid critique for simultaneous determination. DAGs assume recursive (one-way) causation. For equilibrium models, structural equation modeling or other approaches may be more appropriate. DAGs still useful for thinking about confounding even when timing is unclear.


Chapter 10: Experimental Methods

10.1 (Conceptual) ITT: effect of being assigned to treatment. TOT/LATE: effect of actually receiving treatment, for compliers. With noncompliance, ITT ≠ TOT. ITT is policy-relevant (what happens if we offer the program); LATE tells us about treatment efficacy for those who take it up.
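A worked base R sketch with simulated one-sided noncompliance, showing that the Wald/LATE estimate equals ITT divided by the first stage:

    set.seed(5)
    n <- 10000
    z <- rbinom(n, 1, 0.5)                 # random assignment (the offer)
    complier <- rbinom(n, 1, 0.6)          # 60% take up only if offered
    d <- z * complier                      # actual treatment receipt
    y <- 1 + 2 * d + rnorm(n)              # true treatment effect = 2

    itt         <- mean(y[z == 1]) - mean(y[z == 0])   # effect of assignment, ~ 2 * 0.6
    first_stage <- mean(d[z == 1]) - mean(d[z == 0])   # compliance rate, ~ 0.6
    c(ITT = itt, LATE = itt / first_stage)             # LATE ~ 2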

10.2 (Conceptual) Balance tests check whether randomization "worked" for observed covariates. Cannot verify balance on unobservables. With many covariates, expect some imbalance by chance (multiple testing). Pre-registration of primary outcomes and control strategy helps.

10.3 (Applied) Balance table should show means by treatment arm and p-values. Interpretation should discuss any imbalances and whether they affect conclusions.

10.4 (Applied) Power depends on effect size, variance, sample size, and significance level. Typical targets: 80% power at α=0.05. MDE rule of thumb: δ ≈ 2.8 × SE(τ̂); for a two-arm trial with n units per arm and outcome SD σ, δ ≈ 2.8σ√(2/n).
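The base R power.t.test function (stats) recovers essentially the same number; the sample size below is illustrative:

    # MDE for a two-arm trial with 250 units per arm, outcome SD 1, alpha = 0.05, power = 0.80
    power.t.test(n = 250, sd = 1, sig.level = 0.05, power = 0.80)$delta
    2.8 * 1 * sqrt(2 / 250)     # rule-of-thumb approximation, nearly identical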

10.5 (Applied) Meta-analysis should weight by precision, examine heterogeneity (I²), and probe for publication bias (funnel plot asymmetry).

10.6 (Discussion) Tension between control (internal validity) and realism (external validity). Some prefer "policy-relevant" designs that sacrifice control; others prefer clean identification with separate discussion of generalization. No universal answer—depends on research question.


Chapter 11: Selection on Observables

11.1 (Conceptual) CIA requires all confounders observed. This is untestable. Sensitivity analysis asks: how much unobserved confounding would be needed to overturn results? Oster (2019) provides one framework.

11.2 (Conceptual) Matching: prune to comparable units, then compare. Regression: adjust parametrically. Matching is nonparametric but may discard data. Both rely on CIA. Doubly robust combines both for protection against misspecification of either.

11.3 (Applied) IPW and matching should achieve balance (check with balance tables). Compare estimates—if different, investigate why (different effective samples, model specification).
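A minimal base R IPW sketch on simulated data (packages such as WeightIt/MatchIt automate the bookkeeping and diagnostics):

    set.seed(6)
    n <- 5000
    x <- rnorm(n)
    d <- rbinom(n, 1, plogis(0.8 * x))       # treatment probability depends on x
    y <- 1 + 1.5 * d + 2 * x + rnorm(n)      # true effect = 1.5

    ps <- fitted(glm(d ~ x, family = binomial))      # estimated propensity score
    w  <- ifelse(d == 1, 1 / ps, 1 / (1 - ps))       # ATE weights

    c(naive = mean(y[d == 1]) - mean(y[d == 0]),
      ipw   = weighted.mean(y[d == 1], w[d == 1]) - weighted.mean(y[d == 0], w[d == 0]))
    # Balance check: weighted means of x should be similar across arms
    c(treated = weighted.mean(x[d == 1], w[d == 1]),
      control = weighted.mean(x[d == 0], w[d == 0]))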

11.4 (Applied) LaLonde exercise: compare experimental benchmark to observational estimators. Observational methods may fail due to selection on unobservables.

11.5 (Applied) Generalized propensity score for continuous treatments. Dose-response curve should show how effect varies with treatment intensity.

11.6 (Discussion) Researcher discretion in covariate selection can lead to specification searching. Pre-registration, sensitivity analysis, and transparency about choices help. Theory-driven covariate selection preferred to data-driven.

11.7 (Discussion) Target trial emulation: design observational study as if it were a trial. Clarifies: eligibility criteria, treatment strategies, outcome timing. Reduces immortal time bias and other common observational study errors.


Chapter 12: Instrumental Variables

12.1 (Conceptual) Exclusion restriction: instrument affects outcome only through treatment. Violated if instrument has direct effect. Example: quarter of birth might affect outcomes directly through age-at-test effects.

12.2 (Conceptual) Yes, both can be correct if they identify effects for different complier populations. LATE is local to compliers of each instrument. Different instruments may have different complier populations with different treatment effects.

12.3 (Applied) First stage should show strong relationship (F > 10 as rough threshold). 2SLS gives IV estimate. Compliers are those whose education changed due to draft lottery.
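A base R sketch on simulated data, computing the first-stage F and a manual 2SLS point estimate (use a packaged routine such as AER::ivreg or fixest::feols for correct standard errors):

    set.seed(7)
    n <- 5000
    z <- rbinom(n, 1, 0.5)                  # instrument (e.g., lottery)
    u <- rnorm(n)                           # unobserved confounder
    d <- 0.3 * z + 0.5 * u + rnorm(n)       # endogenous treatment (e.g., education)
    y <- 1 + 0.8 * d + 0.5 * u + rnorm(n)   # true effect = 0.8

    first <- lm(d ~ z)
    summary(first)$fstatistic[1]            # first-stage F; want it well above 10

    coef(lm(y ~ fitted(first)))[2]          # manual 2SLS point estimate, ~ 0.8
    coef(lm(y ~ d))[2]                      # OLS, biased upward by u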

12.4 (Applied) Weakening first stage (e.g., adding noise to instrument) shows how estimates become unstable and standard errors explode.

12.5 (Discussion) Casino distance: Relevance—plausible if closer = more gambling. Exclusion—problematic if distance correlates with urbanicity or income, which directly affect household finances. One would want to control for these, but doing so may weaken the instrument.


Chapter 13: Difference-in-Differences

13.1 (Conceptual) Parallel trends: absent treatment, treated and control groups would have followed parallel outcome paths. Testable implication: parallel pre-trends. But parallel pre-trends doesn't guarantee parallel post-trends.

13.2 (Conceptual) Event study shows dynamic treatment effects. Useful for: testing pre-trends (coefficients before treatment should be zero), examining effect timing and persistence.

13.3 (Applied) Standard TWFE may be biased with staggered timing and heterogeneous effects. Callaway-Sant'Anna disaggregates by cohort and timing, avoiding negative weighting problems.
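A hedged sketch of the Callaway-Sant'Anna estimator using the did package; panel_df and the column names (outcome, year, id, first_treat) are placeholders for your data, with first_treat = 0 for never-treated units per the package convention:

    library(did)

    cs <- att_gt(yname = "outcome", tname = "year", idname = "id",
                 gname = "first_treat", data = panel_df)   # group-time ATTs
    summary(aggte(cs, type = "dynamic"))                    # event-study aggregation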

13.4 (Applied) Analysis should follow published paper, reproducing main findings with did package.

13.5 (Applied) Synthetic DiD combines synthetic control weighting with DiD. Compare to pure DiD and pure SC.

13.6 (Discussion) Pre-trends test has limited power to detect violations. Honest approach: (1) be transparent about what pre-trends show, (2) conduct sensitivity analysis, (3) consider alternative explanations.


Chapter 14: Regression Discontinuity

14.1 (Conceptual) Continuity assumption: potential outcomes are continuous at cutoff. Implies no manipulation/sorting. Tested indirectly via: density tests (McCrary), covariate balance at cutoff.

14.2 (Conceptual) Bandwidth tradeoff: wider = more data (lower variance) but more bias from extrapolating away from the cutoff; narrower = less data (higher variance) but more locally valid. Data-driven methods (IK, CCT) balance bias and variance.

14.3 (Applied) RD plot should show: raw data (binned scatter), fitted curves, discontinuity. Placebo tests at non-cutoff thresholds should show no effect.
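A hedged sketch using the rdrobust package; y (outcome) and x (running variable, cutoff normalized to 0) are placeholders:

    library(rdrobust)

    rdplot(y, x, c = 0)                # binned scatter with fitted polynomials on each side
    summary(rdrobust(y, x, c = 0))     # local-linear estimate, robust CI, data-driven bandwidth
    # Placebo cutoff using only control-side observations; should show no jump
    summary(rdrobust(y[x < 0], x[x < 0], c = median(x[x < 0])))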

14.4 (Applied) Manipulation test (McCrary): look for discontinuity in density. If agents can manipulate running variable, RD may fail.

14.5 (Applied) Covariate balance: pre-determined covariates should be continuous at cutoff. If discontinuous, suggests manipulation or violation of continuity assumption.

14.6 (Discussion) RD identifies very local effect (at cutoff). Extrapolation requires additional assumptions. For policy, may want to know effects away from cutoff—RD alone doesn't provide this.


Chapter 15: Advanced Panel Methods

15.1 (Conceptual) Synthetic control constructs counterfactual from weighted average of donors. Weights chosen to match pre-treatment outcomes. Key assumption: what explains pre-treatment outcomes also explains post-treatment counterfactual.

15.2 (Conceptual) Good pre-treatment fit is necessary but not sufficient. A unit could match pre-trends by coincidence. However, poor fit suggests model may not extrapolate well.

15.3 (Applied) Weights should be reported. Donor pool selection affects results. Sensitivity analysis should examine robustness to donor pool changes.

15.4 (Applied) Placebo tests: apply method to untreated units. If "effects" found, method may be finding spurious patterns.

15.5 (Applied) Synthetic DiD: applies synthetic control weighting to DiD setting. May perform better than DiD when parallel trends questionable.

15.6 (Discussion) Inference is challenging because only one treated unit. Permutation-based inference (treating each potential donor as if treated) is common approach.


Chapter 16: Time Series Causal Inference

16.1 (Conceptual) SVAR: system of equations with contemporaneous relationships. Structural shocks identified through restrictions (Cholesky, sign restrictions, external instruments). Reduced form doesn't identify structural shocks.

16.2 (Conceptual) Local projections: direct regression of the outcome at horizon h on the shock. Robust to misspecification but less efficient; VARs are more efficient but sensitive to lag specification.
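A base R sketch of the simplest local projection (no control lags), estimating one regression per horizon on simulated data:

    set.seed(8)
    T <- 300
    shock <- rnorm(T)
    y <- stats::filter(shock, 0.6, method = "recursive") + rnorm(T, sd = 0.2)   # AR(1)-type response

    H <- 8
    irf <- sapply(0:H, function(h) {
      yh <- y[(1 + h):T]                  # outcome h periods ahead
      s  <- shock[1:(T - h)]
      coef(lm(yh ~ s))[["s"]]             # LP impulse response at horizon h
    })
    round(irf, 2)                         # decays roughly like 0.6^h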

16.3 (Applied) IRF should show response with confidence bands. Compare across identification schemes.

16.4 (Applied) External instruments: use high-frequency financial data or narrative shocks to identify monetary policy. Requires instrument validity arguments.

16.5 (Applied) Stock-Watson analysis of Great Moderation: separate "good luck" (smaller shocks) from "good policy" (better response to shocks).

16.6 (Discussion) Identification is fundamental challenge. Approaches: theory-based restrictions, external information, heteroskedasticity. No perfect solution—transparent presentation of uncertainty crucial.


Chapter 17: Partial Identification

17.1 (Conceptual) Point identification gives single value; partial identification gives set of values consistent with data and assumptions. Partial ID preferred when: point ID assumptions implausible, bounds informative for decision-making.

17.2 (Conceptual) Lee bounds: monotonicity means treatment only affects selection in one direction. Violated if treatment causes some to drop out and others to persist.

17.3 (Conceptual) Oster (2019): assumes proportional selection on observables and unobservables. Can be reframed as partial identification under selection assumptions of varying strength.

17.4 (Applied) Lee bounds: trim sample to create comparable treated and control groups. Width of bounds reflects severity of selection problem.

17.5 (Applied) Manski bounds: outcome ∈ [Y_min, Y_max]. With no assumptions at all, ATE ∈ [Y_min - Y_max, Y_max - Y_min]; using the observed data, the worst-case bounds tighten to width Y_max - Y_min. Monotonicity tightens them considerably further.
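A worked base R sketch for a binary outcome (so Y_min = 0, Y_max = 1), imputing the unobserved potential outcomes at their extremes:

    set.seed(9)
    n <- 2000
    d <- rbinom(n, 1, 0.4)
    y <- rbinom(n, 1, 0.3 + 0.2 * d)
    p <- mean(d)

    # Bound E[Y(1)]: observed for the treated, imputed at 0 or 1 for the untreated (and vice versa for Y(0))
    ey1_lo <- p * mean(y[d == 1]) + (1 - p) * 0;  ey1_hi <- p * mean(y[d == 1]) + (1 - p) * 1
    ey0_lo <- (1 - p) * mean(y[d == 0]) + p * 0;  ey0_hi <- (1 - p) * mean(y[d == 0]) + p * 1
    c(lower = ey1_lo - ey0_hi, upper = ey1_hi - ey0_lo)   # bound width equals Y_max - Y_min = 1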

17.6 (Discussion) Bounds are informative when: (1) they exclude zero (significant effect), (2) they're narrow enough for decisions, (3) they're combined with other evidence. Wide bounds honestly represent uncertainty.


Chapter 18: Programming Causal Inference

18.1 (Conceptual) Different matching methods handle ties and distance metrics differently. Nearest neighbor is greedy; optimal matching minimizes total distance. Differences highlight that causal effect estimates depend on methodological choices, not just data.

18.2 (Conceptual) Regular coefficients assume homogeneous effects; sunab (Sun-Abraham) allows for heterogeneous treatment effects across cohorts. Prefer sunab when treatment effects likely vary by cohort; regular when willing to assume homogeneity.
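A hedged fixest sketch of the comparison; panel_df and the variables (outcome, treated, id, year, cohort = first treatment period) are placeholders, and the fixest documentation describes how to code never-treated units in sunab:

    library(fixest)

    twfe <- feols(outcome ~ treated | id + year, data = panel_df)             # homogeneous-effects TWFE
    sa   <- feols(outcome ~ sunab(cohort, year) | id + year, data = panel_df) # Sun-Abraham interactions
    iplot(sa)                                                                 # cohort-robust event-study plot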

18.3 (Conceptual) Data-driven bandwidth is optimal for bias-variance tradeoff under regularity conditions. Domain knowledge may suggest different weighting of bias vs. variance, or may indicate bandwidth affects complier population in substantively important ways. Report both as robustness check.

18.4-18.7 (Applied) See documentation for respective packages: ri2/randomizr, MatchIt/WeightIt, did/fixest, rdrobust.

18.8 (Discussion) R packages often more statistical/academic in design; Python packages often more software-engineering oriented. Choice depends on: existing codebase, team skills, deployment environment, specific package features needed.


Part IV: Beyond Averages

Chapter 19: Mechanisms

19.1 (Conceptual) Mediation analysis decomposes total effect into direct and indirect (through mediator). Challenge: sequential ignorability requires no unmeasured confounders of treatment-mediator and mediator-outcome relationships. Latter especially demanding.

19.2 (Conceptual) Front-door requires: (1) treatment affects mediator, (2) mediator fully mediates treatment, (3) no backdoor paths from mediator to outcome (conditional on treatment). Rare in practice.

19.3 (Applied) Sensitivity analysis should examine how much confounding would be needed to explain away mediated effect. Baron-Kenny vs. causal mediation should show differences when assumptions differ.

19.4 (Applied) Experimental manipulation of mediator (if feasible) strengthens causal claim about mechanism.

19.5 (Applied) Process tracing: within-case analysis of causal mechanisms. Complements statistical mediation with detailed narrative evidence.

19.6 (Discussion) Many "mechanisms" in economics are really effect heterogeneity by potential mediator value, not true mechanisms. The distinction matters: heterogeneity analysis conditions on mediator, which may introduce selection bias.


Chapter 20: Heterogeneity and Generalization

20.1 (Conceptual) Treatment effect heterogeneity: effects vary across individuals. Matters for: policy targeting, understanding mechanisms, external validity.

20.2 (Conceptual) Subgroup analysis pitfalls: multiple testing, data mining, hindsight bias. Require: pre-registration, multiple testing correction, replication.

20.3 (Applied) Causal forest estimates CATE. Variable importance shows which characteristics predict heterogeneity. But importance ≠ causal mechanism.
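A hedged grf sketch; X (covariate matrix), Y (outcome), and W (0/1 treatment) are placeholders:

    library(grf)

    cf      <- causal_forest(X, Y, W)             # honest forest for CATEs
    tau_hat <- predict(cf)$predictions            # out-of-bag CATE estimates
    average_treatment_effect(cf)                  # doubly robust ATE
    vi <- variable_importance(cf)
    order(vi, decreasing = TRUE)                  # columns of X ranked by split importance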

20.4 (Applied) Optimal policy learning identifies which individuals to treat based on estimated CATEs. Requires considering treatment costs and targeting costs.

20.5 (Applied) Multi-site meta-analysis examines cross-site heterogeneity and predictors of site-level effects.

20.6 (Discussion) External validity requires assumptions about why effects vary. Pure data extrapolation risky. Theory + data combination more defensible.


Chapter 21: ML for Causal Inference

21.1 (Conceptual) Regularization biases coefficients toward zero, which is helpful for prediction but problematic for causal inference. The coefficient of interest is shrunk along with nuisance parameters.

21.2 (Conceptual) Sample splitting (cross-fitting) prevents overfitting bias. Without it, ML predictions are too optimistic on training data, leading to biased treatment effect estimates.
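A base R sketch of 2-fold cross-fitting in a partially linear model, with polynomial regressions standing in for the ML nuisance learners:

    set.seed(10)
    n <- 4000
    x <- rnorm(n)
    d <- 0.5 * x + rnorm(n)
    y <- 1 + 1.0 * d + sin(2 * x) + rnorm(n)      # true effect of d = 1.0

    fold <- sample(rep(1:2, length.out = n))
    res_y <- res_d <- numeric(n)
    for (k in 1:2) {
      train <- fold != k; test <- fold == k
      # Fit nuisance models E[y|x] and E[d|x] on the training fold, predict on the held-out fold
      res_y[test] <- y[test] - predict(lm(y ~ poly(x, 5), subset = train), data.frame(x = x[test]))
      res_d[test] <- d[test] - predict(lm(d ~ poly(x, 5), subset = train), data.frame(x = x[test]))
    }
    coef(lm(res_y ~ res_d))[["res_d"]]            # residual-on-residual estimate, ~ 1.0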

21.3 (Conceptual) Prediction policy problems: predict outcomes to target intervention. Causal problems: estimate effect of intervention. Healthcare example: prediction = risk scoring; causal = treatment effect estimation.

21.4 (Applied) Compare methods on: point estimate, standard error, sensitivity to tuning choices. DML should be robust to ML method choice if methods achieve good nuisance estimation.

21.5 (Applied) Calibration check: rank by estimated CATE, then compare actual treatment effects across quantiles. Well-calibrated forest should show monotonic relationship.

21.6 (Discussion) Both views have merit. ML methods make fewer functional form assumptions but are harder to interpret. Simple methods are interpretable but may be misspecified. Best practice: compare methods, examine sensitivity, present transparently.


Chapter 22: Programming Beyond Averages

22.1 (Conceptual) Honesty (sample splitting) prevents overfitting: trees are constructed on one sample and estimates computed on another. Cost: reduced sample size for each task. Benefit: valid inference.

22.2 (Conceptual) Cross-fitting prevents overfitting bias. If nuisance functions estimated on full sample, predictions are overly optimistic where data informed model, biasing treatment effect estimates.

22.3 (Conceptual) Variable importance shows which variables best partition treatment effect heterogeneity. Does not indicate direction of heterogeneity—a variable could be important because effects are larger OR smaller for high values.

22.4-22.6 (Applied) See package documentation: grf, DoubleML, econml.

22.7 (Discussion) Consider: documentation quality, community support, integration with other tools, specific features needed. For teaching: which has better tutorials? For production: which has better testing and maintenance?

22.8 (Discussion) Both views valid. Key insight: ML doesn't eliminate assumptions—it changes which assumptions are made. Transparency about remaining assumptions crucial regardless of method.


Part V: Integration & Practice

Chapter 23: Triangulation

23.1 (Conceptual) Triangulation: combining methods to strengthen conclusions. Types: data triangulation (multiple sources), method triangulation (multiple approaches), investigator triangulation (multiple researchers), theory triangulation (multiple frameworks).

23.2 (Conceptual) Convergence strengthens confidence; divergence requires explanation. Neither automatically resolves uncertainty—must assess quality and relevance of each source.

23.3 (Applied) Design should specify: research questions, data sources, analysis methods, integration strategy.

23.4 (Applied) Analysis should: examine each method separately, assess quality/limitations, synthesize findings, identify areas of agreement/disagreement.

23.5 (Discussion) Decision context matters. Sometimes bounds are sufficient (sign of effect, rough magnitude). Other times more precision needed. Honest uncertainty acknowledgment preferable to false precision.


Chapter 24: Evidence Synthesis

24.1 (Conceptual) Fixed effects: assumes single true effect. Random effects: assumes distribution of effects. Random effects gives more weight to smaller studies (shrinkage toward mean). Prefer random when heterogeneity expected; problematic if heterogeneity reflects study quality.

24.2 (Conceptual) High I² indicates heterogeneity in effects, not necessarily error. A narrow pooled CI with high I² says: average effect is precisely estimated, but individual effects vary. May indicate moderators to explore.
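A base R sketch with hypothetical study estimates, showing inverse-variance pooling, Cochran's Q, and I²:

    est <- c(0.20, 0.35, 0.10, 0.42, 0.25)     # hypothetical study effect estimates
    se  <- c(0.08, 0.15, 0.06, 0.20, 0.10)     # their standard errors
    w   <- 1 / se^2                            # precision (inverse-variance) weights

    pooled    <- sum(w * est) / sum(w)
    pooled_se <- sqrt(1 / sum(w))
    Q  <- sum(w * (est - pooled)^2)            # Cochran's Q
    I2 <- max(0, (Q - (length(est) - 1)) / Q)  # share of variation beyond sampling error
    c(pooled = pooled, se = pooled_se, Q = Q, I2 = I2)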

24.3 (Applied) Moderators might include: study design, sample characteristics, outcome measurement, treatment dosage, context. Theory should guide selection.

24.4 (Applied) Specification curve: vary key analytical choices, plot distribution of results. Robustness if results stable across specifications; concern if highly sensitive.

24.5 (Discussion) Arguments for pre-registration: reduces p-hacking, specification searching. Arguments against: administrative burden, may not fit exploratory research. Middle ground: distinguish confirmatory (pre-registered) from exploratory analysis.


Chapter 25: Research Practice

25.1 (Conceptual) P-hacking: trying many specifications until p<0.05. HARKing: presenting post-hoc hypotheses as a priori. Both inflate false positive rates. Solutions: pre-registration, transparency, replication.

25.2 (Conceptual) Pre-registration: publicly document analysis plan before seeing data. Benefits: prevents specification searching. Limitations: constrains exploratory analysis, doesn't prevent cheating.

25.3 (Applied) Pre-registration should include: hypotheses, sample, variables, analysis plan, robustness checks.

25.4 (Applied) Specification curve should show: distribution of results across specifications, which choices matter most.

25.5 (Applied) Poster/memo should: state question, summarize methods, present results, discuss limitations—all accessibly.

25.6 (Discussion) Reproducibility crisis varies by field. Economics has replication issues but also strong tradition of method transparency. Improvements needed in: code sharing, data availability, pre-registration culture.


Chapter 26: Programming Projects

26.1 (Conceptual) Code-on-request fails because: requests go unanswered, code becomes incompatible, researcher moves on. Public posting enables: immediate verification, building on existing work, error correction.

26.2 (Conceptual) Replicability: same inputs → same outputs. Reproducibility: similar methods → consistent conclusions. Both matter: replicability for verification, reproducibility for scientific generalization.

26.3 (Conceptual) Git never forgets. Sensitive data in commits persists even after deletion. Best practice: never commit sensitive data. Use: .gitignore, environment variables, secrets management.

26.4-26.6 (Applied) Projects should demonstrate: clear structure, documentation, version control, reproducibility.

26.7 (Discussion) Approaches for restricted data: synthetic data, summary statistics only, secure access rooms, code without data (documentation of how to obtain access), containerized environments.

26.8 (Discussion) ROI of reproducibility infrastructure: high for long-running projects, collaborative work, returning to old projects. Lower for one-off analyses. Even for small projects, good habits prevent future problems.


Appendix: Data Sources for Exercises

Many exercises reference specific datasets. Here are suggested sources:

  • CPS (Current Population Survey): IPUMS-CPS (cps.ipums.org)

  • Experimental data: J-PAL Dataverse, AEA RCT Registry

  • Administrative data: FOIA requests, state data centers

  • Teaching datasets: Wooldridge data (via R/Stata packages), Stock & Watson data


Solutions last updated: January 2026
