Chapter 26: Programming Companion—Project Management
Opening Question
How do we organize research projects so they can be understood, reproduced, and extended by others---including our future selves?
Chapter Overview
This chapter addresses the practical infrastructure of empirical research projects. While earlier programming chapters covered specific methods, this chapter focuses on the organizational practices that make research reproducible and collaborative: project structure, documentation, version control workflows, and creating replication packages.
These skills are increasingly required. Major economics journals now mandate data and code availability. Funders expect reproducibility. Collaborators need to understand your work. And you will thank yourself when you return to a project after months away and find clear documentation instead of cryptic files.
What you will learn:
How to structure projects for clarity and reproducibility
How to write effective documentation at multiple levels
How to use Git for collaboration and code review
How to create replication packages that actually work
Prerequisites: Chapter 4 (basic programming workflow)
26.1 Project Structure
Principles
A well-organized project should be:
Navigable: New users can find things quickly
Self-documenting: Structure reveals purpose
Reproducible: Clear path from raw data to results
Modular: Components can be understood independently
Recommended Structure
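A sketch of one common layout; the directory and file names are illustrative and should be adapted to your project:

```
project/
├── README.md              # what the project is and how to run it
├── run_all.R              # master script (or Makefile / _targets.R)
├── data/
│   ├── raw/               # original data, never modified
│   └── processed/         # cleaned data produced by the scripts
├── code/
│   ├── 01_download_data.R
│   ├── 02_clean_data.R
│   ├── 03_analysis.R
│   └── 04_tables.R
├── output/
│   ├── tables/
│   └── figures/
└── docs/                  # codebook, analysis notes, known issues
```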
Naming Conventions
Files:
Use lowercase with underscores: clean_survey_data.R
Number sequential scripts: 01_, 02_, etc.
Be descriptive but concise: analyze_wages.R, not a.R
Date outputs if needed: results_2025-01-15.csv
Variables:
Use consistent style (snake_case or camelCase)
Be descriptive: log_hourly_wage, not y
Avoid reserved words and special characters
Document abbreviations in codebook
Directories:
Match structure to workflow stages
Use consistent naming across projects
Don't nest too deeply (3-4 levels max)
Running Your Project: From Simple to Sophisticated
Level 1: Master Script (Simple Projects)
For small projects, a linear script works fine:
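A minimal sketch of such a master script in R; the script names are illustrative and follow the numbering convention above:

```r
# run_all.R: run the full pipeline from raw data to final output, in order.
source("code/01_download_data.R")  # download public data into data/raw/
source("code/02_clean_data.R")     # clean and merge into data/processed/
source("code/03_analysis.R")       # estimate models, save results
source("code/04_tables.R")         # format results into output/tables/
```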
Problem: If you change only 04_tables.R, you still re-run everything. For large datasets or slow models, this wastes hours.
Level 2: Dependency-Based Pipelines (Complex Projects)
For projects with expensive computations, use dependency management. The system tracks which outputs depend on which inputs and only rebuilds what changed.
R with targets:
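A sketch of a _targets.R file; the helper functions clean_data(), run_models(), and make_table() are placeholders you would define yourself:

```r
# _targets.R: declare the pipeline; run it with targets::tar_make()
library(targets)
source("code/functions.R")  # defines clean_data(), run_models(), make_table() (placeholders)

tar_option_set(packages = c("dplyr", "fixest"))

list(
  tar_target(raw_file, "data/raw/survey.csv", format = "file"),  # track the raw file itself
  tar_target(panel, clean_data(raw_file)),                       # rebuilt only if raw_file changes
  tar_target(models, run_models(panel)),
  tar_target(table1, make_table(models, "output/tables/table1.tex"), format = "file")
)
```

Running targets::tar_make() rebuilds only the targets whose upstream inputs have changed.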
With Make (language-agnostic):
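An equivalent sketch as a Makefile; the paths are illustrative, and recipe lines must be indented with a tab:

```make
# Each rule names an output, the inputs it depends on, and the command that rebuilds it.
all: output/tables/table1.tex

data/processed/panel.csv: code/02_clean_data.R data/raw/survey.csv
	Rscript code/02_clean_data.R

output/tables/table1.tex: code/03_analysis.R code/04_tables.R data/processed/panel.csv
	Rscript code/03_analysis.R
	Rscript code/04_tables.R
```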
Run with make all. Only outdated files rebuild.
Why this matters:
Change one figure's code → only that figure rebuilds
Team member pulls your changes → only affected outputs rebuild
Confident that outputs match current code (no stale files)
Environment Reproducibility
Your code ran in 2024 with specific package versions. Will it run in 2027?
R with renv:
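In outline, the renv workflow is:

```r
renv::init()      # one-time: create a project-local library and renv.lock
renv::snapshot()  # after installing or updating packages: record exact versions
renv::restore()   # on another machine, or years later: reinstall the recorded versions
```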
Python with conda or pip:
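The analogous shell commands for Python; the environment and file names are the conventional defaults:

```sh
# conda: export the full environment specification, recreate it elsewhere
conda env export > environment.yml
conda env create -f environment.yml

# pip: pin exact package versions, install them elsewhere
pip freeze > requirements.txt
pip install -r requirements.txt
```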
Docker (Gold Standard):
For complete reproducibility, containerize your entire environment:
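A Dockerfile sketch for an R project; the base image and packages are illustrative, and in practice you would restore packages from renv.lock rather than installing them ad hoc:

```dockerfile
# Build an image containing the OS, R version, packages, and project code
FROM rocker/r-ver:4.3.2

# Install required R packages (better: restore from renv.lock inside the image)
RUN R -e "install.packages(c('dplyr', 'fixest'))"

# Copy the project in and make the full pipeline the default command
COPY . /project
WORKDIR /project
CMD ["Rscript", "run_all.R"]
```

Build the image with docker build -t myproject . and reproduce the analysis with docker run myproject (the tag myproject is arbitrary).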
Why Docker matters:
"Works on my machine" → works everywhere
Captures OS, libraries, package versions—everything
Journals increasingly accept/require Docker for replication
Run 2024 analysis in 2027 exactly as it was
| Approach      | Effort | Reproducibility     | Best for                 |
| ------------- | ------ | ------------------- | ------------------------ |
| Master script | Low    | Basic               | Small/simple projects    |
| renv/conda    | Medium | Package versions    | Most projects            |
| Make/targets  | Medium | Build dependencies  | Complex pipelines        |
| Docker        | High   | Complete            | Publications, long-term  |
26.2 Documentation
README Files
Every project needs a main README.md:
Python Packages
Replication Instructions
Quick Start
Clone this repository
Download data following instructions in data/raw/README.md
Run Rscript run_all.R
Step by Step
01_download_data.R - Downloads public data files
02_clean_data.R - Cleans and merges data
03_analysis.R - Main regression analysis
... etc.
Output
Tables are in output/tables/
Figures are in output/figures/
Table X corresponds to table_X.tex
Figure Y corresponds to figure_Y.pdf
Notes
Runtime: Approximately 2 hours on a standard laptop
Memory: Requires at least 8GB RAM for the data cleaning step
Known issues: See docs/known_issues.md
License
[License type] - see LICENSE file
Code Comments
Comment code to explain why, not what:
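For example (a hypothetical snippet):

```r
# What-comment (redundant): restates the code
# Drop rows with missing wage
df <- df[!is.na(df$wage), ]

# Why-comment (useful): records the reasoning behind the decision
# Drop observations with missing wages: these respondents skipped the earnings
# module, and imputation changed the estimates in robustness checks
# (see docs/analysis_notes.md, 2025-01-10).
df <- df[!is.na(df$wage), ]
```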
Analysis Notes
Document major decisions in a separate file:
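A sketch of such a file, say docs/analysis_notes.md, with illustrative entries:

```
2025-01-10  Sample restriction
  Dropped workers under 18 and over 65 to match the comparison sample.
  Using 16-70 instead changes point estimates by under 2% (output/robustness/).

2025-01-22  Standard errors
  Switched from heteroskedasticity-robust to state-clustered standard errors
  because treatment varies at the state level.
```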
26.3 Collaboration with Git
Branch Workflow
For collaborative projects:
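A typical cycle, sketched as shell commands; the branch and file names are illustrative:

```sh
git checkout main && git pull           # start from up-to-date main
git checkout -b add-wage-regressions    # work on a feature branch
# ...edit, run, and check the code...
git add code/03_analysis.R
git commit -m "Add wage regressions with state fixed effects"
git push -u origin add-wage-regressions
# Open a pull request, request review, and merge into main after approval
```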
Commit Messages
Write informative commit messages:
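For example (hypothetical messages):

```sh
# Vague: says nothing about what changed or why
git commit -m "fixes"

# Informative: summarizes the change and the reason for it
git commit -m "Fix wage top-coding: apply 1.5x cap to match survey documentation"
```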
Code Review
When reviewing collaborators' code:
Reviewer checklist:
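One possible checklist, to adapt to your project:
Does the code run from a fresh clone without manual steps?
Are paths relative to the project root (no hard-coded personal directories)?
Are variable and function names clear, and non-obvious choices commented?
Were affected tables and figures regenerated so outputs match the code?
Is anything sensitive (keys, credentials, personal data) kept out of the repository?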
Pull request template:
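A minimal template, stored for example as .github/PULL_REQUEST_TEMPLATE.md so it pre-fills every pull request; the section names are suggestions:

```
## Summary
What changed and which scripts/outputs it affects

## Motivation
Issue, referee comment, or analysis note this responds to

## How to check
Which script(s) to run and which tables/figures to compare

## Checklist
- [ ] Runs from a fresh clone
- [ ] Affected outputs regenerated and committed
```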
Handling Conflicts
When merge conflicts occur:
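The basic resolution steps, as shell commands; the conflicted file name is illustrative:

```sh
git pull origin main          # bring in the latest main; Git marks any conflicts
# Open each conflicted file and resolve the blocks between the
# <<<<<<< and >>>>>>> markers, keeping the correct version of the code
git add code/03_analysis.R    # mark the file as resolved
git commit                    # complete the merge
```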
Prevention:
Pull from main frequently
Communicate about who's working on what
Keep changes small and focused
Avoid reformatting code unnecessarily
26.4 Replication Packages
Journal Requirements
Most economics journals now require replication packages:
AEA journals:
All data and code must be deposited
Must include README with clear instructions
Must run "out of the box" where possible
Confidential data: provide code and describe access
Other journals:
Similar requirements increasingly common
Check specific journal policies
Some accept links to repositories (GitHub, Dataverse)
Creating a Replication Package
Step 1: Verify everything runs (see the sketch after this list)
Step 2: Minimize and clean
Remove unnecessary files
Delete intermediate data if reproducible
Remove personal paths
Check for sensitive information
Step 3: Write comprehensive README
Include all steps needed to reproduce
Document software versions
Estimate runtime and resources
Note any known issues
Step 4: Test on fresh machine
Ask colleague to run package
Or use Docker/container
Fix any issues discovered
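For steps 1 and 4, a fresh-machine check can be as simple as the following sketch; the repository URL is a placeholder, and running inside a Docker container serves the same purpose:

```sh
# Clone into a clean directory so nothing from your working copy leaks in
git clone https://github.com/username/project-replication.git   # placeholder URL
cd project-replication

# Restore recorded package versions, then run the full pipeline
Rscript -e "renv::restore()"
Rscript run_all.R

# Compare the regenerated tables and figures against the versions in the paper
```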
README Template for Replication
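A condensed sketch of such a README; adapt the sections to the journal's requirements (the AEA Data Editor publishes a fuller template you can start from):

```
# Replication package for "Paper Title"

## Data availability
- data/raw/survey.csv: public; download instructions in data/raw/README.md
- Administrative records: confidential; access procedure described below, cleaning code provided

## Software requirements
- R 4.3.2 with packages recorded in renv.lock (restore with renv::restore())

## Instructions
1. Rscript -e "renv::restore()"
2. Rscript run_all.R   (approx. 2 hours, 8GB RAM)

## Mapping of outputs to the paper
- output/tables/table_X.tex   -> Table X
- output/figures/figure_Y.pdf -> Figure Y
```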
Package Structure
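A layout a reader might expect after unzipping the package (illustrative):

```
replication-package/
├── README.md            # the replication README described above
├── LICENSE
├── run_all.R
├── renv.lock            # or environment.yml / Dockerfile
├── data/
│   └── raw/             # public data, or a README describing how to obtain restricted data
├── code/
│   ├── 01_download_data.R
│   ├── 02_clean_data.R
│   ├── 03_analysis.R
│   └── 04_tables.R
└── output/              # regenerated in full when run_all.R finishes
    ├── tables/
    └── figures/
```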
Archiving
Repositories:
AEA Data and Code Repository (for AEA journals)
ICPSR
Harvard Dataverse
Zenodo
GitHub (less permanent)
Best practices:
Get DOI for citability
Use open license (MIT, CC-BY)
Include version number
Update if errors found
Summary
Key takeaways:
Project structure should be self-documenting: clear naming, numbered scripts, separation of data/code/output.
Documentation exists at multiple levels: README for project overview, codebooks for data, comments for code decisions, analysis notes for research choices.
Git enables collaboration through branching, pull requests, and code review; write informative commit messages.
Replication packages should run "out of the box": verify them on a fresh machine, document data access, and archive them with a DOI.
Returning to the opening question: Organizing projects for reproducibility requires upfront investment but pays dividends. Your future self---returning to a project after months---will benefit as much as external replicators. The key is building good practices into your workflow from the start, not trying to impose them at the end.
Further Reading
Essential
Christensen, G., J. Freese, and E. Miguel (2019). "Transparent and Reproducible Social Science Research." UC Press.
Gentzkow, M. and J. Shapiro (2014). "Code and Data for the Social Sciences: A Practitioner's Guide."
Practical Guides
AEA Data and Code Availability Policy: https://www.aeaweb.org/journals/data
"The Turing Way": https://the-turing-way.netlify.app/
Exercises
Conceptual
A researcher argues: "I share my code on request, which is sufficient for transparency." Another argues: "Code must be publicly posted with the paper for true reproducibility." What are the tradeoffs between these positions? When might each be appropriate?
Explain the difference between reproducibility (same data, same code, same results) and replicability (new data, same methods, consistent findings). Why do both matter for scientific credibility?
What are the risks of putting API keys, passwords, or personally identifiable information in a Git repository? How should sensitive data be handled in a replication package?
Applied
Take an existing research project (yours or a provided example) and:
Reorganize it following the recommended structure
Write a comprehensive README
Create a codebook for the main dataset
Verify it runs from scratch
Set up a collaborative workflow with a colleague:
Create a shared repository
Practice the branch-PR-review-merge workflow
Resolve an intentionally created merge conflict
Create a minimal replication package for a simple analysis:
Include all necessary components
Have someone else attempt to run it
Fix any issues they encounter
Discussion
Many journals now require data and code availability. However, some data (administrative, proprietary, confidential) cannot be shared. How should researchers balance reproducibility requirements with data access constraints? What are best practices for "reproducibility under constraints"?
Some argue that detailed documentation and replication packages slow down research by adding overhead. Others argue they save time in the long run by preventing errors and enabling future work. Based on your experience, when is the investment in reproducibility infrastructure worth it?