Chapter 26: Programming Companion—Project Management

Opening Question

How do we organize research projects so they can be understood, reproduced, and extended by others, including our future selves?


Chapter Overview

This chapter addresses the practical infrastructure of empirical research projects. While earlier programming chapters covered specific methods, this chapter focuses on the organizational practices that make research reproducible and collaborative: project structure, documentation, version control workflows, and creating replication packages.

These skills are increasingly required. Major economics journals now mandate data and code availability. Funders expect reproducibility. Collaborators need to understand your work. And you will thank yourself when you return to a project after months away and find clear documentation instead of cryptic files.

What you will learn:

  • How to structure projects for clarity and reproducibility

  • How to write effective documentation at multiple levels

  • How to use Git for collaboration and code review

  • How to create replication packages that actually work

Prerequisites: Chapter 4 (basic programming workflow)


26.1 Project Structure

Principles

A well-organized project should be:

  • Navigable: New users can find things quickly

  • Self-documenting: Structure reveals purpose

  • Reproducible: Clear path from raw data to results

  • Modular: Components can be understood independently

Naming Conventions

Files:

  • Use lowercase with underscores: clean_survey_data.R

  • Number sequential scripts: 01_, 02_, etc.

  • Be descriptive but concise: analyze_wages.R not a.R

  • Date outputs if needed: results_2025-01-15.csv

Variables:

  • Use consistent style (snake_case or camelCase)

  • Be descriptive: log_hourly_wage not y

  • Avoid reserved words and special characters

  • Document abbreviations in codebook

Directories:

  • Match structure to workflow stages

  • Use consistent naming across projects

  • Don't nest too deeply (3-4 levels max)
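
For example, one common layout (directory names are illustrative) might be:

```
project/
├── README.md
├── 01_download_data.R
├── 02_clean_data.R
├── 03_analysis.R
├── 04_tables.R
├── data/
│   ├── raw/        # original files, never edited by hand
│   └── clean/      # derived datasets built by scripts
├── output/
│   ├── tables/
│   └── figures/
└── docs/           # codebooks, analysis notes
```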

Running Your Project: From Simple to Sophisticated

Level 1: Master Script (Simple Projects)

For small projects, a linear script works fine:
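
A minimal sketch, assuming the numbered scripts above:

```r
# run_all.R: rebuilds the entire project from raw data to outputs
source("01_download_data.R")  # download public data files
source("02_clean_data.R")     # clean and merge
source("03_analysis.R")       # main regression analysis
source("04_tables.R")         # export tables and figures
```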

Problem: If you change only 04_tables.R, you still re-run everything. For large datasets or slow models, this wastes hours.

Level 2: Dependency-Based Pipelines (Complex Projects)

For projects with expensive computations, use dependency management. The system tracks which outputs depend on which inputs and only rebuilds what changed.

R with targets:
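
A minimal sketch; clean_survey(), fit_wage_model(), and make_tables() are stand-ins for your own functions:

```r
# _targets.R: defines the pipeline; run it with targets::tar_make()
library(targets)
tar_source()  # load helper functions from the R/ directory

list(
  tar_target(raw_file, "data/raw/survey.csv", format = "file"),
  tar_target(clean_data, clean_survey(raw_file)),
  tar_target(model, fit_wage_model(clean_data)),
  tar_target(tables, make_tables(model))
)
```

After editing only make_tables(), tar_make() rebuilds the tables target and nothing else.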

With Make (language-agnostic):
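
A minimal sketch using the same illustrative file names (note that Make recipes must be indented with tabs):

```make
all: output/tables/table_1.tex

data/clean/survey.rds: 02_clean_data.R data/raw/survey.csv
	Rscript 02_clean_data.R

output/results.rds: 03_analysis.R data/clean/survey.rds
	Rscript 03_analysis.R

output/tables/table_1.tex: 04_tables.R output/results.rds
	Rscript 04_tables.R

.PHONY: all
```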

Run with make all. Only outdated files rebuild.

Why this matters:

  • Change one figure's code → only that figure rebuilds

  • Team member pulls your changes → only affected outputs rebuild

  • Confident that outputs match current code (no stale files)

Environment Reproducibility

Your code ran in 2024 with specific package versions. Will it run in 2027?

R with renv:
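
The core workflow is three calls:

```r
renv::init()      # create a project-local library and an renv.lock lockfile
renv::snapshot()  # record the exact package versions in renv.lock
renv::restore()   # on another machine (or in 2027), reinstall those versions
```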

Python with conda or pip:
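
Either tool can record the current environment and recreate it later:

```sh
# Record
conda env export > environment.yml    # conda
pip freeze > requirements.txt         # pip

# Recreate
conda env create -f environment.yml   # conda
pip install -r requirements.txt       # pip
```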

Docker (Gold Standard):

For complete reproducibility, containerize your entire environment:
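
A minimal sketch, assuming an renv lockfile and a run_all.R entry point; the rocker project's r-ver images pin a specific R version:

```dockerfile
FROM rocker/r-ver:4.3.1

WORKDIR /project

# Restore the package versions recorded by renv
COPY renv.lock renv.lock
RUN R -e "install.packages('renv'); renv::restore()"

# Copy the project and make the pipeline the default command
COPY . .
CMD ["Rscript", "run_all.R"]
```

Build with docker build -t myproject . and run the full pipeline with docker run myproject.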

Why Docker matters:

  • "Works on my machine" → works everywhere

  • Captures OS, libraries, package versions—everything

  • Journals increasingly accept/require Docker for replication

  • Run 2024 analysis in 2027 exactly as it was

| Approach      | Complexity | Reproducibility    | When to Use              |
|---------------|------------|--------------------|--------------------------|
| Master script | Low        | Basic              | Small/simple projects    |
| renv/conda    | Medium     | Package versions   | Most projects            |
| Make/targets  | Medium     | Build dependencies | Complex pipelines        |
| Docker        | High       | Complete           | Publications, long-term  |


26.2 Documentation

README Files

Every project needs a main README.md:

Python Packages

[List required packages and versions]

Replication Instructions

Quick Start

  1. Clone this repository

  2. Download data following instructions in data/raw/README.md

  3. Run Rscript run_all.R

Step by Step

  1. 01_download_data.R - Downloads public data files

  2. 02_clean_data.R - Cleans and merges data

  3. 03_analysis.R - Main regression analysis

  4. ... etc.

Output

  • Tables are in output/tables/

  • Figures are in output/figures/

  • Table X corresponds to table_X.tex

  • Figure Y corresponds to figure_Y.pdf

Notes

  • Runtime: Approximately 2 hours on a standard laptop

  • Memory: Requires at least 8GB RAM for the data-cleaning step

  • Known issues: See docs/known_issues.md

License

[License type] - see LICENSE file

Code Comments

Comment code to explain why, not what:
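
For example (the cleaning step is hypothetical):

```r
# Weak: restates what the code does
# take the log of wage
df$log_wage <- log(df$wage)

# Better: explains why the decision was made
# Top-code wages at the 99th percentile before logging; a handful of
# implausible values (> $500/hour) otherwise dominate the estimates.
cap <- quantile(df$wage, 0.99, na.rm = TRUE)
df$log_wage <- log(pmin(df$wage, cap))
```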

Analysis Notes

Document major decisions in a separate file:
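
For instance, dated entries in docs/analysis_notes.md (content hypothetical):

```markdown
## 2025-01-15: Outlier handling
Top-coded hourly wages at the 99th percentile rather than dropping them.
Results are robust to either choice; see output/robustness/.

## 2025-01-20: Sample restriction
Restricted to ages 25-60 to avoid schooling and retirement margins.
```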


26.3 Collaboration with Git

Branch Workflow

For collaborative projects:
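
A typical feature-branch cycle looks like this (branch name illustrative):

```sh
git checkout -b robustness-checks       # one branch per task
# ...edit files, committing in small steps...
git add 03_analysis.R
git commit -m "Add placebo test for treatment timing"
git push -u origin robustness-checks    # then open a pull request

# After the pull request is reviewed and merged:
git checkout main
git pull
```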

Commit Messages

Write informative commit messages:
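
For example:

```
# Unhelpful
fix stuff
more changes

# Informative: a short summary line, then context if needed
Cluster standard errors at the state level

Matches the sampling design; individual-level clustering understated
the standard errors in Table 2.
```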

Code Review

When reviewing collaborators' code, comment through the pull request and work from a shared checklist.

Reviewer checklist:
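
One reasonable version, to adapt to your project:

```markdown
- [ ] Code runs from a clean environment without manual tweaks
- [ ] Output changes are intended and explained in the PR description
- [ ] Names and file locations follow project conventions
- [ ] Nontrivial decisions are commented or noted in the analysis notes
- [ ] No hard-coded personal paths, credentials, or sensitive data
```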

Pull request template:
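
On GitHub, a file at .github/pull_request_template.md pre-fills every new pull request; a simple version:

```markdown
## What does this change?

## Why?

## Which outputs are affected?

## How was it tested?
```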

Handling Conflicts

When merge conflicts occur:
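
A minimal resolution sequence (file name illustrative):

```sh
git pull origin main        # Git reports a conflict in 03_analysis.R
# Open the file; Git marks both versions:
#   <<<<<<< HEAD
#   your lines
#   =======
#   their lines
#   >>>>>>> main
# Edit to the version you want, delete the markers, then:
git add 03_analysis.R
git commit                  # completes the merge
```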

Prevention:

  • Pull from main frequently

  • Communicate about who's working on what

  • Keep changes small and focused

  • Avoid reformatting code unnecessarily


26.4 Replication Packages

Journal Requirements

Most economics journals now require replication packages:

AEA journals:

  • All data and code must be deposited

  • Must include README with clear instructions

  • Must run "out of the box" where possible

  • Confidential data: provide code and describe access

Other journals:

  • Similar requirements increasingly common

  • Check specific journal policies

  • Some accept links to repositories (GitHub, Dataverse)

Creating a Replication Package

Step 1: Verify everything runs
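
One way to check, assuming a run_all.R entry point: delete all generated outputs and rebuild from scratch.

```sh
rm -rf output/        # remove stale results
Rscript run_all.R     # everything should regenerate without errors
```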

Step 2: Minimize and clean

  • Remove unnecessary files

  • Delete intermediate data if reproducible

  • Remove personal paths

  • Check for sensitive information

Step 3: Write comprehensive README

  • Include all steps needed to reproduce

  • Document software versions

  • Estimate runtime and resources

  • Note any known issues

Step 4: Test on fresh machine

  • Ask colleague to run package

  • Or use Docker/container

  • Fix any issues discovered

README Template for Replication
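
A sketch of one reasonable structure, loosely following what economics data editors ask for:

```markdown
# Replication package for "[Paper title]"

## Overview
What the package reproduces and how the pieces fit together.

## Data availability
Sources and access conditions; note any data that cannot be included.

## Software requirements
R version, packages (see renv.lock), and any external tools.

## Instructions
1. Restore packages: renv::restore()
2. Run: Rscript run_all.R (approx. 2 hours, 8GB RAM)

## List of tables and figures
Which script produces each exhibit (e.g., Table 1: 04_tables.R).
```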

Package Structure
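
For example (names illustrative):

```
replication_package/
├── README.md
├── LICENSE
├── run_all.R
├── 01_download_data.R
├── 02_clean_data.R
├── 03_analysis.R
├── 04_tables.R
├── data/
│   └── raw/          # or access instructions for restricted data
└── output/           # empty at distribution; filled by the scripts
```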

Archiving

Repositories:

  • AEA Data and Code Repository (for AEA journals)

  • ICPSR

  • Harvard Dataverse

  • Zenodo

  • GitHub (less permanent)

Best practices:

  • Get DOI for citability

  • Use open license (MIT, CC-BY)

  • Include version number

  • Update if errors found


Summary

Key takeaways:

  1. Project structure should be self-documenting: clear naming, numbered scripts, separation of data/code/output.

  2. Documentation exists at multiple levels: README for project overview, codebooks for data, comments for code decisions, analysis notes for research choices.

  3. Git enables collaboration through branching, pull requests, and code review; write informative commit messages.

  4. Replication packages should run "out of the box": verify on a fresh machine, document software versions and runtime, and archive with a DOI.

Returning to the opening question: Organizing projects for reproducibility requires upfront investment but pays dividends. Your future self, returning to a project after months away, will benefit as much as external replicators. The key is building good practices into your workflow from the start, not trying to impose them at the end.


Further Reading

Essential

  • Christensen, G., J. Freese, and E. Miguel (2019). "Transparent and Reproducible Social Science Research." UC Press.

  • Gentzkow, M. and J. Shapiro (2014). "Code and Data for the Social Sciences: A Practitioner's Guide."

Practical Guides

  • AEA Data and Code Availability Policy: https://www.aeaweb.org/journals/data

  • "The Turing Way": https://the-turing-way.netlify.app/


Exercises

Conceptual

  1. A researcher argues: "I share my code on request, which is sufficient for transparency." Another argues: "Code must be publicly posted with the paper for true reproducibility." What are the tradeoffs between these positions? When might each be appropriate?

  2. Explain the difference between replicability (same data, same code, same results) and reproducibility (different data, same methods, consistent findings). Why do both matter for scientific credibility?

  3. What are the risks of putting API keys, passwords, or personally identifiable information in a Git repository? How should sensitive data be handled in a replication package?

Applied

  1. Take an existing research project (yours or a provided example) and:

    • Reorganize it following the recommended structure

    • Write a comprehensive README

    • Create a codebook for the main dataset

    • Verify it runs from scratch

  2. Set up a collaborative workflow with a colleague:

    • Create a shared repository

    • Practice the branch-PR-review-merge workflow

    • Resolve an intentionally created merge conflict

  3. Create a minimal replication package for a simple analysis:

    • Include all necessary components

    • Have someone else attempt to run it

    • Fix any issues they encounter

Discussion

  1. Many journals now require data and code availability. However, some data (administrative, proprietary, confidential) cannot be shared. How should researchers balance reproducibility requirements with data access constraints? What are best practices for "reproducibility under constraints"?

  2. Some argue that detailed documentation and replication packages slow down research by adding overhead. Others argue they save time in the long run by preventing errors and enabling future work. Based on your experience, when is the investment in reproducibility infrastructure worth it?
