Chapter 26: Programming Companion—Project Management
Opening Question
How do we organize research projects so they can be understood, reproduced, and extended by others---including our future selves?
Chapter Overview
This chapter addresses the practical infrastructure of empirical research projects. While earlier programming chapters covered specific methods, this chapter focuses on the organizational practices that make research reproducible and collaborative: project structure, documentation, version control workflows, and creating replication packages.
These skills are increasingly required. Major economics journals now mandate data and code availability. Funders expect reproducibility. Collaborators need to understand your work. And you will thank yourself when you return to a project after months away and find clear documentation instead of cryptic files.
What you will learn:
How to structure projects for clarity and reproducibility
How to write effective documentation at multiple levels
How to use Git for collaboration and code review
How to create replication packages that actually work
Prerequisites: Chapter 4 (basic programming workflow)
26.1 Project Structure
Principles
A well-organized project should be:
Navigable: New users can find things quickly
Self-documenting: Structure reveals purpose
Reproducible: Clear path from raw data to results
Modular: Components can be understood independently
Recommended Structure
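A sketch of one common layout; the directory and file names are illustrative and should be adapted to your project:

```
project/
├── README.md              # what the project is and how to run it
├── run_all.R              # master script (or Makefile / _targets.R)
├── data/
│   ├── raw/               # original data, never modified
│   └── processed/         # cleaned data produced by the scripts
├── code/
│   ├── 01_download_data.R
│   ├── 02_clean_data.R
│   ├── 03_analysis.R
│   └── 04_tables.R
├── output/
│   ├── tables/
│   └── figures/
└── docs/                  # codebook, analysis notes, known issues
```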
Naming Conventions
Files:
Use lowercase with underscores: clean_survey_data.R
Number sequential scripts: 01_, 02_, etc.
Be descriptive but concise: analyze_wages.R, not a.R
Date outputs if needed: results_2025-01-15.csv
Variables:
Use consistent style (snake_case or camelCase)
Be descriptive: log_hourly_wage, not y
Avoid reserved words and special characters
Document abbreviations in codebook
Directories:
Match structure to workflow stages
Use consistent naming across projects
Don't nest too deeply (3-4 levels max)
Running Your Project: From Simple to Sophisticated
Level 1: Master Script (Simple Projects)
For small projects, a linear script works fine:
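A minimal sketch of such a master script in R; the script names are illustrative and follow the numbering convention above:

```r
# run_all.R: run the full pipeline from raw data to final output, in order.
source("code/01_download_data.R")  # download public data into data/raw/
source("code/02_clean_data.R")     # clean and merge into data/processed/
source("code/03_analysis.R")       # estimate models, save results
source("code/04_tables.R")         # format results into output/tables/
```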
Problem: If you change only 04_tables.R, you still re-run everything. For large datasets or slow models, this wastes hours.
Level 2: Dependency-Based Pipelines (Complex Projects)
For projects with expensive computations, use dependency management. The system tracks which outputs depend on which inputs and only rebuilds what changed.
R with targets:
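A sketch of a _targets.R file; the helper functions clean_data(), run_models(), and make_table() are placeholders you would define yourself:

```r
# _targets.R: declare the pipeline; run it with targets::tar_make()
library(targets)
source("code/functions.R")  # defines clean_data(), run_models(), make_table() (placeholders)

tar_option_set(packages = c("dplyr", "fixest"))

list(
  tar_target(raw_file, "data/raw/survey.csv", format = "file"),  # track the raw file itself
  tar_target(panel, clean_data(raw_file)),                       # rebuilt only if raw_file changes
  tar_target(models, run_models(panel)),
  tar_target(table1, make_table(models, "output/tables/table1.tex"), format = "file")
)
```

Running targets::tar_make() rebuilds only the targets whose upstream inputs have changed.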
With Make (language-agnostic):
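An equivalent sketch as a Makefile; the paths are illustrative, and recipe lines must be indented with a tab:

```make
# Each rule names an output, the inputs it depends on, and the command that rebuilds it.
all: output/tables/table1.tex

data/processed/panel.csv: code/02_clean_data.R data/raw/survey.csv
	Rscript code/02_clean_data.R

output/tables/table1.tex: code/03_analysis.R code/04_tables.R data/processed/panel.csv
	Rscript code/03_analysis.R
	Rscript code/04_tables.R
```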
Run with make all. Only outdated files rebuild.
Why this matters:
Change one figure's code → only that figure rebuilds
Team member pulls your changes → only affected outputs rebuild
Confident that outputs match current code (no stale files)
Environment Reproducibility
Your code ran in 2024 with specific package versions. Will it run in 2027?
R with renv:
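In outline, the renv workflow is:

```r
renv::init()      # one-time: create a project-local library and renv.lock
renv::snapshot()  # after installing or updating packages: record exact versions
renv::restore()   # on another machine, or years later: reinstall the recorded versions
```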
Python with conda or pip:
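The analogous shell commands for Python; the environment and file names are the conventional defaults:

```sh
# conda: export the full environment specification, recreate it elsewhere
conda env export > environment.yml
conda env create -f environment.yml

# pip: pin exact package versions, install them elsewhere
pip freeze > requirements.txt
pip install -r requirements.txt
```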
Docker (Gold Standard):
For complete reproducibility, containerize your entire environment:
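A Dockerfile sketch for an R project; the base image and packages are illustrative, and in practice you would restore packages from renv.lock rather than installing them ad hoc:

```dockerfile
# Build an image containing the OS, R version, packages, and project code
FROM rocker/r-ver:4.3.2

# Install required R packages (better: restore from renv.lock inside the image)
RUN R -e "install.packages(c('dplyr', 'fixest'))"

# Copy the project in and make the full pipeline the default command
COPY . /project
WORKDIR /project
CMD ["Rscript", "run_all.R"]
```

Build the image with docker build -t myproject . and reproduce the analysis with docker run myproject (the tag myproject is arbitrary).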
Why Docker matters:
"Works on my machine" → works everywhere
Captures OS, libraries, package versions—everything
Journals increasingly accept/require Docker for replication
Run 2024 analysis in 2027 exactly as it was
| Approach      | Effort | Reproducibility     | Best for                 |
| ------------- | ------ | ------------------- | ------------------------ |
| Master script | Low    | Basic               | Small/simple projects    |
| renv/conda    | Medium | Package versions    | Most projects            |
| Make/targets  | Medium | Build dependencies  | Complex pipelines        |
| Docker        | High   | Complete            | Publications, long-term  |
26.2 Documentation
README Files
Every project needs a main README.md:
Python Packages
Replication Instructions
Quick Start
Clone this repository
Download data following instructions in data/raw/README.md
Run Rscript run_all.R
Step by Step
01_download_data.R - Downloads public data files
02_clean_data.R - Cleans and merges data
03_analysis.R - Main regression analysis
... etc.
Output
Tables are in output/tables/
Figures are in output/figures/
Table X corresponds to table_X.tex
Figure Y corresponds to figure_Y.pdf
Notes
Runtime: Approximately 2 hours on a standard laptop
Memory: Requires at least 8GB RAM for the data cleaning step
Known issues: See docs/known_issues.md
License
[License type] - see LICENSE file
Code Comments
Comment code to explain why, not what:
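For example (a hypothetical snippet):

```r
# What-comment (redundant): restates the code
# Drop rows with missing wage
df <- df[!is.na(df$wage), ]

# Why-comment (useful): records the reasoning behind the decision
# Drop observations with missing wages: these respondents skipped the earnings
# module, and imputation changed the estimates in robustness checks
# (see docs/analysis_notes.md, 2025-01-10).
df <- df[!is.na(df$wage), ]
```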
Analysis Notes
Document major decisions in a separate file:
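A sketch of such a file, say docs/analysis_notes.md, with illustrative entries:

```
2025-01-10  Sample restriction
  Dropped workers under 18 and over 65 to match the comparison sample.
  Using 16-70 instead changes point estimates by under 2% (output/robustness/).

2025-01-22  Standard errors
  Switched from heteroskedasticity-robust to state-clustered standard errors
  because treatment varies at the state level.
```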
26.3 Collaboration with Git
Branch Workflow
For collaborative projects:
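A typical cycle, sketched as shell commands; the branch and file names are illustrative:

```sh
git checkout main && git pull           # start from up-to-date main
git checkout -b add-wage-regressions    # work on a feature branch
# ...edit, run, and check the code...
git add code/03_analysis.R
git commit -m "Add wage regressions with state fixed effects"
git push -u origin add-wage-regressions
# Open a pull request, request review, and merge into main after approval
```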
Commit Messages
Write informative commit messages:
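For example (hypothetical messages):

```sh
# Vague: says nothing about what changed or why
git commit -m "fixes"

# Informative: summarizes the change and the reason for it
git commit -m "Fix wage top-coding: apply 1.5x cap to match survey documentation"
```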
Code Review
When reviewing collaborators' code:
Reviewer checklist:
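One possible checklist, to adapt to your project:
Does the code run from a fresh clone without manual steps?
Are paths relative to the project root (no hard-coded personal directories)?
Are variable and function names clear, and non-obvious choices commented?
Were affected tables and figures regenerated so outputs match the code?
Is anything sensitive (keys, credentials, personal data) kept out of the repository?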
Pull request template:
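A minimal template, stored for example as .github/PULL_REQUEST_TEMPLATE.md so it pre-fills every pull request; the section names are suggestions:

```
## Summary
What changed and which scripts/outputs it affects

## Motivation
Issue, referee comment, or analysis note this responds to

## How to check
Which script(s) to run and which tables/figures to compare

## Checklist
- [ ] Runs from a fresh clone
- [ ] Affected outputs regenerated and committed
```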
Handling Conflicts
When merge conflicts occur:
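The basic resolution steps, as shell commands; the conflicted file name is illustrative:

```sh
git pull origin main          # bring in the latest main; Git marks any conflicts
# Open each conflicted file and resolve the blocks between the
# <<<<<<< and >>>>>>> markers, keeping the correct version of the code
git add code/03_analysis.R    # mark the file as resolved
git commit                    # complete the merge
```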
Prevention:
Pull from main frequently
Communicate about who's working on what
Keep changes small and focused
Avoid reformatting code unnecessarily
26.4 Replication Packages
Journal Requirements
Most economics journals now require replication packages:
AEA journals:
All data and code must be deposited
Must include README with clear instructions
Must run "out of the box" where possible
Confidential data: provide code and describe access
Other journals:
Similar requirements increasingly common
Check specific journal policies
Some accept links to repositories (GitHub, Dataverse)
Creating a Replication Package
Step 1: Verify everything runs (see the sketch after this list)
Step 2: Minimize and clean
Remove unnecessary files
Delete intermediate data if reproducible
Remove personal paths
Check for sensitive information
Step 3: Write comprehensive README
Include all steps needed to reproduce
Document software versions
Estimate runtime and resources
Note any known issues
Step 4: Test on fresh machine
Ask colleague to run package
Or use Docker/container
Fix any issues discovered
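For steps 1 and 4, a fresh-machine check can be as simple as the following sketch; the repository URL is a placeholder, and running inside a Docker container serves the same purpose:

```sh
# Clone into a clean directory so nothing from your working copy leaks in
git clone https://github.com/username/project-replication.git   # placeholder URL
cd project-replication

# Restore recorded package versions, then run the full pipeline
Rscript -e "renv::restore()"
Rscript run_all.R

# Compare the regenerated tables and figures against the versions in the paper
```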
README Template for Replication
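A condensed sketch of such a README; adapt the sections to the journal's requirements (the AEA Data Editor publishes a fuller template you can start from):

```
# Replication package for "Paper Title"

## Data availability
- data/raw/survey.csv: public; download instructions in data/raw/README.md
- Administrative records: confidential; access procedure described below, cleaning code provided

## Software requirements
- R 4.3.2 with packages recorded in renv.lock (restore with renv::restore())

## Instructions
1. Rscript -e "renv::restore()"
2. Rscript run_all.R   (approx. 2 hours, 8GB RAM)

## Mapping of outputs to the paper
- output/tables/table_X.tex   -> Table X
- output/figures/figure_Y.pdf -> Figure Y
```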
Package Structure
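A layout a reader might expect after unzipping the package (illustrative):

```
replication-package/
├── README.md            # the replication README described above
├── LICENSE
├── run_all.R
├── renv.lock            # or environment.yml / Dockerfile
├── data/
│   └── raw/             # public data, or a README describing how to obtain restricted data
├── code/
│   ├── 01_download_data.R
│   ├── 02_clean_data.R
│   ├── 03_analysis.R
│   └── 04_tables.R
└── output/              # regenerated in full when run_all.R finishes
    ├── tables/
    └── figures/
```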
Archiving
Repositories:
AEA Data and Code Repository (for AEA journals)
ICPSR
Harvard Dataverse
Zenodo
GitHub (less permanent)
Best practices:
Get DOI for citability
Use open license (MIT, CC-BY)
Include version number
Update if errors found
Summary
Key takeaways:
Project structure should be self-documenting: clear naming, numbered scripts, separation of data/code/output.
Documentation exists at multiple levels: README for project overview, codebooks for data, comments for code decisions, analysis notes for research choices.
Git enables collaboration through branching, pull requests, and code review; write informative commit messages.
Replication packages should run "out of the box": verify them on a fresh machine, document data access, and archive them with a DOI.
Returning to the opening question: Organizing projects for reproducibility requires upfront investment but pays dividends. Your future self---returning to a project after months---will benefit as much as external replicators. The key is building good practices into your workflow from the start, not trying to impose them at the end.
Further Reading
Essential
Christensen, G., J. Freese, and E. Miguel (2019). "Transparent and Reproducible Social Science Research." UC Press.
Gentzkow, M. and J. Shapiro (2014). "Code and Data for the Social Sciences: A Practitioner's Guide."
Practical Guides
AEA Data and Code Availability Policy: https://www.aeaweb.org/journals/data
"The Turing Way": https://the-turing-way.netlify.app/
Exercises
Conceptual
A researcher argues: "I share my code on request, which is sufficient for transparency." Another argues: "Code must be publicly posted with the paper for true reproducibility." What are the tradeoffs between these positions? When might each be appropriate?
Explain the difference between reproducibility (same data, same code, same results) and replicability (new data, same methods, consistent findings). Why do both matter for scientific credibility?
What are the risks of putting API keys, passwords, or personally identifiable information in a Git repository? How should sensitive data be handled in a replication package?
Applied
Take an existing research project (yours or a provided example) and:
Reorganize it following the recommended structure
Write a comprehensive README
Create a codebook for the main dataset
Verify it runs from scratch
Set up a collaborative workflow with a colleague:
Create a shared repository
Practice the branch-PR-review-merge workflow
Resolve an intentionally created merge conflict
Create a minimal replication package for a simple analysis:
Include all necessary components
Have someone else attempt to run it
Fix any issues they encounter
Discussion
Many journals now require data and code availability. However, some data (administrative, proprietary, confidential) cannot be shared. How should researchers balance reproducibility requirements with data access constraints? What are best practices for "reproducibility under constraints"?
Some argue that detailed documentation and replication packages slow down research by adding overhead. Others argue they save time in the long run by preventing errors and enabling future work. Based on your experience, when is the investment in reproducibility infrastructure worth it?