Chapter 4: Programming Companion—Foundations

Opening Question

How do we set up a computational environment that makes empirical research reproducible, collaborative, and efficient?


Chapter Overview

This chapter introduces the computational foundations for empirical research. Modern empirical work requires not just statistical knowledge but also practical skills in organizing projects, writing code, managing data, and collaborating with others. These skills are rarely taught formally but make the difference between research that can be reproduced, extended, and trusted, and research that exists only on one person's laptop.

We focus on two languages---R and Python---because they dominate modern empirical social science. Both are free, open-source, and have rich ecosystems for data analysis. Rather than advocate for one, we show how to use each effectively and when each has advantages.

What you will learn:

  • How to organize research projects for reproducibility

  • Core R and Python skills for data analysis

  • Version control with Git for tracking changes and collaboration

  • Data management principles that prevent common disasters

Prerequisites: Basic familiarity with any programming language is helpful but not required


4.1 Reproducible Workflow

Why Reproducibility Matters

A reproducible project is one where another researcher---or your future self---can understand what was done and verify the results. This is both a scientific ideal and a practical necessity.

Scientific reasons:

  • Verification: Others can check your work

  • Extension: Others can build on your work

  • Credibility: Transparent work is more trustworthy

Practical reasons:

  • Your future self will forget what you did

  • Collaborators need to understand your work

  • Journals increasingly require replication materials

  • Errors are easier to find and fix

Project Organization

A well-organized project follows predictable conventions:
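One common layout is sketched below; the folder names are a convention, not a requirement, and 01_clean.R and 02_analyze.R stand in for your own scripts.

    project/
    ├── data/
    │   ├── raw/           # original files, never modified
    │   └── processed/     # files created by code
    ├── code/
    │   ├── 01_clean.R
    │   └── 02_analyze.R
    ├── output/
    │   ├── figures/
    │   └── tables/
    ├── README.md
    └── .gitignore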

Key principles:

  1. Numbered scripts: Files run in order. 01_clean.R runs before 02_analyze.R. This makes dependencies explicit.

  2. Separation of raw and processed data: Raw data is sacred---never modify it. All transformations happen in code.

  3. Clear inputs and outputs: Each script should have well-defined inputs (what it reads) and outputs (what it creates).

  4. Self-documenting structure: A new collaborator should understand the project from the folder names alone.

The README File

Every project needs a README.md explaining:

  • What the project is: the research question and the data it uses

  • Authors: name and email for each contributor

  • How to reproduce the results: which scripts to run, in what order

  • Software requirements: how to restore the environment (R: an renv lockfile; Python: requirements.txt or a conda environment file)
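A short sketch of such a README; the names and details are placeholders.

    # Project title and one-line description

    ## Authors
    - Jane Doe (jane.doe@example.edu)   (placeholder)

    ## How to reproduce
    1. Restore the environment: `renv::restore()` (R) or `pip install -r requirements.txt` (Python)
    2. Run the scripts in code/ in numbered order (01_clean, then 02_analyze)

    ## Data
    Describe raw data sources, access restrictions, and where files should be placed.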


4.2 R and Python for Empirical Research

Choosing Between R and Python

Both languages can do everything you need. The choice often depends on context:

| Consideration | R | Python |
|---|---|---|
| Econometrics packages | Excellent (fixest, plm, rdrobust) | Good (statsmodels, linearmodels) |
| Data manipulation | Excellent (tidyverse) | Excellent (pandas) |
| Visualization | Excellent (ggplot2) | Good (matplotlib, seaborn) |
| Machine learning | Good (tidymodels, caret) | Excellent (scikit-learn) |
| General programming | Adequate | Excellent |
| Industry jobs | Less common | Very common |
| Academic economics | Dominant | Growing |

Recommendation: Learn both, but invest more deeply in one. Use R if you're focused on academic economics; Python if you want broader applicability or ML emphasis.

R Essentials

The tidyverse

The tidyverse is a collection of packages sharing consistent design philosophy:
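The sketch below shows the style, assuming a hypothetical wages.csv with wage and education columns.

    library(tidyverse)

    wages <- read_csv("data/processed/wages.csv")   # hypothetical file

    wages %>%
      filter(!is.na(wage), wage > 0) %>%            # keep valid observations
      mutate(log_wage = log(wage)) %>%              # create the outcome variable
      group_by(education) %>%                       # one group per education level
      summarise(mean_log_wage = mean(log_wage),
                n = n())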

Regression with fixest

Why fixest?

fixest has become the standard for regression in applied economics research. It handles high-dimensional fixed effects efficiently (crucial for large panel datasets), supports multiway clustering out of the box, and produces publication-ready tables with etable(). For most empirical work, fixest should replace both lm() and older packages like lfe or plm.

In practice, a fixest call specifies the regressors in the formula, the fixed effects after a vertical bar, and the clustering variable through the cluster argument.
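A minimal sketch with hypothetical wage data; the variable names are illustrative.

    library(fixest)
    library(readr)

    wages <- read_csv("data/processed/wages.csv")   # hypothetical file

    # Log wage on education and experience with state and year fixed effects;
    # standard errors clustered by state
    m <- feols(log(wage) ~ education + experience | state + year,
               data = wages, cluster = ~state)

    summary(m)
    etable(m)    # publication-style regression table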

Visualization with ggplot2
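A sketch of the kind of code behind panel (a) of Figure 4.1, using the same hypothetical wages data.

    library(tidyverse)

    wages <- read_csv("data/processed/wages.csv")   # hypothetical file

    ggplot(wages, aes(x = education, y = log(wage))) +
      geom_point(alpha = 0.3) +                     # raw observations
      geom_smooth(method = "lm", se = TRUE) +       # fitted line with confidence band
      labs(x = "Years of education", y = "Log wage") +
      theme_minimal()

    ggsave("output/figures/educ_wage.png", width = 6, height = 4)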

Figure 4.1: Example visualization outputs. (a) Scatter plot with regression line and confidence band showing the relationship between education and log wages. (b) Density plot comparing wage distributions by gender. These examples illustrate the clean, publication-ready graphics that ggplot2 (R) and seaborn (Python) produce with minimal code.

Python Essentials

pandas for data manipulation
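A minimal pandas sketch mirroring the tidyverse example above; wages.csv is hypothetical.

    import numpy as np
    import pandas as pd

    wages = pd.read_csv("data/processed/wages.csv")    # hypothetical file

    summary = (
        wages
        .query("wage > 0")                             # keep valid observations
        .assign(log_wage=lambda d: np.log(d["wage"]))  # create the outcome variable
        .groupby("education", as_index=False)
        .agg(mean_log_wage=("log_wage", "mean"),
             n=("log_wage", "size"))
    )
    print(summary)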

Regression with statsmodels and linearmodels
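A sketch of the two main options, using the same hypothetical data; the panel example assumes person-year data indexed by person_id and year.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf
    from linearmodels.panel import PanelOLS

    wages = pd.read_csv("data/processed/wages.csv")     # hypothetical file
    wages["log_wage"] = np.log(wages["wage"])

    # OLS with state dummies and standard errors clustered by state (statsmodels)
    ols = smf.ols("log_wage ~ education + experience + C(state)", data=wages).fit(
        cov_type="cluster", cov_kwds={"groups": wages["state"]}
    )
    print(ols.summary())

    # Person-year panel with individual and year fixed effects (linearmodels)
    panel = wages.set_index(["person_id", "year"])
    fe = PanelOLS.from_formula(
        "log_wage ~ experience + EntityEffects + TimeEffects", data=panel
    ).fit(cov_type="clustered", cluster_entity=True)
    print(fe)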

Visualization with matplotlib and seaborn
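A sketch producing plots like those in Figure 4.1, again with the hypothetical wages data.

    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd
    import seaborn as sns

    wages = pd.read_csv("data/processed/wages.csv")     # hypothetical file
    wages["log_wage"] = np.log(wages["wage"])

    # (a) Scatter with fitted regression line and confidence band
    sns.regplot(data=wages, x="education", y="log_wage", scatter_kws={"alpha": 0.3})
    plt.xlabel("Years of education")
    plt.ylabel("Log wage")
    plt.tight_layout()
    plt.savefig("output/figures/educ_wage.png", dpi=300)
    plt.close()

    # (b) Density of log wages by gender
    sns.kdeplot(data=wages, x="log_wage", hue="gender", fill=True)
    plt.tight_layout()
    plt.savefig("output/figures/wage_density.png", dpi=300)
    plt.close()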

R ggplot2 vs. Python matplotlib/seaborn

Both ecosystems produce publication-quality graphics, but the philosophies differ. ggplot2 uses a "grammar of graphics"—you build plots by adding layers. matplotlib is more imperative—you issue commands to modify a figure object. seaborn adds statistical visualization on top of matplotlib with a cleaner API. For most plots, ggplot2 requires less code; for highly customized figures, matplotlib offers more control.

Data Types: Factors vs. Strings

A common source of confusion (and bugs) is the distinction between categorical data types and strings. Getting this right matters for modeling, efficiency, and correctness.

R: Factors vs. Characters
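A small sketch of the distinction in base R.

    # Character vector: just text
    edu_chr <- c("HS", "BA", "BA", "PhD")
    class(edu_chr)        # "character"

    # Factor: categories with defined levels
    edu_fct <- factor(edu_chr, levels = c("HS", "BA", "PhD"))
    levels(edu_fct)       # "HS" "BA" "PhD"
    table(edu_fct)        # counts by category

    # Ordered factor: levels have a natural ranking
    edu_ord <- factor(edu_chr, levels = c("HS", "BA", "PhD"), ordered = TRUE)
    edu_ord < "PhD"       # comparisons are allowed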

Python: Category vs. Object (string)
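The equivalent sketch in pandas.

    import pandas as pd

    df = pd.DataFrame({"education": ["HS", "BA", "BA", "PhD"]})
    print(df["education"].dtype)            # object: plain strings

    # Convert to a categorical column
    df["education"] = df["education"].astype("category")
    print(df["education"].cat.categories)   # the distinct categories

    # Ordered categories
    df["education"] = df["education"].cat.reorder_categories(
        ["HS", "BA", "PhD"], ordered=True
    )
    print(df["education"] < "PhD")          # ordered comparisons are allowed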

When to Use Which

| Data Type | Use When | Example |
|---|---|---|
| String/Character | Free-form text, unique identifiers | Names, addresses, IDs |
| Factor/Category | Repeated categories, regression | State, education level, industry |
| Ordered Factor | Categories with natural ordering | Education (HS < BA < PhD), ratings |

Common Pitfalls

For Regression Models: Always be explicit about categorical variables. In R, use factor(). In Python with statsmodels, use C() in the formula: smf.ols('y ~ C(state)', data=df). In scikit-learn, use OneHotEncoder or pd.get_dummies().
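For example, a minimal Python sketch; df and the variable names are hypothetical.

    import pandas as pd
    import statsmodels.formula.api as smf

    # statsmodels: C() creates dummy variables for state inside the formula
    model = smf.ols("log_wage ~ education + C(state)", data=df).fit()

    # scikit-learn workflows: create dummies explicitly before fitting
    X = pd.get_dummies(df[["education", "state"]], columns=["state"], drop_first=True)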

Side-by-Side Comparison

Here's the same analysis in both languages:

Task: Load data, compute returns to education, make a table

R version:
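Both versions below are sketches assuming a hypothetical data/raw/wages.csv with wage, education, experience, state, and year columns.

    library(tidyverse)
    library(fixest)

    wages <- read_csv("data/raw/wages.csv") %>%          # hypothetical file
      filter(wage > 0) %>%
      mutate(log_wage = log(wage))

    m <- feols(log_wage ~ education + experience | state + year,
               data = wages, cluster = ~state)

    etable(m, file = "output/tables/returns_to_education.tex")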

Python version:
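The same steps in Python, with the same hypothetical file.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    wages = pd.read_csv("data/raw/wages.csv")            # hypothetical file
    wages = wages[wages["wage"] > 0].assign(log_wage=lambda d: np.log(d["wage"]))

    model = smf.ols("log_wage ~ education + experience + C(state) + C(year)",
                    data=wages).fit(cov_type="cluster",
                                    cov_kwds={"groups": wages["state"]})

    with open("output/tables/returns_to_education.txt", "w") as f:
        f.write(model.summary().as_text())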

Debugging and Troubleshooting

Every programmer spends significant time debugging. Developing good debugging habits early saves countless hours.

Reading error messages

Error messages seem cryptic at first but contain valuable information:
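For example, R reports a reference to an object that does not exist like this:

    Error: object 'analysis_data' not found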

This tells you: the object analysis_data doesn't exist in your environment. Either you haven't run the code that creates it, or you spelled it differently.
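A misspelled column name produces a different kind of error; in pandas, for instance:

    KeyError: 'eduction'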

This tells you: you're trying to access a column called eduction that doesn't exist—likely a typo for education.

Common errors and solutions:

| Error Type | Likely Cause | Solution |
|---|---|---|
| Object not found | Didn't run earlier code | Run scripts in order |
| File not found | Wrong path or filename | Check working directory, spelling |
| Type error | Wrong data type | Check variable types, convert if needed |
| NA/NaN in results | Missing data in computation | Handle missing values explicitly |
| Dimension mismatch | Vectors/matrices wrong size | Check data shapes before operations |

Debugging strategies:

  1. Isolate the problem: Comment out code until error disappears; the last uncommented line is the culprit

  2. Print intermediate values: Insert print() statements to see what's happening

  3. Check data at each step: Look at head(), str(), or shape after transformations

  4. Rubber duck debugging: Explain your code line-by-line to an imaginary listener (or rubber duck)

  5. Search the error message: Copy the exact error into Google—someone has encountered it before

R debugging tools:
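A short cheat sheet of built-in R tools; my_function is a placeholder for one of your own functions.

    traceback()                 # after an error: show the call stack where it occurred
    options(error = recover)    # drop into an interactive browser whenever an error occurs

    # Inside your own functions:
    # browser()                 # pause here and inspect objects interactively
    # debug(my_function)        # step through my_function line by line; undebug() to stop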

Python debugging tools:
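The Python equivalents, shown as a commented cheat sheet.

    import pdb

    # breakpoint()              # pause here and open the interactive debugger (Python 3.7+)
    # pdb.pm()                  # post-mortem: inspect state right after an exception

    # In IPython / Jupyter:
    # %debug                    # post-mortem debugger for the last exception
    # %pdb on                   # enter the debugger automatically on errors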

Defensive Programming: Fail Early, Fail Loud

Debugging finds errors after they occur. Defensive programming prevents them—or catches them immediately when assumptions are violated.

Principle: Code should check its own assumptions. When something unexpected happens, fail immediately with a clear message rather than silently producing wrong results.

Why this matters for empirical research: A silent error in row 50,000 of your data processing can propagate through your entire analysis, producing results that look reasonable but are completely wrong. You might not discover the problem until after publication.

Assertions: Check conditions that must be true
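In R, stopifnot() halts with an error when a condition fails. A minimal sketch, where wages and merged are hypothetical data frames:

    # Merges should not create or drop rows
    stopifnot(nrow(merged) == nrow(wages))

    # Key variables must be complete and in range
    stopifnot(!any(is.na(wages$wage)), all(wages$wage > 0))

    # Identifiers must be unique
    stopifnot(!any(duplicated(wages$person_id)))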

Common assertions for empirical work:

| What to Check | Why It Matters |
|---|---|
| Row counts after merges | Catch unintended duplicates or drops |
| No missing values in key variables | Prevent silent NA propagation |
| Values in expected range | Catch data entry errors or coding mistakes |
| Unique identifiers are unique | Prevent incorrect aggregation |
| Panel is balanced (if expected) | Catch missing observations |

Example: Defensive data pipeline
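A Python sketch of what such a pipeline might look like; the file names and columns are hypothetical.

    import pandas as pd

    raw = pd.read_csv("data/raw/survey.csv")              # hypothetical file
    state_info = pd.read_csv("data/raw/state_info.csv")   # hypothetical lookup table

    # 1. Check the input before doing anything else
    assert raw["person_id"].is_unique, "Duplicate person_id in raw data"
    assert raw["wage"].notna().all(), "Missing wages in raw data"

    # 2. Merge and immediately verify the result
    merged = raw.merge(state_info, on="state", how="left", validate="many_to_one")
    assert len(merged) == len(raw), "Merge changed the number of rows"
    assert merged["region"].notna().all(), "Some states did not match"

    # 3. Check ranges before saving
    assert merged["age"].between(16, 99).all(), "Implausible ages"

    merged.to_parquet("data/processed/survey_clean.parquet")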

The 10x Rule: Time spent adding assertions is repaid tenfold in debugging time saved. Every hour of defensive coding prevents ten hours of hunting for mysterious bugs.

Working with APIs and External Data

Modern empirical research often requires pulling data from APIs (Application Programming Interfaces). Here's how to access common economic data sources:

R: Using FRED data
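One option is the fredr package; the sketch below requires a free FRED API key.

    library(fredr)

    fredr_set_key("YOUR_FRED_API_KEY")            # placeholder: free key from the FRED website

    unrate <- fredr(
      series_id = "UNRATE",                       # civilian unemployment rate
      observation_start = as.Date("2000-01-01")
    )
    head(unrate)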

Python: Using pandas-datareader
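A minimal pandas-datareader sketch pulling the same series.

    import pandas_datareader.data as web

    # Monthly unemployment rate from FRED
    unrate = web.DataReader("UNRATE", "fred", start="2000-01-01")
    print(unrate.tail())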

Census and survey data:
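One common route in R is the tidycensus package; this is a sketch, it requires a free Census API key, and the variable code shown is illustrative.

    library(tidycensus)

    census_api_key("YOUR_CENSUS_API_KEY")         # placeholder

    median_income <- get_acs(
      geography = "county",
      variables = "B19013_001",                   # median household income
      year = 2022
    )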


4.3 Version Control with Git

Why Version Control?

Without version control:

  • analysis_final.R

  • analysis_final_v2.R

  • analysis_final_v2_fixed.R

  • analysis_FINAL_actually_final.R

With version control:

  • One file: analysis.R

  • Complete history of all changes

  • Can revert to any previous version

  • Know exactly what changed when and why

Git Fundamentals

Key concepts:

  • Repository (repo): A project tracked by Git

  • Commit: A snapshot of your project at a point in time

  • Branch: A parallel version of your project

  • Remote: A copy of your repo on a server (GitHub, GitLab)

GUI Alternatives to Command-Line Git

While this chapter teaches command-line Git (which is essential to understand), many researchers prefer visual interfaces for day-to-day work:

  • RStudio: Built-in Git pane shows changes, supports staging, committing, pushing, and pulling. Ideal for R users.

  • VS Code: Source Control panel with visual diffs, staging, and commit interface. Works for any language.

  • GitHub Desktop: Standalone app that simplifies cloning, branching, and syncing with GitHub.

  • GitKraken, Sourcetree: Full-featured Git GUIs with visualizations of branch history.

Recommendation: Learn the commands first (to understand what's happening), then use whichever interface fits your workflow. RStudio's Git integration is particularly well-suited for typical empirical research projects.

Basic workflow:
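The core cycle, as shell commands; file names and messages are illustrative.

    git init                          # start tracking a project (once)
    git status                        # what has changed?
    git add code/01_clean.R           # stage a change
    git commit -m "Add cleaning script for wage data"
    git log --oneline                 # history of commits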

The .gitignore file:

Critical: What NOT to Commit

Three categories of files should never enter version control: (1) Sensitive data—credentials, API keys, IRB-protected data. These can be accidentally leaked even after deletion. (2) Large binary files—raw datasets, images, PDFs. Git stores every version, so repositories bloat quickly. Use Git LFS or external storage instead. (3) Generated outputs—anything your code can recreate. Commit the code that makes the figure, not the figure itself.

Some files shouldn't be tracked---large data, sensitive information, generated outputs:
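A typical .gitignore for an empirical project might look like this; the patterns are illustrative.

    # Data: raw and processed files stay out of version control
    data/

    # Generated outputs
    output/figures/
    output/tables/

    # Credentials and environment files
    .env
    *.key

    # Language/editor artifacts
    .Rhistory
    .RData
    __pycache__/
    .ipynb_checkpoints/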

Collaboration Workflow

Working with branches:
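A sketch of the branch cycle; the branch name is illustrative.

    git branch robustness-checks      # create a branch
    git switch robustness-checks      # move to it (older Git: git checkout)
    # ...edit and commit as usual...
    git switch main
    git merge robustness-checks       # bring the changes back into main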

Working with remotes (GitHub):
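The basic remote commands; the URL is a placeholder.

    git remote add origin https://github.com/USER/project.git
    git push -u origin main           # upload your commits
    git pull                          # download collaborators' commits
    git clone https://github.com/USER/project.git   # copy an existing repo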

Handling conflicts:

When two people edit the same lines:
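Git marks the conflicting region in the file; the content between the markers below is illustrative. You edit the file to keep the correct version, then stage and commit.

    <<<<<<< HEAD
    se <- sqrt(diag(vcov_cluster))      # your version
    =======
    se <- sqrt(diag(vcov_robust))       # collaborator's version
    >>>>>>> main

    # after editing the file to resolve the conflict:
    git add code/02_analyze.R
    git commit -m "Resolve conflict in standard error calculation"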

Best Practices

  1. Commit often: Small, focused commits are easier to understand and revert

  2. Write good messages: "Fix bug in standard error calculation" not "updates"

  3. Don't commit sensitive data: Use .gitignore

  4. Don't commit large files: Use Git LFS or don't track data

  5. Pull before you push: Get collaborators' changes first


4.4 Data Management

The Cardinal Rule

Principle 4.1 (Raw Data Immutability): Never modify raw data files. Ever. All transformations happen in code, creating new files.

Why this matters:

  • You can always recreate processed data from raw + code

  • Errors in cleaning can be identified and fixed

  • The transformation process is documented

  • Replication is possible

Data Cleaning Workflow
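A sketch of a cleaning script in this spirit; the file and variable names are hypothetical.

    # 01_clean.R: read raw data, fix problems, save a processed version
    library(tidyverse)

    raw <- read_csv("data/raw/survey_2024.csv")          # raw file: never modified

    clean <- raw %>%
      rename(wage = hourly_wage) %>%                     # consistent names
      filter(!is.na(wage), wage > 0) %>%                 # drop invalid observations
      mutate(
        log_wage  = log(wage),
        education = factor(education, levels = c("HS", "BA", "PhD"))
      )

    stopifnot(!any(duplicated(clean$person_id)))         # sanity check before saving

    write_csv(clean, "data/processed/survey_2024_clean.csv")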

Codebooks and Documentation

Every dataset needs documentation:
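A codebook can be as simple as a text or markdown file with one entry per variable; the entries below are hypothetical.

    variable     type      description                             source
    person_id    integer   Unique person identifier                constructed
    wage         numeric   Hourly wage in 2024 USD                 survey question (hypothetical)
    education    factor    Highest degree: HS, BA, PhD             survey question (hypothetical)
    state        factor    State of residence (two-letter code)    administrative records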

File Formats

Recommended formats:

| Use Case | Format | Why |
|---|---|---|
| Working data | Parquet | Fast, compressed, preserves types |
| Sharing with others | CSV | Universal compatibility |
| Large datasets | Parquet or Feather | Much faster than CSV |
| Final archive | CSV + codebook | Long-term readability |

Reading and writing in R:
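A sketch with readr for CSV and the arrow package for Parquet; paths are illustrative.

    library(tidyverse)
    library(arrow)    # for Parquet/Feather

    df <- read_csv("data/raw/wages.csv")                    # CSV in
    write_csv(df, "data/processed/wages.csv")               # CSV out

    write_parquet(df, "data/processed/wages.parquet")       # fast, typed format
    df <- read_parquet("data/processed/wages.parquet")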

Reading and writing in Python:
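The pandas equivalents; Parquet support requires pyarrow or fastparquet.

    import pandas as pd

    df = pd.read_csv("data/raw/wages.csv")                  # CSV in
    df.to_csv("data/processed/wages.csv", index=False)      # CSV out

    df.to_parquet("data/processed/wages.parquet")           # fast, typed format
    df = pd.read_parquet("data/processed/wages.parquet")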

Handling Sensitive Data

Some data requires special handling:

Personally identifiable information (PII):

  • Names, addresses, SSNs, etc.

  • Store separately from analysis data

  • Never commit to Git

  • Consider encryption

Confidential data:

  • Data under restricted access agreements

  • Follow data provider's requirements

  • Document what can/cannot be shared

  • Create synthetic or redacted versions for replication


Practical Guidance

Setting Up a New Project

Step-by-step checklist:
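A sketch of the setup in shell commands; the project name is a placeholder, and R users would add renv::init() while Python users would create requirements.txt or an environment file.

    mkdir my-project && cd my-project
    mkdir -p data/raw data/processed code output/figures output/tables
    git init
    touch README.md .gitignore
    # then: write the README, add .gitignore patterns, and snapshot the environment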

Common Pitfalls

Pitfall 1: "I'll Clean Up Later." You start with messy organization, planning to fix it later. You won't.

How to avoid: Start organized. The 10 minutes of setup saves hours later.

Pitfall 2: Hardcoded Paths. Code like read_csv("C:/Users/John/Documents/research/data.csv") breaks on any other machine.

How to avoid: Use relative paths. Set working directory to project root.

Pitfall 3: Not Tracking Package Versions. Code that runs today may not run in six months with updated packages.

How to avoid: Use renv (R) or requirements.txt (Python). Snapshot your environment.

Pitfall 4: Overwriting Raw Data. Accidentally saving cleaned data over raw data.

How to avoid: Use different directories. Consider making raw data read-only.

Implementation Checklist

Every project should have:

  • A clear folder structure separating raw data, processed data, code, and output

  • A README.md describing the project and how to reproduce it

  • Numbered scripts with well-defined inputs and outputs

  • A Git repository with a sensible .gitignore

  • A snapshot of the software environment (renv lockfile or requirements.txt)

  • Raw data preserved untouched, with all transformations in code


Summary

Key takeaways:

  1. Reproducibility requires organization from the start: clear folder structures, numbered scripts, preserved raw data, and documented decisions.

  2. Both R and Python are capable tools for empirical research. R excels in econometrics packages and visualization; Python in general programming and machine learning. Learn both, master one.

  3. Version control with Git tracks changes, enables collaboration, and provides a safety net. Commit often, write clear messages, and use .gitignore appropriately.

  4. Data management starts with one rule: never modify raw data. Keep all transformations in code, document datasets with codebooks, and handle sensitive data according to its restrictions.

Returning to the opening question: A reproducible computational environment combines organized project structure, version control, environment management, and careful data handling. These practices require initial investment but pay dividends in reliability, collaboration, and your own peace of mind when returning to a project months later.


Further Reading

Essential

  • Gentzkow, M. and J. Shapiro (2014). "Code and Data for the Social Sciences: A Practitioner's Guide." [Online]

  • Wilson, G. et al. (2017). "Good Enough Practices in Scientific Computing." PLOS Computational Biology.

For Deeper Understanding

  • Wickham, H. and G. Grolemund (2017). "R for Data Science." O'Reilly. [Free online]

  • McKinney, W. (2022). "Python for Data Analysis." 3rd ed. O'Reilly.

Advanced/Specialized

  • Bryan, J. "Happy Git with R." [Online resource for Git + R]

  • The Turing Way Community. "The Turing Way: A Handbook for Reproducible Data Science."


Exercises

Conceptual

  1. Explain why raw data should never be modified directly. What problems can arise from editing raw data files?

  2. A colleague's project has files named analysis_v1.R, analysis_v2_final.R, and analysis_v2_final_ACTUAL.R. What would you recommend they do differently?

Applied

  1. Set up a new research project following the principles in this chapter:

    • Create the folder structure

    • Initialize Git

    • Create a .gitignore

    • Write a README

    • Add a sample data file and cleaning script

  2. Take an existing messy project (your own or a provided example) and reorganize it following best practices. Document what changes you made and why.

Discussion

  1. Some researchers argue that requiring code and data availability imposes unfair burdens. Others argue it's essential for science. What's your view, and how would you balance these concerns?
