Chapter 4: Programming Companion—Foundations
Opening Question
How do we set up a computational environment that makes empirical research reproducible, collaborative, and efficient?
Chapter Overview
This chapter introduces the computational foundations for empirical research. Modern empirical work requires not just statistical knowledge but also practical skills in organizing projects, writing code, managing data, and collaborating with others. These skills are rarely taught formally but make the difference between research that can be reproduced, extended, and trusted, and research that exists only on one person's laptop.
We focus on two languages---R and Python---because they dominate modern empirical social science. Both are free, open-source, and have rich ecosystems for data analysis. Rather than advocate for one, we show how to use each effectively and when each has advantages.
What you will learn:
How to organize research projects for reproducibility
Core R and Python skills for data analysis
Version control with Git for tracking changes and collaboration
Data management principles that prevent common disasters
Prerequisites: Basic familiarity with any programming language is helpful but not required
4.1 Reproducible Workflow
Why Reproducibility Matters
A reproducible project is one where another researcher---or your future self---can understand what was done and verify the results. This is both a scientific ideal and a practical necessity.
Scientific reasons:
Verification: Others can check your work
Extension: Others can build on your work
Credibility: Transparent work is more trustworthy
Practical reasons:
Your future self will forget what you did
Collaborators need to understand your work
Journals increasingly require replication materials
Errors are easier to find and fix
Project Organization
A well-organized project follows predictable conventions:
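As a concrete illustration, one common layout looks like this (names are illustrative, not prescriptive):

```
project/
├── README.md
├── data/
│   ├── raw/          # original files, never modified
│   └── processed/    # created by code
├── code/
│   ├── 01_clean.R
│   ├── 02_analyze.R
│   └── 03_figures.R
├── output/
│   ├── figures/
│   └── tables/
└── docs/             # codebook, notes
```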
Key principles:
Numbered scripts: Files run in order. 01_clean.R runs before 02_analyze.R. This makes dependencies explicit.
Separation of raw and processed data: Raw data is sacred---never modify it. All transformations happen in code.
Clear inputs and outputs: Each script should have well-defined inputs (what it reads) and outputs (what it creates).
Self-documenting structure: A new collaborator should understand the project from the folder names alone.
The README File
Every project needs a README.md explaining:
What the project is and which paper or report it supports
Authors: name (email)
How to run the code, and in what order
Software dependencies (R: an renv lockfile; Python: requirements.txt or a conda environment file)
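A minimal sketch of such a README (the project name, authors, and file names are hypothetical):

```markdown
# Returns to Education: Replication Package

Authors: Jane Doe (jdoe@university.edu), John Smith (jsmith@university.edu)

## How to run
1. Install dependencies: `renv::restore()` (R) or `pip install -r requirements.txt` (Python)
2. Place the raw data files in `data/raw/`
3. Run the scripts in order: `01_clean.R`, `02_analyze.R`, `03_figures.R`

## Data
See `docs/codebook.md` for variable definitions and data sources.
```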
4.2 R and Python for Empirical Research
Choosing Between R and Python
Both languages can do everything you need. The choice often depends on context:
| Task | R | Python |
|---|---|---|
| Econometrics packages | Excellent (fixest, plm, rdrobust) | Good (statsmodels, linearmodels) |
| Data manipulation | Excellent (tidyverse) | Excellent (pandas) |
| Visualization | Excellent (ggplot2) | Good (matplotlib, seaborn) |
| Machine learning | Good (tidymodels, caret) | Excellent (scikit-learn) |
| General programming | Adequate | Excellent |
| Industry jobs | Less common | Very common |
| Academic economics | Dominant | Growing |
Recommendation: Learn both, but invest more deeply in one. Use R if you're focused on academic economics; Python if you want broader applicability or ML emphasis.
R Essentials
The tidyverse
The tidyverse is a collection of packages sharing consistent design philosophy:
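A minimal sketch of a typical tidyverse pipeline (the file and variable names are hypothetical):

```r
library(tidyverse)

# Read raw data, keep prime-age workers, and summarize wages by education
wages <- read_csv("data/raw/wages.csv") %>%
  filter(age >= 25, age <= 54) %>%
  mutate(log_wage = log(wage)) %>%
  group_by(education) %>%
  summarize(
    mean_log_wage = mean(log_wage, na.rm = TRUE),
    n = n()
  ) %>%
  arrange(desc(mean_log_wage))
```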
Regression with fixest
Why fixest?
fixest has become the standard for regression in applied economics research. It handles high-dimensional fixed effects efficiently (crucial for large panel datasets), supports multiway clustering out of the box, and produces publication-ready tables with etable(). For most empirical work, fixest should replace both lm() and older packages like lfe or plm.
fixest is the modern standard for regression in R---fast, flexible, and handles fixed effects and clustering well:
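A minimal sketch (the data and variables are hypothetical):

```r
library(fixest)

# Wage regression with state and year fixed effects,
# standard errors clustered by state
model <- feols(
  log_wage ~ education + experience | state + year,
  data    = panel_data,
  cluster = ~state
)

summary(model)

# Publication-ready table (LaTeX output)
etable(model, tex = TRUE)
```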
Visualization with ggplot2
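ggplot2 builds a figure by mapping variables to aesthetics and then adding layers. A minimal sketch (hypothetical data):

```r
library(ggplot2)

ggplot(wages, aes(x = education, y = mean_log_wage)) +
  geom_col() +
  labs(
    x = "Years of education",
    y = "Mean log wage",
    title = "Returns to education"
  ) +
  theme_minimal()

ggsave("output/figures/returns_to_education.png", width = 6, height = 4)
```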

Python Essentials
pandas for data manipulation
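A minimal sketch of the equivalent pandas workflow (hypothetical file and column names):

```python
import numpy as np
import pandas as pd

# Read raw data, keep prime-age workers, and summarize wages by education
wages = pd.read_csv("data/raw/wages.csv")
wages = wages[(wages["age"] >= 25) & (wages["age"] <= 54)].copy()
wages["log_wage"] = np.log(wages["wage"])

summary = (
    wages.groupby("education")["log_wage"]
    .agg(["mean", "count"])
    .sort_values("mean", ascending=False)
)
print(summary)
```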
Regression with statsmodels and linearmodels
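A minimal sketch (hypothetical data; statsmodels handles cross-sectional OLS, linearmodels the panel fixed effects):

```python
import statsmodels.formula.api as smf
from linearmodels.panel import PanelOLS

# Cross-sectional OLS with heteroskedasticity-robust standard errors
ols = smf.ols("log_wage ~ education + experience", data=wages).fit(cov_type="HC1")
print(ols.summary())

# Panel regression with entity and time fixed effects
panel = wages.set_index(["person_id", "year"])
fe = PanelOLS.from_formula(
    "log_wage ~ experience + union + EntityEffects + TimeEffects",
    data=panel,
).fit(cov_type="clustered", cluster_entity=True)
print(fe.summary)
```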
Visualization with matplotlib and seaborn
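A minimal sketch (hypothetical data):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Mean log wage by education (seaborn draws on top of a matplotlib figure)
fig, ax = plt.subplots(figsize=(6, 4))
sns.barplot(data=wages, x="education", y="log_wage", ax=ax)
ax.set_xlabel("Years of education")
ax.set_ylabel("Mean log wage")
ax.set_title("Returns to education")
fig.tight_layout()
fig.savefig("output/figures/returns_to_education.png", dpi=300)
```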
R ggplot2 vs. Python matplotlib/seaborn
Both ecosystems produce publication-quality graphics, but the philosophies differ. ggplot2 uses a "grammar of graphics"—you build plots by adding layers. matplotlib is more imperative—you issue commands to modify a figure object. seaborn adds statistical visualization on top of matplotlib with a cleaner API. For most plots, ggplot2 requires less code; for highly customized figures, matplotlib offers more control.
Data Types: Factors vs. Strings
A common source of confusion (and bugs) is the distinction between categorical data types and strings. Getting this right matters for modeling, efficiency, and correctness.
R: Factors vs. Characters
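A minimal sketch of the difference (hypothetical values):

```r
# Character vector: just text
edu_chr <- c("HS", "BA", "PhD", "BA")
class(edu_chr)      # "character"

# Factor: fixed set of levels, stored internally as integers
edu_fct <- factor(edu_chr, levels = c("HS", "BA", "PhD"))
levels(edu_fct)     # "HS" "BA" "PhD"

# Ordered factor: levels have a meaningful order
edu_ord <- factor(edu_chr, levels = c("HS", "BA", "PhD"), ordered = TRUE)
edu_ord[1] < edu_ord[3]   # TRUE: HS < PhD

# In a regression formula, a factor is expanded into dummy variables automatically
```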
Python: Category vs. Object (string)
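The same distinction in pandas (hypothetical values):

```python
import pandas as pd

df = pd.DataFrame({"education": ["HS", "BA", "PhD", "BA"]})

# Object (string) dtype: just text
print(df["education"].dtype)           # object

# Category dtype: fixed set of levels, more memory-efficient
df["education"] = df["education"].astype("category")
print(df["education"].cat.categories)

# Ordered category: levels have a meaningful order
df["education"] = df["education"].cat.set_categories(["HS", "BA", "PhD"], ordered=True)
print(df["education"].min())           # HS
```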
When to Use Which
| Type | Use for | Examples |
|---|---|---|
| String/Character | Free-form text, unique identifiers | Names, addresses, IDs |
| Factor/Category | Repeated categories, regression | State, education level, industry |
| Ordered Factor | Categories with natural ordering | Education (HS < BA < PhD), ratings |
Common Pitfalls
For Regression Models: Always be explicit about categorical variables. In R, use factor(). In Python with statsmodels, use C() in the formula: smf.ols('y ~ C(state)', data=df). In scikit-learn, use OneHotEncoder or pd.get_dummies().
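For instance (hypothetical variables):

```python
import pandas as pd
import statsmodels.formula.api as smf

# statsmodels: C() tells the formula parser to treat state as categorical
model = smf.ols("log_wage ~ education + C(state)", data=df).fit()

# scikit-learn workflows: expand the categories into dummy columns first
X = pd.get_dummies(df[["education", "state"]], columns=["state"], drop_first=True)
```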
Side-by-Side Comparison
Here's the same analysis in both languages:
Task: Load data, compute returns to education, make a table
R version:
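A minimal sketch, assuming a hypothetical wages.csv with log_wage, education, and experience columns and a recent version of fixest:

```r
library(tidyverse)
library(fixest)

# Load data and keep complete cases
wages <- read_csv("data/processed/wages.csv") %>%
  drop_na(log_wage, education, experience)

# Returns to education: OLS with heteroskedasticity-robust standard errors
model <- feols(log_wage ~ education + experience, data = wages, vcov = "hetero")

# Export a regression table
etable(model, file = "output/tables/returns.tex")
```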
Python version:
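The same steps with pandas and statsmodels (same hypothetical dataset):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Load data and keep complete cases
wages = pd.read_csv("data/processed/wages.csv")
wages = wages.dropna(subset=["log_wage", "education", "experience"])

# Returns to education: OLS with heteroskedasticity-robust standard errors
model = smf.ols("log_wage ~ education + experience", data=wages).fit(cov_type="HC1")

# Export a regression table
with open("output/tables/returns.tex", "w") as f:
    f.write(model.summary().as_latex())
```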
Debugging and Troubleshooting
Every programmer spends significant time debugging. Developing good debugging habits early saves countless hours.
Reading error messages
Error messages seem cryptic at first but contain valuable information:
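For example, in R you might see:

```
Error: object 'analysis_data' not found
```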
This tells you: the object analysis_data doesn't exist in your environment. Either you haven't run the code that creates it, or you spelled it differently.
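Or, in Python with pandas (the column name here is deliberately misspelled):

```
KeyError: 'eduction'
```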
This tells you: you're trying to access a column called eduction that doesn't exist—likely a typo for education.
Common errors and solutions:
| Error | Cause | Solution |
|---|---|---|
| Object not found | Didn't run earlier code | Run scripts in order |
| File not found | Wrong path or filename | Check working directory, spelling |
| Type error | Wrong data type | Check variable types, convert if needed |
| NA/NaN in results | Missing data in computation | Handle missing values explicitly |
| Dimension mismatch | Vectors/matrices wrong size | Check data shapes before operations |
Debugging strategies:
Isolate the problem: Comment out code until error disappears; the last uncommented line is the culprit
Print intermediate values: Insert print() statements to see what's happening
Check data at each step: Look at head(), str(), or shape after transformations
Rubber duck debugging: Explain your code line-by-line to an imaginary listener (or rubber duck)
Search the error message: Copy the exact error into Google—someone has encountered it before
R debugging tools:
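A few built-in tools worth knowing (my_function is a placeholder):

```r
traceback()               # after an error: show the call stack that produced it
browser()                 # insert inside a function to pause and inspect objects
debug(my_function)        # step through my_function line by line on its next call
options(error = recover)  # on any error, drop into an interactive frame browser
str(df); head(df)         # quick checks of structure and contents
```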
Python debugging tools:
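The Python equivalents (a sketch):

```python
# Pause execution and open the interactive debugger (pdb) at this point
breakpoint()    # Python 3.7+

# Run a whole script under the debugger:
#   python -m pdb script.py
# In IPython/Jupyter, after an exception:
#   %debug    -> post-mortem debugger at the failing frame
#   %pdb on   -> automatically enter the debugger on future errors

# Quick structural checks
print(df.shape, df.dtypes)
print(df.head())
```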
Defensive Programming: Fail Early, Fail Loud
Debugging finds errors after they occur. Defensive programming prevents them—or catches them immediately when assumptions are violated.
Principle: Code should check its own assumptions. When something unexpected happens, fail immediately with a clear message rather than silently producing wrong results.
Why this matters for empirical research: A silent error in row 50,000 of your data processing can propagate through your entire analysis, producing results that look reasonable but are completely wrong. You might not discover the problem until after publication.
Assertions: Check conditions that must be true
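In R, stopifnot() is the simplest tool; Python's assert statement plays the same role. A sketch with hypothetical data frames:

```r
# Fail immediately if key assumptions are violated
stopifnot(
  nrow(merged) == nrow(wages),              # the merge did not duplicate rows
  !any(is.na(merged$log_wage)),             # no missing outcomes
  all(merged$age >= 16 & merged$age <= 99)  # ages in a plausible range
)
```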
Common assertions for empirical work:
| Assertion | Why |
|---|---|
| Row counts after merges | Catch unintended duplicates or drops |
| No missing values in key variables | Prevent silent NA propagation |
| Values in expected range | Catch data entry errors or coding mistakes |
| Unique identifiers are unique | Prevent incorrect aggregation |
| Panel is balanced (if expected) | Catch missing observations |
Example: Defensive data pipeline
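A minimal sketch in R, with hypothetical file names and variables:

```r
library(tidyverse)

# 1. Read raw data (never modified in place)
wages   <- read_csv("data/raw/wages.csv")
schools <- read_csv("data/raw/schools.csv")

# 2. Check assumptions about the inputs
stopifnot(
  !any(duplicated(wages$person_id)),   # person_id uniquely identifies rows
  !any(is.na(wages$wage))              # wages are never missing in the raw file
)

# 3. Merge and verify that the merge behaved as expected
merged <- left_join(wages, schools, by = "school_id")
stopifnot(nrow(merged) == nrow(wages))  # a left join must not add rows

# 4. Construct variables and check their ranges
merged <- merged %>% mutate(log_wage = log(wage))
stopifnot(all(is.finite(merged$log_wage)))

# 5. Write processed data
write_csv(merged, "data/processed/analysis_data.csv")
```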
The 10x Rule: Time spent adding assertions is repaid tenfold in debugging time saved. Every hour of defensive coding prevents ten hours of hunting for mysterious bugs.
Working with APIs and External Data
Modern empirical research often requires pulling data from APIs (Application Programming Interfaces). Here's how to access common economic data sources:
R: Using FRED data
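One option is the fredr package (a sketch; it requires a free FRED API key):

```r
library(fredr)

# Set your API key once per session (better: store it in .Renviron, never in committed code)
fredr_set_key("YOUR_FRED_API_KEY")

# Monthly unemployment rate since 2000
unrate <- fredr(
  series_id         = "UNRATE",
  observation_start = as.Date("2000-01-01")
)
head(unrate)
```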
Python: Using pandas-datareader
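The same series via pandas-datareader (a sketch; no API key required for FRED):

```python
import pandas_datareader.data as web

# Monthly unemployment rate (UNRATE) from FRED since 2000
unrate = web.DataReader("UNRATE", "fred", start="2000-01-01")
print(unrate.head())
```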
Census and survey data:
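In R, the tidycensus package wraps the Census Bureau API (a sketch; it requires a free Census API key). For survey microdata such as the CPS or ACS extracts, IPUMS and its ipumsr package are common alternatives.

```r
library(tidycensus)

census_api_key("YOUR_CENSUS_API_KEY")

# Median household income by county from the American Community Survey
income <- get_acs(
  geography = "county",
  variables = "B19013_001",   # ACS median household income
  year      = 2021
)
```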
4.3 Version Control with Git
Why Version Control?
Without version control:
analysis_final.R
analysis_final_v2.R
analysis_final_v2_fixed.R
analysis_FINAL_actually_final.R
With version control:
One file: analysis.R
Complete history of all changes
Can revert to any previous version
Know exactly what changed when and why
Git Fundamentals
Key concepts:
Repository (repo): A project tracked by Git
Commit: A snapshot of your project at a point in time
Branch: A parallel version of your project
Remote: A copy of your repo on a server (GitHub, GitLab)
GUI Alternatives to Command-Line Git
While this chapter teaches command-line Git (which is essential to understand), many researchers prefer visual interfaces for day-to-day work:
RStudio: Built-in Git pane shows changes, supports staging, committing, pushing, and pulling. Ideal for R users.
VS Code: Source Control panel with visual diffs, staging, and commit interface. Works for any language.
GitHub Desktop: Standalone app that simplifies cloning, branching, and syncing with GitHub.
GitKraken, Sourcetree: Full-featured Git GUIs with visualizations of branch history.
Recommendation: Learn the commands first (to understand what's happening), then use whichever interface fits your workflow. RStudio's Git integration is particularly well-suited for typical empirical research projects.
Basic workflow:
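The day-to-day cycle is check, stage, commit (the file name and message are illustrative):

```bash
git status                                 # what has changed?
git add code/01_clean.R                    # stage the files you want to commit
git commit -m "Add data cleaning script"   # snapshot with a descriptive message
git log --oneline                          # review the history
```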
The .gitignore file:
Critical: What NOT to Commit
Three categories of files should never enter version control: (1) Sensitive data—credentials, API keys, IRB-protected data. These can be accidentally leaked even after deletion. (2) Large binary files—raw datasets, images, PDFs. Git stores every version, so repositories bloat quickly. Use Git LFS or external storage instead. (3) Generated outputs—anything your code can recreate. Commit the code that makes the figure, not the figure itself.
Some files shouldn't be tracked---large data, sensitive information, generated outputs:
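A typical .gitignore for an empirical project might look like this (adjust to your own layout):

```
# Data stays out of version control
data/

# Generated outputs
output/figures/
output/tables/

# Credentials and environment files
.Renviron
.env

# Language and editor clutter
.Rhistory
.RData
__pycache__/
.ipynb_checkpoints/
```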
Collaboration Workflow
Working with branches:
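A sketch (branch names are illustrative):

```bash
git branch                   # list branches
git switch -c new-analysis   # create and switch to a new branch
# ... edit, add, commit as usual ...
git switch main              # return to the main branch
git merge new-analysis       # bring the branch's commits into main
git branch -d new-analysis   # delete the merged branch
```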
Working with remotes (GitHub):
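A sketch (the repository URL is hypothetical):

```bash
git clone https://github.com/username/project.git   # copy an existing repo
git remote -v                                        # show configured remotes
git pull                                             # fetch and merge collaborators' changes
git push                                             # upload your commits
git push -u origin main                              # first push of a new branch
```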
Handling conflicts:
When two people edit the same lines:
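Git marks the conflicting region in the file; you edit it to keep what you want, then stage and commit to finish the merge. The markers look like this (contents are illustrative):

```
<<<<<<< HEAD
se_type <- "hetero"     # your version
=======
se_type <- "cluster"    # your collaborator's version
>>>>>>> collaborator-branch
```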
Best Practices
Commit often: Small, focused commits are easier to understand and revert
Write good messages: "Fix bug in standard error calculation" not "updates"
Don't commit sensitive data: Use .gitignore
Don't commit large files: Use Git LFS or don't track data
Pull before you push: Get collaborators' changes first
4.4 Data Management
The Cardinal Rule
Principle 4.1 (Raw Data Immutability): Never modify raw data files. Ever. All transformations happen in code, creating new files.
Why this matters:
You can always recreate processed data from raw + code
Errors in cleaning can be identified and fixed
The transformation process is documented
Replication is possible
Data Cleaning Workflow
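A typical cleaning script reads from data/raw/, applies documented transformations, and writes to data/processed/. A minimal sketch (hypothetical files and variables):

```r
# 01_clean.R: reads raw data, writes processed data; never edits data/raw/
library(tidyverse)

raw <- read_csv("data/raw/survey_2023.csv")

clean <- raw %>%
  rename(wage = q17_wage, education = q4_educ) %>%  # descriptive names
  filter(!is.na(wage), wage > 0) %>%                # document every drop
  mutate(log_wage = log(wage))

write_csv(clean, "data/processed/survey_2023_clean.csv")
```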
Codebooks and Documentation
Every dataset needs documentation:
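At minimum, a codebook lists each variable, its type, how it was constructed, and its source. A sketch (entries are hypothetical):

| Variable | Type | Description | Source |
|---|---|---|---|
| person_id | integer | Unique person identifier | Raw survey file |
| log_wage | numeric | log(hourly wage), 2023 dollars | Constructed in 01_clean.R |
| education | factor | Highest degree (HS, BA, PhD) | Raw survey file |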
File Formats
Recommended formats:
| Purpose | Format | Why |
|---|---|---|
| Working data | Parquet | Fast, compressed, preserves types |
| Sharing with others | CSV | Universal compatibility |
| Large datasets | Parquet or Feather | Much faster than CSV |
| Final archive | CSV + codebook | Long-term readability |
Reading and writing in R:
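A sketch using readr for CSV and the arrow package for Parquet (file names are hypothetical):

```r
library(readr)
library(arrow)

df <- read_csv("data/raw/wages.csv")                  # read CSV
write_parquet(df, "data/processed/wages.parquet")     # write Parquet (fast, preserves types)
df <- read_parquet("data/processed/wages.parquet")    # read Parquet
write_csv(df, "output/wages_shared.csv")              # write CSV for sharing
```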
Reading and writing in Python:
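The pandas equivalents (Parquet support requires the pyarrow package):

```python
import pandas as pd

df = pd.read_csv("data/raw/wages.csv")                   # read CSV
df.to_parquet("data/processed/wages.parquet")            # write Parquet
df = pd.read_parquet("data/processed/wages.parquet")     # read Parquet
df.to_csv("output/wages_shared.csv", index=False)        # write CSV for sharing
```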
Handling Sensitive Data
Some data requires special handling:
Personally identifiable information (PII):
Names, addresses, SSNs, etc.
Store separately from analysis data
Never commit to Git
Consider encryption
Confidential data:
Data under restricted access agreements
Follow data provider's requirements
Document what can/cannot be shared
Create synthetic or redacted versions for replication
Practical Guidance
Setting Up a New Project
Step-by-step checklist:
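A sketch of the setup from the command line (folder names follow the structure suggested earlier; adapt as needed):

```bash
mkdir -p project/{data/raw,data/processed,code,output/figures,output/tables,docs}
cd project
git init                            # start version control immediately
touch README.md .gitignore          # write these before any analysis code
# R: renv::init()   |   Python: python -m venv .venv; pip freeze > requirements.txt
git add README.md .gitignore
git commit -m "Initial project structure"
```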
Common Pitfalls
Pitfall 1: "I'll Clean Up Later" Starting with messy organization, planning to fix it later. You won't.
How to avoid: Start organized. The 10 minutes of setup saves hours later.
Pitfall 2: Hardcoded Paths Code like read_csv("C:/Users/John/Documents/research/data.csv") breaks on any other machine.
How to avoid: Use relative paths. Set working directory to project root.
Pitfall 3: Not Tracking Package Versions Code that runs today may not run in 6 months with updated packages.
How to avoid: Use renv (R) or requirements.txt (Python). Snapshot your environment.
Pitfall 4: Overwriting Raw Data Accidentally saving cleaned data over raw data.
How to avoid: Use different directories. Consider making raw data read-only.
Implementation Checklist
Every project should have:
A README.md describing the project, its authors, and how to run the code
A folder structure that separates raw data, processed data, code, and output
Raw data that is never modified (ideally stored read-only)
Numbered scripts that run in order with clear inputs and outputs
A Git repository with an appropriate .gitignore
A snapshot of package versions (renv for R, requirements.txt for Python)
A codebook documenting every variable in the processed data
Summary
Key takeaways:
Reproducibility requires organization from the start: clear folder structures, numbered scripts, preserved raw data, and documented decisions.
Both R and Python are capable tools for empirical research. R excels in econometrics packages and visualization; Python in general programming and machine learning. Learn both, master one.
Version control with Git tracks changes, enables collaboration, and provides a safety net. Commit often, write clear messages, and use .gitignore appropriately.
Returning to the opening question: A reproducible computational environment combines organized project structure, version control, environment management, and careful data handling. These practices require initial investment but pay dividends in reliability, collaboration, and your own peace of mind when returning to a project months later.
Further Reading
Essential
Gentzkow, M. and J. Shapiro (2014). "Code and Data for the Social Sciences: A Practitioner's Guide." [Online]
Wilson, G. et al. (2017). "Good Enough Practices in Scientific Computing." PLOS Computational Biology.
For Deeper Understanding
Wickham, H. and G. Grolemund (2017). "R for Data Science." O'Reilly. [Free online]
McKinney, W. (2022). "Python for Data Analysis." 3rd ed. O'Reilly.
Advanced/Specialized
Bryan, J. "Happy Git with R." [Online resource for Git + R]
The Turing Way Community. "The Turing Way: A Handbook for Reproducible Data Science."
Exercises
Conceptual
Explain why raw data should never be modified directly. What problems can arise from editing raw data files?
A colleague's project has files named analysis_v1.R, analysis_v2_final.R, and analysis_v2_final_ACTUAL.R. What would you recommend they do differently?
Applied
Set up a new research project following the principles in this chapter:
Create the folder structure
Initialize Git
Create a .gitignore
Write a README
Add a sample data file and cleaning script
Take an existing messy project (your own or a provided example) and reorganize it following best practices. Document what changes you made and why.
Discussion
Some researchers argue that requiring code and data availability imposes unfair burdens. Others argue it's essential for science. What's your view, and how would you balance these concerns?