Chapter 4: Programming Companion—Foundations

Opening Question

How do we set up a computational environment that makes empirical research reproducible, collaborative, and efficient?


Chapter Overview

This chapter introduces the computational foundations for empirical research. Modern empirical work requires not just statistical knowledge but also practical skills in organizing projects, writing code, managing data, and collaborating with others. These skills are rarely taught formally but make the difference between research that can be reproduced, extended, and trusted, and research that exists only on one person's laptop.

We focus on two languages---R and Python---because they dominate modern empirical social science. Both are free, open-source, and have rich ecosystems for data analysis. Rather than advocate for one, we show how to use each effectively and when each has advantages.

What you will learn:

  • How to organize research projects for reproducibility

  • Core R and Python skills for data analysis

  • Version control with Git for tracking changes and collaboration

  • Data management principles that prevent common disasters

Prerequisites: Basic familiarity with any programming language is helpful but not required


4.1 Reproducible Workflow

Why Reproducibility Matters

A reproducible project is one where another researcher---or your future self---can understand what was done and verify the results. This is both a scientific ideal and a practical necessity.

Scientific reasons:

  • Verification: Others can check your work

  • Extension: Others can build on your work

  • Credibility: Transparent work is more trustworthy

Practical reasons:

  • Your future self will forget what you did

  • Collaborators need to understand your work

  • Journals increasingly require replication materials

  • Errors are easier to find and fix

Project Organization

A well-organized project follows predictable conventions:
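One common layout is sketched below; the folder names are a convention, not a requirement, and 01_clean.R and 02_analyze.R stand in for your own scripts.

    project/
    ├── data/
    │   ├── raw/           # original files, never modified
    │   └── processed/     # files created by code
    ├── code/
    │   ├── 01_clean.R
    │   └── 02_analyze.R
    ├── output/
    │   ├── figures/
    │   └── tables/
    ├── README.md
    └── .gitignore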

Key principles:

  1. Numbered scripts: Files run in order. 01_clean.R runs before 02_analyze.R. This makes dependencies explicit.

  2. Separation of raw and processed data: Raw data is sacred---never modify it. All transformations happen in code.

  3. Clear inputs and outputs: Each script should have well-defined inputs (what it reads) and outputs (what it creates).

  4. Self-documenting structure: A new collaborator should understand the project from the folder names alone.

The README File

Every project needs a README.md explaining:

  • What the project is: the research question and the data it uses

  • Authors: name and email for each contributor

  • How to reproduce the results: which scripts to run, in what order

  • Software requirements: how to restore the environment (R: an renv lockfile; Python: requirements.txt or a conda environment file)
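A short sketch of such a README; the names and details are placeholders.

    # Project title and one-line description

    ## Authors
    - Jane Doe (jane.doe@example.edu)   (placeholder)

    ## How to reproduce
    1. Restore the environment: `renv::restore()` (R) or `pip install -r requirements.txt` (Python)
    2. Run the scripts in code/ in numbered order (01_clean, then 02_analyze)

    ## Data
    Describe raw data sources, access restrictions, and where files should be placed.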


4.2 R and Python for Empirical Research

Choosing Between R and Python

Both languages can do everything you need. The choice often depends on context:

| Consideration | R | Python |
|---|---|---|
| Econometrics packages | Excellent (fixest, plm, rdrobust) | Good (statsmodels, linearmodels) |
| Data manipulation | Excellent (tidyverse) | Excellent (pandas) |
| Visualization | Excellent (ggplot2) | Good (matplotlib, seaborn) |
| Machine learning | Good (tidymodels, caret) | Excellent (scikit-learn) |
| General programming | Adequate | Excellent |
| Industry jobs | Less common | Very common |
| Academic economics | Dominant | Growing |

Recommendation: Learn both, but invest more deeply in one. Use R if you're focused on academic economics; Python if you want broader applicability or ML emphasis.

R Essentials

The tidyverse

The tidyverse is a collection of packages sharing consistent design philosophy:
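The sketch below shows the style, assuming a hypothetical wages.csv with wage and education columns.

    library(tidyverse)

    wages <- read_csv("data/processed/wages.csv")   # hypothetical file

    wages %>%
      filter(!is.na(wage), wage > 0) %>%            # keep valid observations
      mutate(log_wage = log(wage)) %>%              # create the outcome variable
      group_by(education) %>%                       # one group per education level
      summarise(mean_log_wage = mean(log_wage),
                n = n())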

Regression with fixest

Why fixest?

fixest has become the standard for regression in applied economics research. It handles high-dimensional fixed effects efficiently (crucial for large panel datasets), supports multiway clustering out of the box, and produces publication-ready tables with etable(). For most empirical work, fixest should replace both lm() and older packages like lfe or plm.

In practice, a fixest call specifies the regressors in the formula, the fixed effects after a vertical bar, and the clustering variable through the cluster argument.
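A minimal sketch with hypothetical wage data; the variable names are illustrative.

    library(fixest)
    library(readr)

    wages <- read_csv("data/processed/wages.csv")   # hypothetical file

    # Log wage on education and experience with state and year fixed effects;
    # standard errors clustered by state
    m <- feols(log(wage) ~ education + experience | state + year,
               data = wages, cluster = ~state)

    summary(m)
    etable(m)    # publication-style regression table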

Visualization with ggplot2
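A sketch of the kind of code behind panel (a) of Figure 4.1, using the same hypothetical wages data.

    library(tidyverse)

    wages <- read_csv("data/processed/wages.csv")   # hypothetical file

    ggplot(wages, aes(x = education, y = log(wage))) +
      geom_point(alpha = 0.3) +                     # raw observations
      geom_smooth(method = "lm", se = TRUE) +       # fitted line with confidence band
      labs(x = "Years of education", y = "Log wage") +
      theme_minimal()

    ggsave("output/figures/educ_wage.png", width = 6, height = 4)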

Figure 4.1: Example visualization outputs. (a) Scatter plot with regression line and confidence band showing the relationship between education and log wages. (b) Density plot comparing wage distributions by gender. These examples illustrate the clean, publication-ready graphics that ggplot2 (R) and seaborn (Python) produce with minimal code.

Python Essentials

pandas for data manipulation
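A minimal pandas sketch mirroring the tidyverse example above; wages.csv is hypothetical.

    import numpy as np
    import pandas as pd

    wages = pd.read_csv("data/processed/wages.csv")    # hypothetical file

    summary = (
        wages
        .query("wage > 0")                             # keep valid observations
        .assign(log_wage=lambda d: np.log(d["wage"]))  # create the outcome variable
        .groupby("education", as_index=False)
        .agg(mean_log_wage=("log_wage", "mean"),
             n=("log_wage", "size"))
    )
    print(summary)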

Regression with statsmodels and linearmodels
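A sketch of the two main options, using the same hypothetical data; the panel example assumes person-year data indexed by person_id and year.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf
    from linearmodels.panel import PanelOLS

    wages = pd.read_csv("data/processed/wages.csv")     # hypothetical file
    wages["log_wage"] = np.log(wages["wage"])

    # OLS with state dummies and standard errors clustered by state (statsmodels)
    ols = smf.ols("log_wage ~ education + experience + C(state)", data=wages).fit(
        cov_type="cluster", cov_kwds={"groups": wages["state"]}
    )
    print(ols.summary())

    # Person-year panel with individual and year fixed effects (linearmodels)
    panel = wages.set_index(["person_id", "year"])
    fe = PanelOLS.from_formula(
        "log_wage ~ experience + EntityEffects + TimeEffects", data=panel
    ).fit(cov_type="clustered", cluster_entity=True)
    print(fe)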

Visualization with matplotlib and seaborn
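A sketch producing plots like those in Figure 4.1, again with the hypothetical wages data.

    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd
    import seaborn as sns

    wages = pd.read_csv("data/processed/wages.csv")     # hypothetical file
    wages["log_wage"] = np.log(wages["wage"])

    # (a) Scatter with fitted regression line and confidence band
    sns.regplot(data=wages, x="education", y="log_wage", scatter_kws={"alpha": 0.3})
    plt.xlabel("Years of education")
    plt.ylabel("Log wage")
    plt.tight_layout()
    plt.savefig("output/figures/educ_wage.png", dpi=300)
    plt.close()

    # (b) Density of log wages by gender
    sns.kdeplot(data=wages, x="log_wage", hue="gender", fill=True)
    plt.tight_layout()
    plt.savefig("output/figures/wage_density.png", dpi=300)
    plt.close()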

R ggplot2 vs. Python matplotlib/seaborn

Both ecosystems produce publication-quality graphics, but the philosophies differ. ggplot2 uses a "grammar of graphics"—you build plots by adding layers. matplotlib is more imperative—you issue commands to modify a figure object. seaborn adds statistical visualization on top of matplotlib with a cleaner API. For most plots, ggplot2 requires less code; for highly customized figures, matplotlib offers more control.

Data Types: Factors vs. Strings

A common source of confusion (and bugs) is the distinction between categorical data types and strings. Getting this right matters for modeling, efficiency, and correctness.

R: Factors vs. Characters
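A small sketch of the distinction in base R.

    # Character vector: just text
    edu_chr <- c("HS", "BA", "BA", "PhD")
    class(edu_chr)        # "character"

    # Factor: categories with defined levels
    edu_fct <- factor(edu_chr, levels = c("HS", "BA", "PhD"))
    levels(edu_fct)       # "HS" "BA" "PhD"
    table(edu_fct)        # counts by category

    # Ordered factor: levels have a natural ranking
    edu_ord <- factor(edu_chr, levels = c("HS", "BA", "PhD"), ordered = TRUE)
    edu_ord < "PhD"       # comparisons are allowed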

Python: Category vs. Object (string)
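The equivalent sketch in pandas.

    import pandas as pd

    df = pd.DataFrame({"education": ["HS", "BA", "BA", "PhD"]})
    print(df["education"].dtype)            # object: plain strings

    # Convert to a categorical column
    df["education"] = df["education"].astype("category")
    print(df["education"].cat.categories)   # the distinct categories

    # Ordered categories
    df["education"] = df["education"].cat.reorder_categories(
        ["HS", "BA", "PhD"], ordered=True
    )
    print(df["education"] < "PhD")          # ordered comparisons are allowed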

When to Use Which

| Data Type | Use When | Example |
|---|---|---|
| String/Character | Free-form text, unique identifiers | Names, addresses, IDs |
| Factor/Category | Repeated categories, regression | State, education level, industry |
| Ordered Factor | Categories with natural ordering | Education (HS < BA < PhD), ratings |

Common Pitfalls

For Regression Models: Always be explicit about categorical variables. In R, use factor(). In Python with statsmodels, use C() in the formula: smf.ols('y ~ C(state)', data=df). In scikit-learn, use OneHotEncoder or pd.get_dummies().
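For example, a minimal Python sketch; df and the variable names are hypothetical.

    import pandas as pd
    import statsmodels.formula.api as smf

    # statsmodels: C() creates dummy variables for state inside the formula
    model = smf.ols("log_wage ~ education + C(state)", data=df).fit()

    # scikit-learn workflows: create dummies explicitly before fitting
    X = pd.get_dummies(df[["education", "state"]], columns=["state"], drop_first=True)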

Side-by-Side Comparison

Here's the same analysis in both languages:

Task: Load data, compute returns to education, make a table

R version:
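Both versions below are sketches assuming a hypothetical data/raw/wages.csv with wage, education, experience, state, and year columns.

    library(tidyverse)
    library(fixest)

    wages <- read_csv("data/raw/wages.csv") %>%          # hypothetical file
      filter(wage > 0) %>%
      mutate(log_wage = log(wage))

    m <- feols(log_wage ~ education + experience | state + year,
               data = wages, cluster = ~state)

    etable(m, file = "output/tables/returns_to_education.tex")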

Python version:
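The same steps in Python, with the same hypothetical file.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    wages = pd.read_csv("data/raw/wages.csv")            # hypothetical file
    wages = wages[wages["wage"] > 0].assign(log_wage=lambda d: np.log(d["wage"]))

    model = smf.ols("log_wage ~ education + experience + C(state) + C(year)",
                    data=wages).fit(cov_type="cluster",
                                    cov_kwds={"groups": wages["state"]})

    with open("output/tables/returns_to_education.txt", "w") as f:
        f.write(model.summary().as_text())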

Debugging and Troubleshooting

Every programmer spends significant time debugging. Developing good debugging habits early saves countless hours.

Reading error messages

Error messages seem cryptic at first but contain valuable information:
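For example, R reports a reference to an object that does not exist like this:

    Error: object 'analysis_data' not found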

This tells you: the object analysis_data doesn't exist in your environment. Either you haven't run the code that creates it, or you spelled it differently.
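A misspelled column name produces a different kind of error; in pandas, for instance:

    KeyError: 'eduction'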

This tells you: you're trying to access a column called eduction that doesn't exist—likely a typo for education.

Common errors and solutions:

| Error Type | Likely Cause | Solution |
|---|---|---|
| Object not found | Didn't run earlier code | Run scripts in order |
| File not found | Wrong path or filename | Check working directory, spelling |
| Type error | Wrong data type | Check variable types, convert if needed |
| NA/NaN in results | Missing data in computation | Handle missing values explicitly |
| Dimension mismatch | Vectors/matrices wrong size | Check data shapes before operations |

Debugging strategies:

  1. Isolate the problem: Comment out code until error disappears; the last uncommented line is the culprit

  2. Print intermediate values: Insert print() statements to see what's happening

  3. Check data at each step: Look at head(), str(), or shape after transformations

  4. Rubber duck debugging: Explain your code line-by-line to an imaginary listener (or rubber duck)

  5. Search the error message: Copy the exact error into Google—someone has encountered it before

R debugging tools:
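A short cheat sheet of built-in R tools; my_function is a placeholder for one of your own functions.

    traceback()                 # after an error: show the call stack where it occurred
    options(error = recover)    # drop into an interactive browser whenever an error occurs

    # Inside your own functions:
    # browser()                 # pause here and inspect objects interactively
    # debug(my_function)        # step through my_function line by line; undebug() to stop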

Python debugging tools:
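The Python equivalents, shown as a commented cheat sheet.

    import pdb

    # breakpoint()              # pause here and open the interactive debugger (Python 3.7+)
    # pdb.pm()                  # post-mortem: inspect state right after an exception

    # In IPython / Jupyter:
    # %debug                    # post-mortem debugger for the last exception
    # %pdb on                   # enter the debugger automatically on errors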

Defensive Programming: Fail Early, Fail Loud

Debugging finds errors after they occur. Defensive programming prevents them—or catches them immediately when assumptions are violated.

Principle: Code should check its own assumptions. When something unexpected happens, fail immediately with a clear message rather than silently producing wrong results.

Why this matters for empirical research: A silent error in row 50,000 of your data processing can propagate through your entire analysis, producing results that look reasonable but are completely wrong. You might not discover the problem until after publication.

Assertions: Check conditions that must be true
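In R, stopifnot() halts with an error when a condition fails. A minimal sketch, where wages and merged are hypothetical data frames:

    # Merges should not create or drop rows
    stopifnot(nrow(merged) == nrow(wages))

    # Key variables must be complete and in range
    stopifnot(!any(is.na(wages$wage)), all(wages$wage > 0))

    # Identifiers must be unique
    stopifnot(!any(duplicated(wages$person_id)))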

Common assertions for empirical work:

| What to Check | Why It Matters |
|---|---|
| Row counts after merges | Catch unintended duplicates or drops |
| No missing values in key variables | Prevent silent NA propagation |
| Values in expected range | Catch data entry errors or coding mistakes |
| Unique identifiers are unique | Prevent incorrect aggregation |
| Panel is balanced (if expected) | Catch missing observations |

Example: Defensive data pipeline
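A Python sketch of what such a pipeline might look like; the file names and columns are hypothetical.

    import pandas as pd

    raw = pd.read_csv("data/raw/survey.csv")              # hypothetical file
    state_info = pd.read_csv("data/raw/state_info.csv")   # hypothetical lookup table

    # 1. Check the input before doing anything else
    assert raw["person_id"].is_unique, "Duplicate person_id in raw data"
    assert raw["wage"].notna().all(), "Missing wages in raw data"

    # 2. Merge and immediately verify the result
    merged = raw.merge(state_info, on="state", how="left", validate="many_to_one")
    assert len(merged) == len(raw), "Merge changed the number of rows"
    assert merged["region"].notna().all(), "Some states did not match"

    # 3. Check ranges before saving
    assert merged["age"].between(16, 99).all(), "Implausible ages"

    merged.to_parquet("data/processed/survey_clean.parquet")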

The 10x Rule: Time spent adding assertions is repaid tenfold in debugging time saved. Every hour of defensive coding prevents ten hours of hunting for mysterious bugs.

Working with APIs and External Data

Modern empirical research often requires pulling data from APIs (Application Programming Interfaces). Here's how to access common economic data sources:

R: Using FRED data
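One option is the fredr package; the sketch below requires a free FRED API key.

    library(fredr)

    fredr_set_key("YOUR_FRED_API_KEY")            # placeholder: free key from the FRED website

    unrate <- fredr(
      series_id = "UNRATE",                       # civilian unemployment rate
      observation_start = as.Date("2000-01-01")
    )
    head(unrate)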

Python: Using pandas-datareader
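A minimal pandas-datareader sketch pulling the same series.

    import pandas_datareader.data as web

    # Monthly unemployment rate from FRED
    unrate = web.DataReader("UNRATE", "fred", start="2000-01-01")
    print(unrate.tail())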

Census and survey data:
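One common route in R is the tidycensus package; this is a sketch, it requires a free Census API key, and the variable code shown is illustrative.

    library(tidycensus)

    census_api_key("YOUR_CENSUS_API_KEY")         # placeholder

    median_income <- get_acs(
      geography = "county",
      variables = "B19013_001",                   # median household income
      year = 2022
    )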


4.3 Version Control with Git

Why Version Control?

Without version control:

  • analysis_final.R

  • analysis_final_v2.R

  • analysis_final_v2_fixed.R

  • analysis_FINAL_actually_final.R

With version control:

  • One file: analysis.R

  • Complete history of all changes

  • Can revert to any previous version

  • Know exactly what changed when and why

Git Fundamentals

Key concepts:

  • Repository (repo): A project tracked by Git

  • Commit: A snapshot of your project at a point in time

  • Branch: A parallel version of your project

  • Remote: A copy of your repo on a server (GitHub, GitLab)

GUI Alternatives to Command-Line Git

While this chapter teaches command-line Git (which is essential to understand), many researchers prefer visual interfaces for day-to-day work:

  • RStudio: Built-in Git pane shows changes, supports staging, committing, pushing, and pulling. Ideal for R users.

  • VS Code: Source Control panel with visual diffs, staging, and commit interface. Works for any language.

  • GitHub Desktop: Standalone app that simplifies cloning, branching, and syncing with GitHub.

  • GitKraken, Sourcetree: Full-featured Git GUIs with visualizations of branch history.

Recommendation: Learn the commands first (to understand what's happening), then use whichever interface fits your workflow. RStudio's Git integration is particularly well-suited for typical empirical research projects.

Basic workflow:
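The core cycle, as shell commands; file names and messages are illustrative.

    git init                          # start tracking a project (once)
    git status                        # what has changed?
    git add code/01_clean.R           # stage a change
    git commit -m "Add cleaning script for wage data"
    git log --oneline                 # history of commits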

The .gitignore file:

Critical: What NOT to Commit

Three categories of files should never enter version control: (1) Sensitive data—credentials, API keys, IRB-protected data. These can be accidentally leaked even after deletion. (2) Large binary files—raw datasets, images, PDFs. Git stores every version, so repositories bloat quickly. Use Git LFS or external storage instead. (3) Generated outputs—anything your code can recreate. Commit the code that makes the figure, not the figure itself.

Some files shouldn't be tracked---large data, sensitive information, generated outputs:
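A typical .gitignore for an empirical project might look like this; the patterns are illustrative.

    # Data: raw and processed files stay out of version control
    data/

    # Generated outputs
    output/figures/
    output/tables/

    # Credentials and environment files
    .env
    *.key

    # Language/editor artifacts
    .Rhistory
    .RData
    __pycache__/
    .ipynb_checkpoints/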

Collaboration Workflow

Working with branches:
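A sketch of the branch cycle; the branch name is illustrative.

    git branch robustness-checks      # create a branch
    git switch robustness-checks      # move to it (older Git: git checkout)
    # ...edit and commit as usual...
    git switch main
    git merge robustness-checks       # bring the changes back into main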

Working with remotes (GitHub):
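The basic remote commands; the URL is a placeholder.

    git remote add origin https://github.com/USER/project.git
    git push -u origin main           # upload your commits
    git pull                          # download collaborators' commits
    git clone https://github.com/USER/project.git   # copy an existing repo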

Handling conflicts:

When two people edit the same lines:
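Git marks the conflicting region in the file; the content between the markers below is illustrative. You edit the file to keep the correct version, then stage and commit.

    <<<<<<< HEAD
    se <- sqrt(diag(vcov_cluster))      # your version
    =======
    se <- sqrt(diag(vcov_robust))       # collaborator's version
    >>>>>>> main

    # after editing the file to resolve the conflict:
    git add code/02_analyze.R
    git commit -m "Resolve conflict in standard error calculation"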

Best Practices

  1. Commit often: Small, focused commits are easier to understand and revert

  2. Write good messages: "Fix bug in standard error calculation" not "updates"

  3. Don't commit sensitive data: Use .gitignore

  4. Don't commit large files: Use Git LFS or don't track data

  5. Pull before you push: Get collaborators' changes first


4.4 Data Management

The Cardinal Rule

Principle 4.1 (Raw Data Immutability): Never modify raw data files. Ever. All transformations happen in code, creating new files.

Why this matters:

  • You can always recreate processed data from raw + code

  • Errors in cleaning can be identified and fixed

  • The transformation process is documented

  • Replication is possible

Data Cleaning Workflow
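A sketch of a cleaning script in this spirit; the file and variable names are hypothetical.

    # 01_clean.R: read raw data, fix problems, save a processed version
    library(tidyverse)

    raw <- read_csv("data/raw/survey_2024.csv")          # raw file: never modified

    clean <- raw %>%
      rename(wage = hourly_wage) %>%                     # consistent names
      filter(!is.na(wage), wage > 0) %>%                 # drop invalid observations
      mutate(
        log_wage  = log(wage),
        education = factor(education, levels = c("HS", "BA", "PhD"))
      )

    stopifnot(!any(duplicated(clean$person_id)))         # sanity check before saving

    write_csv(clean, "data/processed/survey_2024_clean.csv")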

Codebooks and Documentation

Every dataset needs documentation:
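A codebook can be as simple as a text or markdown file with one entry per variable; the entries below are hypothetical.

    variable     type      description                             source
    person_id    integer   Unique person identifier                constructed
    wage         numeric   Hourly wage in 2024 USD                 survey question (hypothetical)
    education    factor    Highest degree: HS, BA, PhD             survey question (hypothetical)
    state        factor    State of residence (two-letter code)    administrative records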

File Formats

Recommended formats:

| Use Case | Format | Why |
|---|---|---|
| Working data | Parquet | Fast, compressed, preserves types |
| Sharing with others | CSV | Universal compatibility |
| Large datasets | Parquet or Feather | Much faster than CSV |
| Final archive | CSV + codebook | Long-term readability |

Reading and writing in R:
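A sketch with readr for CSV and the arrow package for Parquet; paths are illustrative.

    library(tidyverse)
    library(arrow)    # for Parquet/Feather

    df <- read_csv("data/raw/wages.csv")                    # CSV in
    write_csv(df, "data/processed/wages.csv")               # CSV out

    write_parquet(df, "data/processed/wages.parquet")       # fast, typed format
    df <- read_parquet("data/processed/wages.parquet")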

Reading and writing in Python:
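The pandas equivalents; Parquet support requires pyarrow or fastparquet.

    import pandas as pd

    df = pd.read_csv("data/raw/wages.csv")                  # CSV in
    df.to_csv("data/processed/wages.csv", index=False)      # CSV out

    df.to_parquet("data/processed/wages.parquet")           # fast, typed format
    df = pd.read_parquet("data/processed/wages.parquet")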

Handling Sensitive Data

Some data requires special handling:

Personally identifiable information (PII):

  • Names, addresses, SSNs, etc.

  • Store separately from analysis data

  • Never commit to Git

  • Consider encryption

Confidential data:

  • Data under restricted access agreements

  • Follow data provider's requirements

  • Document what can/cannot be shared

  • Create synthetic or redacted versions for replication


Practical Guidance

Setting Up a New Project

Step-by-step checklist:
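A sketch of the setup in shell commands; the project name is a placeholder, and R users would add renv::init() while Python users would create requirements.txt or an environment file.

    mkdir my-project && cd my-project
    mkdir -p data/raw data/processed code output/figures output/tables
    git init
    touch README.md .gitignore
    # then: write the README, add .gitignore patterns, and snapshot the environment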

Common Pitfalls

Pitfall 1: "I'll Clean Up Later." You start with messy organization, planning to fix it later. You won't.

How to avoid: Start organized. The 10 minutes of setup saves hours later.

Pitfall 2: Hardcoded Paths. Code like read_csv("C:/Users/John/Documents/research/data.csv") breaks on any other machine.

How to avoid: Use relative paths. Set working directory to project root.

Pitfall 3: Not Tracking Package Versions. Code that runs today may not run in six months with updated packages.

How to avoid: Use renv (R) or requirements.txt (Python). Snapshot your environment.

Pitfall 4: Overwriting Raw Data. Accidentally saving cleaned data over raw data.

How to avoid: Use different directories. Consider making raw data read-only.

Implementation Checklist

Every project should have:

  • A clear folder structure separating raw data, processed data, code, and output

  • A README.md describing the project and how to reproduce it

  • Numbered scripts with well-defined inputs and outputs

  • A Git repository with a sensible .gitignore

  • A snapshot of the software environment (renv lockfile or requirements.txt)

  • Raw data preserved untouched, with all transformations in code


Summary

Key takeaways:

  1. Reproducibility requires organization from the start: clear folder structures, numbered scripts, preserved raw data, and documented decisions.

  2. Both R and Python are capable tools for empirical research. R excels in econometrics packages and visualization; Python in general programming and machine learning. Learn both, master one.

  3. Version control with Git tracks changes, enables collaboration, and provides a safety net. Commit often, write clear messages, and use .gitignore appropriately.

  4. Data management starts with one rule: never modify raw data. Keep all transformations in code, document datasets with codebooks, and handle sensitive data according to its restrictions.

Returning to the opening question: A reproducible computational environment combines organized project structure, version control, environment management, and careful data handling. These practices require initial investment but pay dividends in reliability, collaboration, and your own peace of mind when returning to a project months later.


Further Reading

Essential

  • Gentzkow, M. and J. Shapiro (2014). "Code and Data for the Social Sciences: A Practitioner's Guide." [Online]

  • Wilson, G. et al. (2017). "Good Enough Practices in Scientific Computing." PLOS Computational Biology.

For Deeper Understanding

  • Wickham, H. and G. Grolemund (2017). "R for Data Science." O'Reilly. [Free online]

  • McKinney, W. (2022). "Python for Data Analysis." 3rd ed. O'Reilly.

Advanced/Specialized

  • Bryan, J. "Happy Git with R." [Online resource for Git + R]

  • The Turing Way Community. "The Turing Way: A Handbook for Reproducible Data Science."


Exercises

Conceptual

  1. Explain why raw data should never be modified directly. What problems can arise from editing raw data files?

  2. A colleague's project has files named analysis_v1.R, analysis_v2_final.R, and analysis_v2_final_ACTUAL.R. What would you recommend they do differently?

Applied

  1. Set up a new research project following the principles in this chapter:

    • Create the folder structure

    • Initialize Git

    • Create a .gitignore

    • Write a README

    • Add a sample data file and cleaning script

  2. Take an existing messy project (your own or a provided example) and reorganize it following best practices. Document what changes you made and why.

Discussion

  1. Some researchers argue that requiring code and data availability imposes unfair burdens. Others argue it's essential for science. What's your view, and how would you balance these concerns?
