Chapter 8: Programming Companion—Data Exploration

Opening Question

How do we implement the descriptive and exploratory methods from Part II in practical code?


Chapter Overview

This chapter provides practical implementations of the descriptive and exploratory methods covered in Chapters 5-7. We cover six areas: visualization, text analysis, spatial data, time series, network analysis, and web data collection. For each, we show how to accomplish common tasks in both R and Python.

The emphasis is on practical workflow rather than comprehensive coverage. We show the tools most commonly needed for empirical research, with pointers to more specialized resources for advanced applications.

What you will learn:

  • How to create publication-quality visualizations with ggplot2 and matplotlib

  • How to process and analyze text data

  • How to work with spatial data and create maps

  • How to analyze time series data: decomposition, testing, and forecasting

  • How to analyze network data: structure, centrality, and communities

  • How to collect data from the web: APIs and web scraping

Prerequisites: Chapter 4 (programming foundations), Chapters 5-7 (conceptual foundations)


8.1 Visualization

Principles Review

Good visualizations should:

  • Show the data clearly

  • Avoid distortion

  • Serve a purpose (exploration, presentation, or both)

  • Be accessible (consider colorblindness, clarity at different sizes)

The Two Audiences

Visualizations serve two distinct purposes: exploration (for you) and presentation (for readers). Exploratory plots can be rough—quick histograms, scatter matrices, diagnostic checks. Presentation figures must be polished—clear labels, appropriate scales, thoughtful color choices. Don't confuse the two: spending hours on a plot you'll never show wastes time; presenting a quick-and-dirty plot to readers wastes their attention.

ggplot2 (R)

ggplot2 uses a "grammar of graphics" approach: build plots from components.

Basic structure:

Common plot types:

Coefficient plots:

Event study plots:

Saving plots:

matplotlib/seaborn (Python)

Common plot types:
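
A minimal sketch of the usual matplotlib/seaborn workflow, using seaborn's bundled tips example data (fetched on first use); the histogram and grouped scatterplot below are the kinds of quick exploratory plots described above.

    import matplotlib.pyplot as plt
    import seaborn as sns

    df = sns.load_dataset('tips')                      # small example dataset shipped with seaborn
    fig, axes = plt.subplots(1, 2, figsize=(10, 4))
    sns.histplot(data=df, x='total_bill', ax=axes[0])  # distribution of one variable
    sns.scatterplot(data=df, x='total_bill', y='tip',
                    hue='time', ax=axes[1])            # relationship between two, colored by group
    axes[0].set_xlabel('Total bill ($)')
    axes[1].set_xlabel('Total bill ($)')
    plt.tight_layout()
    plt.show()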

Coefficient plots:
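
One way to draw a coefficient plot in matplotlib, with made-up point estimates and standard errors standing in for real regression output:

    import matplotlib.pyplot as plt
    import numpy as np

    # hypothetical estimates and standard errors from a regression
    names = ['Education', 'Experience', 'Union member']
    coefs = np.array([0.08, 0.02, -0.15])
    ses = np.array([0.010, 0.005, 0.040])

    fig, ax = plt.subplots(figsize=(6, 3))
    ax.errorbar(coefs, np.arange(len(names)), xerr=1.96 * ses,
                fmt='o', capsize=3)                    # point estimate with 95% CI
    ax.axvline(0, color='grey', linestyle='--')        # reference line at zero
    ax.set_yticks(np.arange(len(names)))
    ax.set_yticklabels(names)
    ax.set_xlabel('Coefficient estimate')
    plt.tight_layout()
    plt.show()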

Interactive Visualization

R with plotly:

Python with plotly:
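
A short plotly example, using the gapminder data bundled with plotly express; hovering reveals country names, which is the main payoff of interactivity:

    import plotly.express as px

    df = px.data.gapminder().query("year == 2007")     # example dataset bundled with plotly
    fig = px.scatter(df, x='gdpPercap', y='lifeExp',
                     size='pop', color='continent',
                     hover_name='country', log_x=True) # hover shows the country name
    fig.show()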


8.2 Text Analysis

Text Processing Pipeline

Text analysis typically follows these steps:

  1. Load raw text

  2. Clean and normalize

  3. Tokenize

  4. Remove stop words, stem/lemmatize

  5. Create document-term matrix

  6. Analyze (topic models, sentiment, etc.)

R with tidytext

Topic modeling with LDA:

Sentiment analysis:

Python with scikit-learn and NLTK

Topic modeling:
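
A sketch of LDA with scikit-learn on a tiny placeholder corpus; real applications need far more documents and some tuning of the vocabulary and the number of topics.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    docs = ["tariffs and trade policy dominate the debate",
            "the central bank raised rates to curb inflation",
            "candidates campaign ahead of the election"]      # placeholder corpus

    vec = CountVectorizer(stop_words='english')
    dtm = vec.fit_transform(docs)                             # document-term matrix
    lda = LatentDirichletAllocation(n_components=2, random_state=0)
    lda.fit(dtm)

    terms = vec.get_feature_names_out()
    for k, topic in enumerate(lda.components_):
        top = [terms[i] for i in topic.argsort()[-5:][::-1]]  # five highest-weight words
        print(f"Topic {k}: {', '.join(top)}")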

Sentiment with VADER (for social media text):
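
A minimal VADER example via NLTK; the lexicon is downloaded once.

    import nltk
    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    nltk.download('vader_lexicon')                      # one-time lexicon download
    sia = SentimentIntensityAnalyzer()
    print(sia.polarity_scores("This policy is a disaster."))
    print(sia.polarity_scores("Great news, unemployment fell again!"))
    # the 'compound' score runs from -1 (most negative) to +1 (most positive)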

Beyond Basic Text Analysis: Interdisciplinary Applications

Text as data extends well beyond topic modeling and sentiment. Researchers across disciplines use computational text methods to extract structured information from unstructured sources:

Optical Character Recognition (OCR): Converting scanned historical documents into machine-readable text. Libraries like Tesseract (via tesseract in R or pytesseract in Python) handle OCR, though historical fonts and degraded documents require careful preprocessing.
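
A minimal pytesseract sketch, assuming the Tesseract engine is installed locally and using a hypothetical scanned page:

    from PIL import Image
    import pytesseract    # Python wrapper; the Tesseract engine itself is installed separately

    text = pytesseract.image_to_string(Image.open('scan_page_12.png'),  # hypothetical scan
                                       lang='eng')
    print(text[:500])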

Named Entity Recognition (NER): Extracting people, places, organizations, and dates from text. SpaCy (Python) and spacyr (R) provide pre-trained models; fine-tuning on domain-specific corpora improves accuracy for historical or technical texts.
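
A short spaCy sketch using the small English model (installed separately with python -m spacy download en_core_web_sm):

    import spacy

    nlp = spacy.load('en_core_web_sm')    # small pre-trained English model
    doc = nlp("The Bank of England raised rates in London on 2 August 2018.")
    for ent in doc.ents:
        print(ent.text, ent.label_)       # e.g., ORG, GPE, DATE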

Extracting economic information: Prices, quantities, exchange rates, and other numeric data often appear in unstructured text (historical newspapers, corporate filings, government reports). Regular expressions combined with NER can systematically extract these values.
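
For example, a regular expression for dollar amounts might look like the following; the pattern and text are illustrative only, and historical currencies need their own patterns.

    import re

    text = "Wheat sold at $4.25 per bushel in Chicago; flour reached $12,300 per carload."
    # dollar amounts with optional thousands separators and decimals
    prices = re.findall(r'\$\d{1,3}(?:,\d{3})*(?:\.\d+)?', text)
    print(prices)    # ['$4.25', '$12,300']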

Machine translation: Multilingual corpora—historical diplomatic correspondence, cross-national news coverage, colonial archives—require translation before analysis. Google Translate API, DeepL, and open-source models (via Hugging Face) enable programmatic translation at scale.

Geocoding text: Place names in historical documents can be converted to coordinates for spatial analysis using geocoding APIs (Google Maps, OpenStreetMap Nominatim) or historical gazetteers.
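
A geocoding sketch with geopy's Nominatim interface; the user_agent string is a placeholder you should replace with your own contact details.

    from geopy.geocoders import Nominatim

    geolocator = Nominatim(user_agent='research-project (you@university.edu)')  # identify yourself
    loc = geolocator.geocode('Trieste, Italy')
    print(loc.latitude, loc.longitude)
    # Nominatim's usage policy asks for at most one request per second; cache results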

For Further Study

The Programming Historian (https://programminghistorian.org/en/lessons/) offers peer-reviewed tutorials on these methods, written for humanities researchers but valuable for any discipline working with text as data. The methodological pluralism of social science increasingly requires moving between quantitative analysis and computational text methods.


8.3 Spatial Data

Spatial Data Basics

Spatial data comes in two main forms:

  • Vector: Points, lines, polygons (boundaries)

  • Raster: Grid cells (satellite imagery, elevation)

We focus on vector data, which is more common in social science.

R with sf

Mapping with ggplot2:

Spatial autocorrelation:

Python with geopandas

Mapping:
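
A minimal geopandas choropleth, assuming a hypothetical shapefile of counties with an unemployment column:

    import geopandas as gpd
    import matplotlib.pyplot as plt

    gdf = gpd.read_file('counties.shp')                 # hypothetical county boundaries
    print(gdf.crs)                                      # always check the CRS first
    ax = gdf.plot(column='unemployment', cmap='viridis',
                  legend=True, figsize=(8, 6))          # choropleth of a numeric column
    ax.set_axis_off()
    ax.set_title('County unemployment rate')
    plt.show()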

Spatial autocorrelation with PySAL:
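
A sketch of global and local Moran's I with libpysal and esda, continuing with the hypothetical GeoDataFrame from the mapping example:

    from libpysal.weights import Queen
    from esda.moran import Moran, Moran_Local

    w = Queen.from_dataframe(gdf)            # queen-contiguity weights from the GeoDataFrame
    w.transform = 'r'                        # row-standardize
    mi = Moran(gdf['unemployment'], w)
    print(mi.I, mi.p_sim)                    # global Moran's I and permutation p-value
    lisa = Moran_Local(gdf['unemployment'], w)
    gdf['lisa_p'] = lisa.p_sim               # local (LISA) p-values, one per county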


8.4 Time Series Tools

Time Series vs. Panel Data

Time series analysis (Chapters 7 and 16) focuses on a single unit observed over many periods—one country's GDP, one firm's stock price. Panel data (Chapter 15) involves many units over fewer periods. The methods differ: time series emphasizes autocorrelation, stationarity, and dynamic structure; panels emphasize cross-sectional variation and fixed effects. The tools here cover single-unit time series; for panel methods, see Chapter 18.

Time Series Basics

R with tseries and forecast:

Python with statsmodels:
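
A basic workflow sketch with pandas and statsmodels, assuming a hypothetical quarterly GDP series stored in gdp.csv:

    import pandas as pd
    import matplotlib.pyplot as plt
    from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

    # hypothetical CSV with 'date' and 'gdp' columns
    y = (pd.read_csv('gdp.csv', parse_dates=['date'], index_col='date')['gdp']
           .asfreq('QS'))                      # quarterly series, quarter-start frequency
    y.plot(title='GDP')                        # quick look at the level
    plot_acf(y.dropna(), lags=20)              # autocorrelation function
    plot_pacf(y.dropna(), lags=20)             # partial autocorrelation function
    plt.show()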

Stationarity Testing

R:

Python:
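
ADF and KPSS tests in statsmodels, continuing with the series y loaded above:

    from statsmodels.tsa.stattools import adfuller, kpss

    adf_stat, adf_p, *_ = adfuller(y.dropna())                              # H0: unit root
    print(f"ADF statistic {adf_stat:.3f}, p-value {adf_p:.3f}")
    kpss_stat, kpss_p, *_ = kpss(y.dropna(), regression='c', nlags='auto')  # H0: stationary
    print(f"KPSS statistic {kpss_stat:.3f}, p-value {kpss_p:.3f}")
    # the two tests have opposite nulls, so use them together as a cross-check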

Decomposition

R:

Python:
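
Classical decomposition with statsmodels, again using the quarterly series y from above:

    import matplotlib.pyplot as plt
    from statsmodels.tsa.seasonal import seasonal_decompose

    decomp = seasonal_decompose(y.interpolate(),       # fill gaps first; missing values are rejected
                                model='additive', period=4)   # period=4 for quarterly data
    fig = decomp.plot()                                 # observed, trend, seasonal, residual panels
    fig.set_size_inches(8, 6)
    plt.show()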

ARIMA Modeling

R:

Python:
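
An ARIMA sketch with statsmodels; the (1, 1, 1) order is purely illustrative and should be chosen from the data (information criteria, ACF/PACF).

    from statsmodels.tsa.arima.model import ARIMA

    model = ARIMA(y, order=(1, 1, 1))        # (p, d, q) chosen here for illustration only
    fit = model.fit()
    print(fit.summary())
    fc = fit.get_forecast(steps=8)           # eight quarters ahead
    print(fc.predicted_mean)
    print(fc.conf_int())                     # forecast intervals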

VAR Models

R:

Python:
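
A VAR sketch with statsmodels, assuming a hypothetical DataFrame df of stationary series such as growth rates:

    import matplotlib.pyplot as plt
    from statsmodels.tsa.api import VAR

    # df: hypothetical DataFrame of stationary series, one column per variable
    var_res = VAR(df).fit(maxlags=8, ic='aic')   # lag length chosen by AIC
    print(var_res.summary())
    irf = var_res.irf(10)                        # impulse responses, 10 periods ahead
    irf.plot(orth=True)                          # orthogonalized (Cholesky) impulse responses
    plt.show()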


8.5 Network Analysis

Why Networks Matter

Many social phenomena have network structure: social connections, trade relationships, citation patterns, corporate boards, legislative cosponsorship. Network analysis provides tools to describe and analyze these relational structures.

Figure 8.1: A simple network visualization showing nodes (actors) connected by edges (relationships). Node size reflects degree centrality; color indicates community membership detected by modularity optimization.

Network Basics

Key concepts:

  • Nodes (vertices): Actors in the network (people, firms, countries)

  • Edges (links): Relationships between actors (friendships, trades, citations)

  • Directed vs. undirected: Do edges have direction? (Twitter follows vs. Facebook friends)

  • Weighted vs. unweighted: Do edges have strength? (trade volume vs. trade existence)

R with igraph

Community detection:

Visualization:

Python with NetworkX

Community detection:
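
A NetworkX sketch using the built-in karate club network; greedy modularity maximization is one of several community-detection options.

    import networkx as nx
    from networkx.algorithms.community import greedy_modularity_communities

    G = nx.karate_club_graph()                       # classic built-in example network
    communities = greedy_modularity_communities(G)   # modularity-based community detection
    for i, members in enumerate(communities):
        print(f"Community {i}: {sorted(members)}")
    centrality = nx.degree_centrality(G)             # share of possible ties each node has
    print(max(centrality, key=centrality.get))       # most central node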

Visualization:
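
Continuing the example above, a quick plot with node size proportional to degree and color indicating community:

    import matplotlib.pyplot as plt

    pos = nx.spring_layout(G, seed=42)                        # reproducible force-directed layout
    deg = dict(G.degree())
    colors = [next(i for i, c in enumerate(communities) if n in c) for n in G]
    nx.draw_networkx(G, pos,
                     node_size=[60 * deg[n] for n in G],      # size by degree
                     node_color=colors, cmap=plt.cm.Set2,     # color by detected community
                     with_labels=False)
    plt.axis('off')
    plt.show()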

Common Network Analyses

Bipartite networks (two-mode networks):
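
A toy bipartite example in NetworkX (hypothetical boards and directors), projected onto one mode:

    import networkx as nx
    from networkx.algorithms import bipartite

    B = nx.Graph()
    B.add_nodes_from(['board_A', 'board_B'], bipartite=0)        # one node set: boards
    B.add_nodes_from(['dir_1', 'dir_2', 'dir_3'], bipartite=1)   # the other: directors
    B.add_edges_from([('board_A', 'dir_1'), ('board_A', 'dir_2'),
                      ('board_B', 'dir_2'), ('board_B', 'dir_3')])

    boards = {n for n, d in B.nodes(data=True) if d['bipartite'] == 0}
    proj = bipartite.weighted_projected_graph(B, boards)         # boards tied via shared directors
    print(list(proj.edges(data=True)))                           # edge weight = shared directors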

Export for further analysis:
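
Continuing with the graph G from earlier, one way to export the network for use in other tools:

    import networkx as nx

    edges = nx.to_pandas_edgelist(G)            # tidy edge list (source, target, attributes)
    edges.to_csv('network_edges.csv', index=False)
    nx.write_gexf(G, 'network.gexf')            # readable by Gephi for interactive exploration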


8.6 Web Scraping and APIs

Legal and Ethical Considerations

Web scraping exists in a legal gray area. Key principles: (1) Check the site's Terms of Service—some explicitly prohibit scraping. (2) Respect robots.txt directives. (3) Don't overwhelm servers—add delays between requests. (4) Consider whether data contains personal information (GDPR, IRB implications). (5) Academic research often receives more latitude, but "fair use" for research is not a blanket exemption. When in doubt, contact the data source or consult your institution's legal office.

Data Collection from the Web

The web is a vast source of data for social science research: government websites, corporate filings, news articles, social media, online platforms. Two main approaches:

  1. APIs (Application Programming Interfaces): Structured access provided by the data source

  2. Web scraping: Extracting data from web pages directly

When to use which:

  • Prefer APIs when available—they're more reliable and typically permitted

  • Use scraping when no API exists or API limits are too restrictive

  • Always check terms of service and robots.txt

Working with APIs

R with httr:

Python with requests:
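
A generic requests sketch against a placeholder endpoint; the pattern (query parameters, error check, JSON parsing) is the same for most REST APIs.

    import requests

    resp = requests.get('https://api.example.com/v1/indicators',   # placeholder endpoint
                        params={'country': 'US', 'year': 2020},
                        timeout=30)
    resp.raise_for_status()       # stop early on 4xx/5xx responses
    data = resp.json()            # parse the JSON payload into Python objects
    print(type(data))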

Example: FRED API (Federal Reserve Economic Data):
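
A sketch of pulling the U.S. unemployment rate from FRED; the endpoint and parameters follow FRED's documented API, and the key is a placeholder you obtain by registering.

    import requests
    import pandas as pd

    url = 'https://api.stlouisfed.org/fred/series/observations'
    params = {'series_id': 'UNRATE',            # U.S. unemployment rate
              'api_key': 'YOUR_FRED_KEY',       # placeholder; register for a free key
              'file_type': 'json'}
    resp = requests.get(url, params=params, timeout=30)
    resp.raise_for_status()
    obs = pd.DataFrame(resp.json()['observations'])
    obs['date'] = pd.to_datetime(obs['date'])
    obs['value'] = pd.to_numeric(obs['value'], errors='coerce')   # '.' marks missing values
    print(obs.tail())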

Web Scraping Basics

R with rvest:

Python with BeautifulSoup:
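
A minimal scraping sketch against a placeholder page; the selectors must be adapted to the actual page structure.

    import requests
    from bs4 import BeautifulSoup

    url = 'https://example.com/reports'                 # placeholder page
    resp = requests.get(url, timeout=30)
    soup = BeautifulSoup(resp.text, 'html.parser')
    for a in soup.find_all('a', href=True):             # every link and its text
        print(a.get_text(strip=True), a['href'])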

Scraping Multiple Pages
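
A sketch of polite pagination over a hypothetical archive that accepts a page parameter, with a delay between requests:

    import time
    import requests
    from bs4 import BeautifulSoup

    titles = []
    for page in range(1, 6):                                       # hypothetical paginated archive
        resp = requests.get('https://example.com/archive',
                            params={'page': page}, timeout=30)
        soup = BeautifulSoup(resp.text, 'html.parser')
        titles.extend(h2.get_text(strip=True) for h2 in soup.find_all('h2'))
        time.sleep(2)                                              # be polite between requests
    print(len(titles), 'items collected')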

Best Practices for Web Data Collection

Ethical and Legal Considerations

  1. Check terms of service: Many sites prohibit scraping

  2. Respect robots.txt: Check example.com/robots.txt for rules

  3. Rate limiting: Don't overwhelm servers—add delays between requests

  4. Identify yourself: Set a user-agent string with contact info

  5. Cache results: Don't re-scrape data you already have

  6. Consider privacy: Be careful with personal data

Setting a user agent:
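
In requests, the user agent is set through the headers argument; the string below is a placeholder to replace with your own project name and contact address.

    import requests

    headers = {'User-Agent': 'research-project/0.1 (contact: you@university.edu)'}
    resp = requests.get('https://example.com/data', headers=headers, timeout=30)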

Handling dynamic content (JavaScript-rendered pages):
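
A Selenium sketch (Selenium 4 syntax) for a hypothetical JavaScript-rendered page; it drives a real browser, so it is slower and needs Chrome installed locally.

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()                      # requires a local Chrome installation
    driver.get('https://example.com/dashboard')      # placeholder JavaScript-rendered page
    driver.implicitly_wait(10)                       # give scripts time to render
    rows = driver.find_elements(By.CSS_SELECTOR, 'table tr')
    data = [row.text for row in rows]
    driver.quit()
    print(data[:5])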


Practical Guidance

When to Use What

Task                    R Package           Python Package
Basic visualization     ggplot2             matplotlib, seaborn
Interactive plots       plotly              plotly
Text tokenization       tidytext            nltk, spacy
Topic models            topicmodels         sklearn, gensim
Spatial data            sf                  geopandas
Spatial statistics      spdep               pysal
Time series basics      stats, tseries      statsmodels
ARIMA                   forecast            statsmodels, pmdarima
VAR                     vars                statsmodels
Network analysis        igraph, ggraph      networkx
Web scraping            rvest               beautifulsoup4, requests
APIs                    httr, jsonlite      requests
Dynamic scraping        RSelenium           selenium

Common Pitfalls

Pitfall 1: Ignoring Missing Data in Time Series

Many time series functions fail silently with missing values or produce incorrect results.

How to avoid: Always check for and handle missing values explicitly. Use interpolation or indicator methods as appropriate.

Pitfall 2: Wrong CRS in Spatial Data

Mixing coordinate reference systems produces incorrect spatial operations.

How to avoid: Always check CRS with st_crs() (R) or .crs (Python). Transform to a common CRS before joining or overlaying.

Pitfall 3: Over-Interpreting Topic Models

LDA produces topics regardless of whether coherent topics exist.

How to avoid: Use coherence metrics. Validate topics substantively. Try different numbers of topics.

Environment Management for Reproducibility

Package versions matter. Code that runs today may fail next year when dependencies update. Environment management tools lock package versions so analyses remain reproducible.

R with renv

The renv.lock file records exact package versions. Commit it to version control so collaborators (and future you) can recreate the environment.

Python with conda

Conda environments are described by an environment.yml file that lists packages and versions; commit it to version control just as you would renv.lock.

Python with venv + pip

For lightweight, Python-only projects, a virtual environment plus a pinned requirements.txt file records the same information.

Box: Which Environment Tool?

Tool          Language       Best For
renv          R              Most R projects; plays well with RStudio
conda         Python (+ R)   Data science; handles non-Python dependencies (C libraries, GDAL)
venv + pip    Python         Lightweight Python-only projects
Docker        Any            Maximum reproducibility; bundles OS + dependencies (see Chapter 26)

Key principle: Record your environment in a lockfile committed to version control. A README saying "install pandas" is not reproducible; requirements.txt or renv.lock is.

Implementation Checklist

Visualization:

Text analysis:

Spatial:

Time series:


Summary

Key takeaways:

  1. ggplot2 (R) and matplotlib/seaborn (Python) handle most visualization needs; use plotly for interactivity.

  2. Text analysis follows a standard pipeline: tokenize, clean, create document-term matrix, then analyze with topic models or sentiment.

  3. Spatial analysis requires attention to coordinate reference systems; sf (R) and geopandas (Python) are the workhorses.

  4. Time series analysis starts with stationarity testing; forecast (R) and statsmodels (Python) provide comprehensive ARIMA and VAR tools.

  5. Network analysis with igraph (R) or NetworkX (Python) enables study of relational structures—centrality, communities, and network statistics.

  6. Web data collection through APIs is preferred when available; web scraping with rvest (R) or BeautifulSoup (Python) works when APIs don't exist—but respect terms of service and rate limits.

Returning to the opening question: The tools in this chapter implement the methods from Chapters 5-7. The key is matching the tool to the task while maintaining reproducible practices from Chapter 4. Both R and Python ecosystems are mature enough for any common task; choose based on your broader workflow and team preferences.


Further Reading

Essential

  • Wickham, H. (2016). "ggplot2: Elegant Graphics for Data Analysis." Springer.

  • Lovelace, R., J. Nowosad, and J. Muenchow (2019). "Geocomputation with R." CRC Press. [Free online]

For Deeper Understanding

  • Silge, J. and D. Robinson (2017). "Text Mining with R." O'Reilly. [Free online]

  • Hyndman, R. and G. Athanasopoulos (2021). "Forecasting: Principles and Practice." 3rd ed. [Free online]

Advanced/Specialized

  • Rey, S. et al. "Geographic Data Science with Python." [pysal documentation]

  • VanderPlas, J. (2016). "Python Data Science Handbook." O'Reilly. [Free online]

Network Analysis

  • Kolaczyk, E. and G. Csárdi (2014). "Statistical Analysis of Network Data with R." Springer.

  • Newman, M. (2018). "Networks." 2nd ed. Oxford University Press.

Digital Humanities and Interdisciplinary Methods

  • The Programming Historian (https://programminghistorian.org/en/lessons/). Peer-reviewed tutorials on digital methods for humanities research, including:

    • OCR and digitizing historical documents

    • Named entity recognition and extracting geographic/economic information from text

    • Network analysis of historical correspondence and social networks

    • Web scraping for archival research

    • Machine translation and multilingual text analysis

  • Jurafsky, D. and J. Martin. "Speech and Language Processing." 3rd ed. [Free online draft] Comprehensive treatment of NLP.

  • Jockers, M. (2014). "Text Analysis with R for Students of Literature." Springer. Accessible introduction to computational text analysis.


Exercises

Conceptual

  1. When would you choose interactive visualization over static? What are the tradeoffs?

  2. Explain why coordinate reference systems matter for spatial analysis. What problems can arise from mixing CRS?

Applied

  1. Take a corpus of text (academic abstracts, news articles, or tweets) and:

    • Preprocess the text

    • Fit a topic model with 5 topics

    • Visualize the top words per topic

    • Assess whether the topics are coherent

  2. Using county-level data (unemployment, income, or another variable):

    • Create a choropleth map

    • Test for spatial autocorrelation

    • Identify clusters using Local Moran's I

  3. Using a macroeconomic time series (GDP, unemployment, or inflation):

    • Test for stationarity

    • Fit an appropriate ARIMA model

    • Generate forecasts with confidence intervals

    • Assess forecast accuracy on held-out data

Discussion

  1. Some argue that interactive visualizations are more engaging but harder to reproduce. How would you balance these concerns in a research project?
