Chapter 8: Programming Companion—Data Exploration

Opening Question

How do we implement the descriptive and exploratory methods from Part II in practical code?


Chapter Overview

This chapter provides practical implementations of the descriptive and exploratory methods covered in Chapters 5-7. We cover six areas: visualization, text analysis, spatial data, time series, network analysis, and web data collection. For each, we show how to accomplish common tasks in both R and Python.

The emphasis is on practical workflow rather than comprehensive coverage. We show the tools most commonly needed for empirical research, with pointers to more specialized resources for advanced applications.

What you will learn:

  • How to create publication-quality visualizations with ggplot2 and matplotlib

  • How to process and analyze text data

  • How to work with spatial data and create maps

  • How to analyze time series data: decomposition, testing, and forecasting

  • How to analyze network data: structure, centrality, and communities

  • How to collect data from the web: APIs and web scraping

Prerequisites: Chapter 4 (programming foundations), Chapters 5-7 (conceptual foundations)


8.1 Visualization

Principles Review

Good visualizations should:

  • Show the data clearly

  • Avoid distortion

  • Serve a purpose (exploration, presentation, or both)

  • Be accessible (consider colorblindness, clarity at different sizes)

The Two Audiences

Visualizations serve two distinct purposes: exploration (for you) and presentation (for readers). Exploratory plots can be rough—quick histograms, scatter matrices, diagnostic checks. Presentation figures must be polished—clear labels, appropriate scales, thoughtful color choices. Don't confuse the two: spending hours on a plot you'll never show wastes time; presenting a quick-and-dirty plot to readers wastes their attention.

ggplot2 (R)

ggplot2 uses a "grammar of graphics" approach: build plots from components.

Basic structure:

Common plot types:

Coefficient plots:

Event study plots:

Saving plots:

matplotlib/seaborn (Python)

Common plot types:
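
A minimal sketch of the usual matplotlib/seaborn workflow, using seaborn's bundled tips example data (fetched on first use); the histogram and grouped scatterplot below are the kinds of quick exploratory plots described above.

    import matplotlib.pyplot as plt
    import seaborn as sns

    df = sns.load_dataset('tips')                      # small example dataset shipped with seaborn
    fig, axes = plt.subplots(1, 2, figsize=(10, 4))
    sns.histplot(data=df, x='total_bill', ax=axes[0])  # distribution of one variable
    sns.scatterplot(data=df, x='total_bill', y='tip',
                    hue='time', ax=axes[1])            # relationship between two, colored by group
    axes[0].set_xlabel('Total bill ($)')
    axes[1].set_xlabel('Total bill ($)')
    plt.tight_layout()
    plt.show()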

Coefficient plots:
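
One way to draw a coefficient plot in matplotlib, with made-up point estimates and standard errors standing in for real regression output:

    import matplotlib.pyplot as plt
    import numpy as np

    # hypothetical estimates and standard errors from a regression
    names = ['Education', 'Experience', 'Union member']
    coefs = np.array([0.08, 0.02, -0.15])
    ses = np.array([0.010, 0.005, 0.040])

    fig, ax = plt.subplots(figsize=(6, 3))
    ax.errorbar(coefs, np.arange(len(names)), xerr=1.96 * ses,
                fmt='o', capsize=3)                    # point estimate with 95% CI
    ax.axvline(0, color='grey', linestyle='--')        # reference line at zero
    ax.set_yticks(np.arange(len(names)))
    ax.set_yticklabels(names)
    ax.set_xlabel('Coefficient estimate')
    plt.tight_layout()
    plt.show()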

Interactive Visualization

R with plotly:

Python with plotly:
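
A short plotly example, using the gapminder data bundled with plotly express; hovering reveals country names, which is the main payoff of interactivity:

    import plotly.express as px

    df = px.data.gapminder().query("year == 2007")     # example dataset bundled with plotly
    fig = px.scatter(df, x='gdpPercap', y='lifeExp',
                     size='pop', color='continent',
                     hover_name='country', log_x=True) # hover shows the country name
    fig.show()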


8.2 Text Analysis

Text Processing Pipeline

Text analysis typically follows these steps:

  1. Load raw text

  2. Clean and normalize

  3. Tokenize

  4. Remove stop words, stem/lemmatize

  5. Create document-term matrix

  6. Analyze (topic models, sentiment, etc.)

R with tidytext

Topic modeling with LDA:

Sentiment analysis:

Python with scikit-learn and NLTK

Topic modeling:
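
A sketch of LDA with scikit-learn on a tiny placeholder corpus; real applications need far more documents and some tuning of the vocabulary and the number of topics.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    docs = ["tariffs and trade policy dominate the debate",
            "the central bank raised rates to curb inflation",
            "candidates campaign ahead of the election"]      # placeholder corpus

    vec = CountVectorizer(stop_words='english')
    dtm = vec.fit_transform(docs)                             # document-term matrix
    lda = LatentDirichletAllocation(n_components=2, random_state=0)
    lda.fit(dtm)

    terms = vec.get_feature_names_out()
    for k, topic in enumerate(lda.components_):
        top = [terms[i] for i in topic.argsort()[-5:][::-1]]  # five highest-weight words
        print(f"Topic {k}: {', '.join(top)}")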

Sentiment with VADER (for social media text):
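
A minimal VADER example via NLTK; the lexicon is downloaded once.

    import nltk
    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    nltk.download('vader_lexicon')                      # one-time lexicon download
    sia = SentimentIntensityAnalyzer()
    print(sia.polarity_scores("This policy is a disaster."))
    print(sia.polarity_scores("Great news, unemployment fell again!"))
    # the 'compound' score runs from -1 (most negative) to +1 (most positive)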

Beyond Basic Text Analysis: Interdisciplinary Applications

Text as data extends well beyond topic modeling and sentiment. Researchers across disciplines use computational text methods to extract structured information from unstructured sources:

Optical Character Recognition (OCR): Converting scanned historical documents into machine-readable text. Libraries like Tesseract (via tesseract in R or pytesseract in Python) handle OCR, though historical fonts and degraded documents require careful preprocessing.
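
A minimal pytesseract sketch, assuming the Tesseract engine is installed locally and using a hypothetical scanned page:

    from PIL import Image
    import pytesseract    # Python wrapper; the Tesseract engine itself is installed separately

    text = pytesseract.image_to_string(Image.open('scan_page_12.png'),  # hypothetical scan
                                       lang='eng')
    print(text[:500])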

Named Entity Recognition (NER): Extracting people, places, organizations, and dates from text. SpaCy (Python) and spacyr (R) provide pre-trained models; fine-tuning on domain-specific corpora improves accuracy for historical or technical texts.
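
A short spaCy sketch using the small English model (installed separately with python -m spacy download en_core_web_sm):

    import spacy

    nlp = spacy.load('en_core_web_sm')    # small pre-trained English model
    doc = nlp("The Bank of England raised rates in London on 2 August 2018.")
    for ent in doc.ents:
        print(ent.text, ent.label_)       # e.g., ORG, GPE, DATE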

Extracting economic information: Prices, quantities, exchange rates, and other numeric data often appear in unstructured text (historical newspapers, corporate filings, government reports). Regular expressions combined with NER can systematically extract these values.
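
For example, a regular expression for dollar amounts might look like the following; the pattern and text are illustrative only, and historical currencies need their own patterns.

    import re

    text = "Wheat sold at $4.25 per bushel in Chicago; flour reached $12,300 per carload."
    # dollar amounts with optional thousands separators and decimals
    prices = re.findall(r'\$\d{1,3}(?:,\d{3})*(?:\.\d+)?', text)
    print(prices)    # ['$4.25', '$12,300']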

Machine translation: Multilingual corpora—historical diplomatic correspondence, cross-national news coverage, colonial archives—require translation before analysis. Google Translate API, DeepL, and open-source models (via Hugging Face) enable programmatic translation at scale.

Geocoding text: Place names in historical documents can be converted to coordinates for spatial analysis using geocoding APIs (Google Maps, OpenStreetMap Nominatim) or historical gazetteers.
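
A geocoding sketch with geopy's Nominatim interface; the user_agent string is a placeholder you should replace with your own contact details.

    from geopy.geocoders import Nominatim

    geolocator = Nominatim(user_agent='research-project (you@university.edu)')  # identify yourself
    loc = geolocator.geocode('Trieste, Italy')
    print(loc.latitude, loc.longitude)
    # Nominatim's usage policy asks for at most one request per second; cache results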

For Further Study

The Programming Historian (https://programminghistorian.org/en/lessons/) offers peer-reviewed tutorials on these methods, written for humanities researchers but valuable for any discipline working with text as data. The methodological pluralism of social science increasingly requires moving between quantitative analysis and computational text methods.


8.3 Spatial Data

Spatial Data Basics

Spatial data comes in two main forms:

  • Vector: Points, lines, polygons (boundaries)

  • Raster: Grid cells (satellite imagery, elevation)

We focus on vector data, which is more common in social science.

R with sf

Mapping with ggplot2:

Spatial autocorrelation:

Python with geopandas

Mapping:
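
A minimal geopandas choropleth, assuming a hypothetical shapefile of counties with an unemployment column:

    import geopandas as gpd
    import matplotlib.pyplot as plt

    gdf = gpd.read_file('counties.shp')                 # hypothetical county boundaries
    print(gdf.crs)                                      # always check the CRS first
    ax = gdf.plot(column='unemployment', cmap='viridis',
                  legend=True, figsize=(8, 6))          # choropleth of a numeric column
    ax.set_axis_off()
    ax.set_title('County unemployment rate')
    plt.show()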

Spatial autocorrelation with PySAL:
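
A sketch of global and local Moran's I with libpysal and esda, continuing with the hypothetical GeoDataFrame from the mapping example:

    from libpysal.weights import Queen
    from esda.moran import Moran, Moran_Local

    w = Queen.from_dataframe(gdf)            # queen-contiguity weights from the GeoDataFrame
    w.transform = 'r'                        # row-standardize
    mi = Moran(gdf['unemployment'], w)
    print(mi.I, mi.p_sim)                    # global Moran's I and permutation p-value
    lisa = Moran_Local(gdf['unemployment'], w)
    gdf['lisa_p'] = lisa.p_sim               # local (LISA) p-values, one per county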


8.4 Time Series Tools

Time Series vs. Panel Data

Time series analysis (Chapters 7 and 16) focuses on a single unit observed over many periods—one country's GDP, one firm's stock price. Panel data (Chapter 15) involves many units over fewer periods. The methods differ: time series emphasizes autocorrelation, stationarity, and dynamic structure; panels emphasize cross-sectional variation and fixed effects. The tools here cover single-unit time series; for panel methods, see Chapter 18.

Time Series Basics

R with tseries and forecast:

Python with statsmodels:
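
A basic workflow sketch with pandas and statsmodels, assuming a hypothetical quarterly GDP series stored in gdp.csv:

    import pandas as pd
    import matplotlib.pyplot as plt
    from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

    # hypothetical CSV with 'date' and 'gdp' columns
    y = (pd.read_csv('gdp.csv', parse_dates=['date'], index_col='date')['gdp']
           .asfreq('QS'))                      # quarterly series, quarter-start frequency
    y.plot(title='GDP')                        # quick look at the level
    plot_acf(y.dropna(), lags=20)              # autocorrelation function
    plot_pacf(y.dropna(), lags=20)             # partial autocorrelation function
    plt.show()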

Stationarity Testing

R:

Python:
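
ADF and KPSS tests in statsmodels, continuing with the series y loaded above:

    from statsmodels.tsa.stattools import adfuller, kpss

    adf_stat, adf_p, *_ = adfuller(y.dropna())                              # H0: unit root
    print(f"ADF statistic {adf_stat:.3f}, p-value {adf_p:.3f}")
    kpss_stat, kpss_p, *_ = kpss(y.dropna(), regression='c', nlags='auto')  # H0: stationary
    print(f"KPSS statistic {kpss_stat:.3f}, p-value {kpss_p:.3f}")
    # the two tests have opposite nulls, so use them together as a cross-check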

Decomposition

R:

Python:
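
Classical decomposition with statsmodels, again using the quarterly series y from above:

    import matplotlib.pyplot as plt
    from statsmodels.tsa.seasonal import seasonal_decompose

    decomp = seasonal_decompose(y.interpolate(),       # fill gaps first; missing values are rejected
                                model='additive', period=4)   # period=4 for quarterly data
    fig = decomp.plot()                                 # observed, trend, seasonal, residual panels
    fig.set_size_inches(8, 6)
    plt.show()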

ARIMA Modeling

R:

Python:
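
An ARIMA sketch with statsmodels; the (1, 1, 1) order is purely illustrative and should be chosen from the data (information criteria, ACF/PACF).

    from statsmodels.tsa.arima.model import ARIMA

    model = ARIMA(y, order=(1, 1, 1))        # (p, d, q) chosen here for illustration only
    fit = model.fit()
    print(fit.summary())
    fc = fit.get_forecast(steps=8)           # eight quarters ahead
    print(fc.predicted_mean)
    print(fc.conf_int())                     # forecast intervals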

VAR Models

R:

Python:
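
A VAR sketch with statsmodels, assuming a hypothetical DataFrame df of stationary series such as growth rates:

    import matplotlib.pyplot as plt
    from statsmodels.tsa.api import VAR

    # df: hypothetical DataFrame of stationary series, one column per variable
    var_res = VAR(df).fit(maxlags=8, ic='aic')   # lag length chosen by AIC
    print(var_res.summary())
    irf = var_res.irf(10)                        # impulse responses, 10 periods ahead
    irf.plot(orth=True)                          # orthogonalized (Cholesky) impulse responses
    plt.show()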


8.5 Network Analysis

Why Networks Matter

Many social phenomena have network structure: social connections, trade relationships, citation patterns, corporate boards, legislative cosponsorship. Network analysis provides tools to describe and analyze these relational structures.

Figure 8.1: A simple network visualization showing nodes (actors) connected by edges (relationships). Node size reflects degree centrality; color indicates community membership detected by modularity optimization.

Network Basics

Key concepts:

  • Nodes (vertices): Actors in the network (people, firms, countries)

  • Edges (links): Relationships between actors (friendships, trades, citations)

  • Directed vs. undirected: Do edges have direction? (Twitter follows vs. Facebook friends)

  • Weighted vs. unweighted: Do edges have strength? (trade volume vs. trade existence)

R with igraph

Community detection:

Visualization:

Python with NetworkX

Community detection:
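
A NetworkX sketch using the built-in karate club network; greedy modularity maximization is one of several community-detection options.

    import networkx as nx
    from networkx.algorithms.community import greedy_modularity_communities

    G = nx.karate_club_graph()                       # classic built-in example network
    communities = greedy_modularity_communities(G)   # modularity-based community detection
    for i, members in enumerate(communities):
        print(f"Community {i}: {sorted(members)}")
    centrality = nx.degree_centrality(G)             # share of possible ties each node has
    print(max(centrality, key=centrality.get))       # most central node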

Visualization:
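
Continuing the example above, a quick plot with node size proportional to degree and color indicating community:

    import matplotlib.pyplot as plt

    pos = nx.spring_layout(G, seed=42)                        # reproducible force-directed layout
    deg = dict(G.degree())
    colors = [next(i for i, c in enumerate(communities) if n in c) for n in G]
    nx.draw_networkx(G, pos,
                     node_size=[60 * deg[n] for n in G],      # size by degree
                     node_color=colors, cmap=plt.cm.Set2,     # color by detected community
                     with_labels=False)
    plt.axis('off')
    plt.show()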

Common Network Analyses

Bipartite networks (two-mode networks):
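
A toy bipartite example in NetworkX (hypothetical boards and directors), projected onto one mode:

    import networkx as nx
    from networkx.algorithms import bipartite

    B = nx.Graph()
    B.add_nodes_from(['board_A', 'board_B'], bipartite=0)        # one node set: boards
    B.add_nodes_from(['dir_1', 'dir_2', 'dir_3'], bipartite=1)   # the other: directors
    B.add_edges_from([('board_A', 'dir_1'), ('board_A', 'dir_2'),
                      ('board_B', 'dir_2'), ('board_B', 'dir_3')])

    boards = {n for n, d in B.nodes(data=True) if d['bipartite'] == 0}
    proj = bipartite.weighted_projected_graph(B, boards)         # boards tied via shared directors
    print(list(proj.edges(data=True)))                           # edge weight = shared directors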

Export for further analysis:
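
Continuing with the graph G from earlier, one way to export the network for use in other tools:

    import networkx as nx

    edges = nx.to_pandas_edgelist(G)            # tidy edge list (source, target, attributes)
    edges.to_csv('network_edges.csv', index=False)
    nx.write_gexf(G, 'network.gexf')            # readable by Gephi for interactive exploration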


8.6 Web Scraping and APIs

Legal and Ethical Considerations

Web scraping exists in a legal gray area. Key principles: (1) Check the site's Terms of Service—some explicitly prohibit scraping. (2) Respect robots.txt directives. (3) Don't overwhelm servers—add delays between requests. (4) Consider whether data contains personal information (GDPR, IRB implications). (5) Academic research often receives more latitude, but "fair use" for research is not a blanket exemption. When in doubt, contact the data source or consult your institution's legal office.

Data Collection from the Web

The web is a vast source of data for social science research: government websites, corporate filings, news articles, social media, online platforms. Two main approaches:

  1. APIs (Application Programming Interfaces): Structured access provided by the data source

  2. Web scraping: Extracting data from web pages directly

When to use which:

  • Prefer APIs when available—they're more reliable and typically permitted

  • Use scraping when no API exists or API limits are too restrictive

  • Always check terms of service and robots.txt

Working with APIs

R with httr:

Python with requests:
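
A generic requests sketch against a placeholder endpoint; the pattern (query parameters, error check, JSON parsing) is the same for most REST APIs.

    import requests

    resp = requests.get('https://api.example.com/v1/indicators',   # placeholder endpoint
                        params={'country': 'US', 'year': 2020},
                        timeout=30)
    resp.raise_for_status()       # stop early on 4xx/5xx responses
    data = resp.json()            # parse the JSON payload into Python objects
    print(type(data))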

Example: FRED API (Federal Reserve Economic Data):
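
A sketch of pulling the U.S. unemployment rate from FRED; the endpoint and parameters follow FRED's documented API, and the key is a placeholder you obtain by registering.

    import requests
    import pandas as pd

    url = 'https://api.stlouisfed.org/fred/series/observations'
    params = {'series_id': 'UNRATE',            # U.S. unemployment rate
              'api_key': 'YOUR_FRED_KEY',       # placeholder; register for a free key
              'file_type': 'json'}
    resp = requests.get(url, params=params, timeout=30)
    resp.raise_for_status()
    obs = pd.DataFrame(resp.json()['observations'])
    obs['date'] = pd.to_datetime(obs['date'])
    obs['value'] = pd.to_numeric(obs['value'], errors='coerce')   # '.' marks missing values
    print(obs.tail())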

Web Scraping Basics

R with rvest:

Python with BeautifulSoup:
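
A minimal scraping sketch against a placeholder page; the selectors must be adapted to the actual page structure.

    import requests
    from bs4 import BeautifulSoup

    url = 'https://example.com/reports'                 # placeholder page
    resp = requests.get(url, timeout=30)
    soup = BeautifulSoup(resp.text, 'html.parser')
    for a in soup.find_all('a', href=True):             # every link and its text
        print(a.get_text(strip=True), a['href'])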

Scraping Multiple Pages
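
A sketch of polite pagination over a hypothetical archive that accepts a page parameter, with a delay between requests:

    import time
    import requests
    from bs4 import BeautifulSoup

    titles = []
    for page in range(1, 6):                                       # hypothetical paginated archive
        resp = requests.get('https://example.com/archive',
                            params={'page': page}, timeout=30)
        soup = BeautifulSoup(resp.text, 'html.parser')
        titles.extend(h2.get_text(strip=True) for h2 in soup.find_all('h2'))
        time.sleep(2)                                              # be polite between requests
    print(len(titles), 'items collected')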

Best Practices for Web Data Collection

Ethical and Legal Considerations

  1. Check terms of service: Many sites prohibit scraping

  2. Respect robots.txt: Check example.com/robots.txt for rules

  3. Rate limiting: Don't overwhelm servers—add delays between requests

  4. Identify yourself: Set a user-agent string with contact info

  5. Cache results: Don't re-scrape data you already have

  6. Consider privacy: Be careful with personal data

Setting a user agent:
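
In requests, the user agent is set through the headers argument; the string below is a placeholder to replace with your own project name and contact address.

    import requests

    headers = {'User-Agent': 'research-project/0.1 (contact: you@university.edu)'}
    resp = requests.get('https://example.com/data', headers=headers, timeout=30)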

Handling dynamic content (JavaScript-rendered pages):
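
A Selenium sketch (Selenium 4 syntax) for a hypothetical JavaScript-rendered page; it drives a real browser, so it is slower and needs Chrome installed locally.

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()                      # requires a local Chrome installation
    driver.get('https://example.com/dashboard')      # placeholder JavaScript-rendered page
    driver.implicitly_wait(10)                       # give scripts time to render
    rows = driver.find_elements(By.CSS_SELECTOR, 'table tr')
    data = [row.text for row in rows]
    driver.quit()
    print(data[:5])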


Practical Guidance

When to Use What

Task                    R Package           Python Package
Basic visualization     ggplot2             matplotlib, seaborn
Interactive plots       plotly              plotly
Text tokenization       tidytext            nltk, spacy
Topic models            topicmodels         sklearn, gensim
Spatial data            sf                  geopandas
Spatial statistics      spdep               pysal
Time series basics      stats, tseries      statsmodels
ARIMA                   forecast            statsmodels, pmdarima
VAR                     vars                statsmodels
Network analysis        igraph, ggraph      networkx
Web scraping            rvest               beautifulsoup4, requests
APIs                    httr, jsonlite      requests
Dynamic scraping        RSelenium           selenium

Common Pitfalls

Pitfall 1: Ignoring Missing Data in Time Series

Many time series functions fail silently with missing values or produce incorrect results.

How to avoid: Always check for and handle missing values explicitly. Use interpolation or indicator methods as appropriate.

Pitfall 2: Wrong CRS in Spatial Data

Mixing coordinate reference systems produces incorrect spatial operations.

How to avoid: Always check CRS with st_crs() (R) or .crs (Python). Transform to a common CRS before joining or overlaying.

Pitfall 3: Over-Interpreting Topic Models

LDA produces topics regardless of whether coherent topics exist.

How to avoid: Use coherence metrics. Validate topics substantively. Try different numbers of topics.

Environment Management for Reproducibility

Package versions matter. Code that runs today may fail next year when dependencies update. Environment management tools lock package versions so analyses remain reproducible.

R with renv

The renv.lock file records exact package versions. Commit it to version control so collaborators (and future you) can recreate the environment.

Python with conda

Conda environments are described by an environment.yml file that lists packages and versions; commit it to version control just as you would renv.lock.

Python with venv + pip

For lightweight, Python-only projects, a virtual environment plus a pinned requirements.txt file records the same information.

Box: Which Environment Tool?

Tool          Language       Best For
renv          R              Most R projects; plays well with RStudio
conda         Python (+ R)   Data science; handles non-Python dependencies (C libraries, GDAL)
venv + pip    Python         Lightweight Python-only projects
Docker        Any            Maximum reproducibility; bundles OS + dependencies (see Chapter 26)

Key principle: Record your environment in a lockfile committed to version control. A README saying "install pandas" is not reproducible; requirements.txt or renv.lock is.

Implementation Checklist

Visualization:

Text analysis:

Spatial:

Time series:


Summary

Key takeaways:

  1. ggplot2 (R) and matplotlib/seaborn (Python) handle most visualization needs; use plotly for interactivity.

  2. Text analysis follows a standard pipeline: tokenize, clean, create document-term matrix, then analyze with topic models or sentiment.

  3. Spatial analysis requires attention to coordinate reference systems; sf (R) and geopandas (Python) are the workhorses.

  4. Time series analysis starts with stationarity testing; forecast (R) and statsmodels (Python) provide comprehensive ARIMA and VAR tools.

  5. Network analysis with igraph (R) or NetworkX (Python) enables study of relational structures—centrality, communities, and network statistics.

  6. Web data collection through APIs is preferred when available; web scraping with rvest (R) or BeautifulSoup (Python) works when APIs don't exist—but respect terms of service and rate limits.

Returning to the opening question: The tools in this chapter implement the methods from Chapters 5-7. The key is matching the tool to the task while maintaining reproducible practices from Chapter 4. Both R and Python ecosystems are mature enough for any common task; choose based on your broader workflow and team preferences.


Further Reading

Essential

  • Wickham, H. (2016). "ggplot2: Elegant Graphics for Data Analysis." Springer.

  • Lovelace, R., J. Nowosad, and J. Muenchow (2019). "Geocomputation with R." CRC Press. [Free online]

For Deeper Understanding

  • Silge, J. and D. Robinson (2017). "Text Mining with R." O'Reilly. [Free online]

  • Hyndman, R. and G. Athanasopoulos (2021). "Forecasting: Principles and Practice." 3rd ed. [Free online]

Advanced/Specialized

  • Rey, S. et al. "Geographic Data Science with Python." [pysal documentation]

  • VanderPlas, J. (2016). "Python Data Science Handbook." O'Reilly. [Free online]

Network Analysis

  • Kolaczyk, E. and G. Csárdi (2014). "Statistical Analysis of Network Data with R." Springer.

  • Newman, M. (2018). "Networks." 2nd ed. Oxford University Press.

Digital Humanities and Interdisciplinary Methods

  • The Programming Historian (https://programminghistorian.org/en/lessons/). Peer-reviewed tutorials on digital methods for humanities research, including:

    • OCR and digitizing historical documents

    • Named entity recognition and extracting geographic/economic information from text

    • Network analysis of historical correspondence and social networks

    • Web scraping for archival research

    • Machine translation and multilingual text analysis

  • Jurafsky, D. and J. Martin. "Speech and Language Processing." 3rd ed. [Free online draft] Comprehensive treatment of NLP.

  • Jockers, M. (2014). "Text Analysis with R for Students of Literature." Springer. Accessible introduction to computational text analysis.


Exercises

Conceptual

  1. When would you choose interactive visualization over static? What are the tradeoffs?

  2. Explain why coordinate reference systems matter for spatial analysis. What problems can arise from mixing CRS?

Applied

  1. Take a corpus of text (academic abstracts, news articles, or tweets) and:

    • Preprocess the text

    • Fit a topic model with 5 topics

    • Visualize the top words per topic

    • Assess whether the topics are coherent

  2. Using county-level data (unemployment, income, or another variable):

    • Create a choropleth map

    • Test for spatial autocorrelation

    • Identify clusters using Local Moran's I

  3. Using a macroeconomic time series (GDP, unemployment, or inflation):

    • Test for stationarity

    • Fit an appropriate ARIMA model

    • Generate forecasts with confidence intervals

    • Assess forecast accuracy on held-out data

Discussion

  1. Some argue that interactive visualizations are more engaging but harder to reproduce. How would you balance these concerns in a research project?
