Chapter 8: Programming Companion—Data Exploration
Opening Question
How do we implement the descriptive and exploratory methods from Part II in practical code?
Chapter Overview
This chapter provides practical implementations of the descriptive and exploratory methods covered in Chapters 5-7. We cover six areas: visualization, text analysis, spatial data, time series, network analysis, and web data collection. For each, we show how to accomplish common tasks in both R and Python.
The emphasis is on practical workflow rather than comprehensive coverage. We show the tools most commonly needed for empirical research, with pointers to more specialized resources for advanced applications.
What you will learn:
How to create publication-quality visualizations with ggplot2 and matplotlib
How to process and analyze text data
How to work with spatial data and create maps
How to analyze time series data: decomposition, testing, and forecasting
How to analyze network data: structure, centrality, and communities
How to collect data from the web: APIs and web scraping
Prerequisites: Chapter 4 (programming foundations), Chapters 5-7 (conceptual foundations)
8.1 Visualization
Principles Review
Good visualizations should:
Show the data clearly
Avoid distortion
Serve a purpose (exploration, presentation, or both)
Be accessible (consider colorblindness, clarity at different sizes)
The Two Audiences
Visualizations serve two distinct purposes: exploration (for you) and presentation (for readers). Exploratory plots can be rough—quick histograms, scatter matrices, diagnostic checks. Presentation figures must be polished—clear labels, appropriate scales, thoughtful color choices. Don't confuse the two: spending hours on a plot you'll never show wastes time; presenting a quick-and-dirty plot to readers wastes their attention.
ggplot2 (R)
ggplot2 uses a "grammar of graphics" approach: build plots from components.
Basic structure:
Common plot types:
Coefficient plots:
Event study plots:
Saving plots:
matplotlib/seaborn (Python)
Common plot types:
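A minimal sketch of common exploratory plots with matplotlib and seaborn, using simulated data (the data frame and column names are placeholders):

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Simulated data frame standing in for your own data
rng = np.random.default_rng(42)
df = pd.DataFrame({"x": rng.normal(size=200),
                   "group": rng.choice(["A", "B"], size=200)})
df["y"] = 2 * df["x"] + rng.normal(size=200)

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
axes[0].hist(df["x"], bins=30)                                    # histogram
axes[0].set_title("Distribution of x")
sns.scatterplot(data=df, x="x", y="y", hue="group", ax=axes[1])   # scatter by group
axes[1].set_title("y vs. x")
sns.boxplot(data=df, x="group", y="y", ax=axes[2])                # boxplot by group
axes[2].set_title("y by group")
fig.tight_layout()
fig.savefig("exploratory_plots.png", dpi=300)                     # save at print resolution
```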
Coefficient plots:
Interactive Visualization
R with plotly:
Python with plotly:
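A minimal interactive sketch with plotly express, using its built-in gapminder example data; writing to HTML produces a self-contained file you can share:

```python
import plotly.express as px

# Built-in example data shipped with plotly express
df = px.data.gapminder().query("year == 2007")

fig = px.scatter(df, x="gdpPercap", y="lifeExp", size="pop",
                 color="continent", hover_name="country", log_x=True)
fig.write_html("gapminder_2007.html")   # self-contained interactive file
fig.show()
```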
8.2 Text Analysis
Text Processing Pipeline
Text analysis typically follows these steps:
Load raw text
Clean and normalize
Tokenize
Remove stop words, stem/lemmatize
Create document-term matrix
Analyze (topic models, sentiment, etc.)
R with tidytext
Topic modeling with LDA:
Sentiment analysis:
Python with scikit-learn and NLTK
Topic modeling:
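A minimal sketch of the pipeline through topic modeling with scikit-learn; the four toy documents are placeholders for a real corpus:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["tax policy and government spending",
        "central bank raises interest rates",
        "election turnout and voter registration",
        "inflation expectations and monetary policy"]   # replace with your corpus

# Document-term matrix with English stop words removed
vectorizer = CountVectorizer(stop_words="english")
dtm = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)   # k = 2 topics
lda.fit(dtm)

# Top words per topic
terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[::-1][:5]]
    print(f"Topic {k}: {', '.join(top)}")
```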
Sentiment with VADER (for social media text):
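A minimal sketch with NLTK's VADER analyzer (the example texts are made up); the compound score ranges from -1 (most negative) to +1 (most positive):

```python
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)   # one-time download of the lexicon

sia = SentimentIntensityAnalyzer()
texts = ["I love this new policy!",
         "Worst decision ever, absolutely terrible."]

for text in texts:
    scores = sia.polarity_scores(text)       # neg, neu, pos, and compound
    print(text, "->", scores["compound"])
```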
Beyond Basic Text Analysis: Interdisciplinary Applications
Text as data extends well beyond topic modeling and sentiment. Researchers across disciplines use computational text methods to extract structured information from unstructured sources:
Optical Character Recognition (OCR): Converting scanned historical documents into machine-readable text. Libraries like Tesseract (via tesseract in R or pytesseract in Python) handle OCR, though historical fonts and degraded documents require careful preprocessing.
Named Entity Recognition (NER): Extracting people, places, organizations, and dates from text. SpaCy (Python) and spacyr (R) provide pre-trained models; fine-tuning on domain-specific corpora improves accuracy for historical or technical texts.
Extracting economic information: Prices, quantities, exchange rates, and other numeric data often appear in unstructured text (historical newspapers, corporate filings, government reports). Regular expressions combined with NER can systematically extract these values; a minimal sketch appears below.
Machine translation: Multilingual corpora—historical diplomatic correspondence, cross-national news coverage, colonial archives—require translation before analysis. Google Translate API, DeepL, and open-source models (via Hugging Face) enable programmatic translation at scale.
Geocoding text: Place names in historical documents can be converted to coordinates for spatial analysis using geocoding APIs (Google Maps, OpenStreetMap Nominatim) or historical gazetteers.
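To illustrate the extraction idea, here is a minimal sketch that pulls dollar amounts out of unstructured text with a regular expression; the snippet is invented and the pattern would need adapting to a real source:

```python
import re

# Made-up snippet standing in for OCRed newspaper text
text = "Wheat sold at $1.25 per bushel in Chicago; in New York the price was $1.40."

# Dollar amounts with optional cents; adapt the pattern to your documents
prices = re.findall(r"\$\d+(?:\.\d{1,2})?", text)
print(prices)   # ['$1.25', '$1.40']
```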
For Further Study
The Programming Historian (https://programminghistorian.org/en/lessons/) offers peer-reviewed tutorials on these methods, written for humanities researchers but valuable for any discipline working with text as data. The methodological pluralism of social science increasingly requires moving between quantitative analysis and computational text methods.
8.3 Spatial Data
Spatial Data Basics
Spatial data comes in two main forms:
Vector: Points, lines, polygons (boundaries)
Raster: Grid cells (satellite imagery, elevation)
We focus on vector data, which is more common in social science.
R with sf
Mapping with ggplot2:
Spatial autocorrelation:
Python with geopandas
Mapping:
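A minimal choropleth sketch with geopandas, assuming a hypothetical counties.shp file with an unemployment column:

```python
import geopandas as gpd
import matplotlib.pyplot as plt

counties = gpd.read_file("counties.shp")   # hypothetical shapefile

fig, ax = plt.subplots(figsize=(8, 6))
counties.plot(column="unemployment", cmap="viridis", legend=True,
              edgecolor="white", linewidth=0.2, ax=ax)   # choropleth
ax.set_axis_off()
fig.savefig("unemployment_map.png", dpi=300)
```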
Spatial autocorrelation with PySAL:
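A minimal sketch of global and local Moran's I with libpysal and esda, under the same hypothetical counties.shp assumption:

```python
import geopandas as gpd
from libpysal.weights import Queen
from esda.moran import Moran, Moran_Local

counties = gpd.read_file("counties.shp")    # hypothetical shapefile

# Queen contiguity weights, row-standardized
w = Queen.from_dataframe(counties)
w.transform = "R"

# Global Moran's I with a permutation-based p-value
mi = Moran(counties["unemployment"], w)
print(mi.I, mi.p_sim)

# Local Moran's I: flag significant high-high clusters
lisa = Moran_Local(counties["unemployment"], w)
counties["hot_spot"] = (lisa.q == 1) & (lisa.p_sim < 0.05)
```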
8.4 Time Series Tools
Time Series vs. Panel Data
Time series analysis (Chapter 7, Chapter 16) focuses on a single unit observed over many periods—one country's GDP, one firm's stock price. Panel data (Chapter 15) involves many units over fewer periods. The methods differ: time series emphasizes autocorrelation, stationarity, and dynamic structure; panels emphasize cross-sectional variation and fixed effects. The tools here focus on single-unit time series; for panel methods, see Chapter 18.
Time Series Basics
R with tseries and forecast:
Python with statsmodels:
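A minimal sketch of a first look at a series with pandas and statsmodels, using a simulated random walk in place of real data:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# Simulated monthly series standing in for real data
rng = np.random.default_rng(0)
y = pd.Series(np.cumsum(rng.normal(size=240)),
              index=pd.date_range("2000-01", periods=240, freq="MS"))

fig, axes = plt.subplots(3, 1, figsize=(8, 8))
y.plot(ax=axes[0], title="Level")   # inspect the raw series first
plot_acf(y, ax=axes[1])             # autocorrelation function
plot_pacf(y, ax=axes[2])            # partial autocorrelation function
fig.tight_layout()
```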
Stationarity Testing
R:
Python:
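A minimal sketch of ADF and KPSS tests with statsmodels on simulated data; note that the two tests have opposite null hypotheses:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import adfuller, kpss

# Simulated random walk standing in for a macro series in levels
rng = np.random.default_rng(0)
y = pd.Series(np.cumsum(rng.normal(size=240)),
              index=pd.date_range("2000-01", periods=240, freq="MS"))

adf_p = adfuller(y)[1]                 # H0: unit root (non-stationary)
kpss_p = kpss(y, regression="c")[1]    # H0: stationary around a constant
print(f"ADF p = {adf_p:.3f}, KPSS p = {kpss_p:.3f}")

# If the series looks non-stationary, difference it and test again
dy = y.diff().dropna()
print(f"ADF p after differencing = {adfuller(dy)[1]:.3f}")
```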
Decomposition
R:
Python:
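A minimal sketch with statsmodels' classical decomposition, on a simulated monthly series with trend and seasonality:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Simulated monthly series with trend, seasonality, and noise
rng = np.random.default_rng(0)
t = np.arange(120)
y = pd.Series(0.05 * t + np.sin(2 * np.pi * t / 12) + rng.normal(scale=0.3, size=120),
              index=pd.date_range("2000-01", periods=120, freq="MS"))

decomposition = seasonal_decompose(y, model="additive", period=12)
decomposition.plot()   # panels for observed, trend, seasonal, residual

trend, seasonal, resid = decomposition.trend, decomposition.seasonal, decomposition.resid
```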
ARIMA Modeling
R:
Python:
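A minimal ARIMA sketch with statsmodels, holding out the last year to check forecast accuracy; the order (1, 1, 1) is illustrative, not a recommendation:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Simulated monthly series in levels; replace with real data
rng = np.random.default_rng(0)
y = pd.Series(np.cumsum(rng.normal(size=240)),
              index=pd.date_range("2000-01", periods=240, freq="MS"))

train, test = y[:-12], y[-12:]          # hold out the last year

model = ARIMA(train, order=(1, 1, 1))   # (p, d, q); compare candidate orders by AIC
fit = model.fit()

fc = fit.get_forecast(steps=12)
mean, ci = fc.predicted_mean, fc.conf_int()   # point forecasts and intervals
rmse = float(np.sqrt(((mean - test) ** 2).mean()))
print(f"Out-of-sample RMSE: {rmse:.3f}")
```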
VAR Models
R:
Python:
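A minimal VAR sketch with statsmodels, on simulated stand-ins for two stationary macro series:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.api import VAR

# Simulated quarterly stand-ins for, e.g., GDP growth and inflation
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(120, 2)),
                  index=pd.date_range("1990-01", periods=120, freq="QS"),
                  columns=["gdp_growth", "inflation"])

model = VAR(df)
results = model.fit(4)                  # fixed lag length; or select with ic="aic"
print(results.summary())

# Forecast 4 quarters ahead from the last observed lags
fc = results.forecast(df.values[-results.k_ar:], steps=4)

# Impulse response functions
irf = results.irf(10)
irf.plot(orth=True)
```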
8.5 Network Analysis
Why Networks Matter
Many social phenomena have network structure: social connections, trade relationships, citation patterns, corporate boards, legislative cosponsorship. Network analysis provides tools to describe and analyze these relational structures.

Network Basics
Key concepts:
Nodes (vertices): Actors in the network (people, firms, countries)
Edges (links): Relationships between actors (friendships, trades, citations)
Directed vs. undirected: Do edges have direction? (Twitter follows vs. Facebook friends)
Weighted vs. unweighted: Do edges have strength? (trade volume vs. trade existence)
R with igraph
Community detection:
Visualization:
Python with NetworkX
Community detection:
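A minimal sketch with NetworkX's built-in karate club graph; greedy modularity maximization is one of several available community detection algorithms:

```python
import networkx as nx
from networkx.algorithms import community

G = nx.karate_club_graph()   # built-in example network

# Greedy modularity maximization; returns a list of node sets
communities = community.greedy_modularity_communities(G)
for i, c in enumerate(communities):
    print(f"Community {i}: {sorted(c)}")

# Common centrality measures
degree = nx.degree_centrality(G)
betweenness = nx.betweenness_centrality(G)
```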
Visualization:
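A minimal plotting sketch for the same example graph; node sizes scale with degree:

```python
import matplotlib.pyplot as plt
import networkx as nx

G = nx.karate_club_graph()

pos = nx.spring_layout(G, seed=42)   # force-directed layout
nx.draw_networkx_nodes(G, pos, node_size=[60 * G.degree(n) for n in G],
                       node_color="steelblue")
nx.draw_networkx_edges(G, pos, alpha=0.4)
nx.draw_networkx_labels(G, pos, font_size=8)
plt.axis("off")
plt.savefig("network.png", dpi=300)
```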
Common Network Analyses
Bipartite networks (two-mode networks):
Export for further analysis:
8.6 Web Scraping and APIs
Legal and Ethical Considerations
Web scraping exists in a legal gray area. Key principles: (1) Check the site's Terms of Service—some explicitly prohibit scraping. (2) Respect robots.txt directives. (3) Don't overwhelm servers—add delays between requests. (4) Consider whether data contains personal information (GDPR, IRB implications). (5) Academic research often receives more latitude, but "fair use" for research is not a blanket exemption. When in doubt, contact the data source or consult your institution's legal office.
Data Collection from the Web
The web is a vast source of data for social science research: government websites, corporate filings, news articles, social media, online platforms. Two main approaches:
APIs (Application Programming Interfaces): Structured access provided by the data source
Web scraping: Extracting data from web pages directly
When to use which:
Prefer APIs when available—they're more reliable and typically permitted
Use scraping when no API exists or API limits are too restrictive
Always check terms of service and robots.txt
Working with APIs
R with httr:
Python with requests:
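A minimal sketch using the World Bank's public API (no key required; check the World Bank API documentation for the current endpoint and parameters):

```python
import requests

# GDP per capita for the US from the World Bank API, returned as JSON
url = "https://api.worldbank.org/v2/country/US/indicator/NY.GDP.PCAP.CD"
resp = requests.get(url, params={"format": "json", "per_page": 100}, timeout=30)
resp.raise_for_status()                  # fail loudly on HTTP errors

data = resp.json()                       # [metadata, list of observations]
for obs in data[1][:5]:
    print(obs["date"], obs["value"])
```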
Example: FRED API (Federal Reserve Economic Data):
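A hedged sketch of the FRED observations endpoint; it requires a free API key, and the URL and parameter names below should be checked against the current FRED API documentation:

```python
import requests

API_KEY = "YOUR_FRED_API_KEY"            # placeholder; register for a free key

url = "https://api.stlouisfed.org/fred/series/observations"
params = {"series_id": "UNRATE",         # civilian unemployment rate
          "api_key": API_KEY,
          "file_type": "json"}

resp = requests.get(url, params=params, timeout=30)
resp.raise_for_status()
obs = resp.json()["observations"]
print(obs[-3:])                          # the most recent observations
```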
Web Scraping Basics
R with rvest:
Python with BeautifulSoup:
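A minimal sketch with requests and BeautifulSoup; the target URL and tags are placeholders, and for well-formed HTML tables pandas.read_html is often simpler:

```python
import requests
from bs4 import BeautifulSoup

url = "https://example.com/some-page"    # placeholder; check terms of service first
resp = requests.get(url, headers={"User-Agent": "research-project (you@example.edu)"},
                    timeout=30)
resp.raise_for_status()

soup = BeautifulSoup(resp.text, "html.parser")

# Pull headings and links as a quick structural check of the page
headings = [h.get_text(strip=True) for h in soup.find_all("h2")]
links = [a.get("href") for a in soup.find_all("a", href=True)]
print(headings[:5], links[:5])
```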
Scraping Multiple Pages
Best Practices for Web Data Collection
Ethical and Legal Considerations
Check terms of service: Many sites prohibit scraping
Respect robots.txt: Check example.com/robots.txt for rules
Rate limiting: Don't overwhelm servers—add delays between requests
Identify yourself: Set a user-agent string with contact info
Cache results: Don't re-scrape data you already have
Consider privacy: Be careful with personal data
Setting a user agent:
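A minimal sketch with requests; the user-agent string and contact address are placeholders:

```python
import requests

headers = {
    # Identify yourself and give a way to reach you
    "User-Agent": "univ-research-scraper/0.1 (jane.doe@example.edu)"
}
resp = requests.get("https://example.com/data", headers=headers, timeout=30)
```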
Handling dynamic content (JavaScript-rendered pages):
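A minimal Selenium sketch (Selenium 4 or later manages the browser driver itself); the URL and CSS selector are placeholders, and explicit waits are preferable to the fixed sleep shown here:

```python
import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
try:
    driver.get("https://example.com/dynamic-page")   # placeholder URL
    time.sleep(3)                                    # crude wait for JavaScript to render
    soup = BeautifulSoup(driver.page_source, "html.parser")
    items = driver.find_elements(By.CSS_SELECTOR, ".result")   # hypothetical selector
finally:
    driver.quit()
```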
Practical Guidance
When to Use What
| Task | R | Python |
| --- | --- | --- |
| Basic visualization | ggplot2 | matplotlib, seaborn |
| Interactive plots | plotly | plotly |
| Text tokenization | tidytext | nltk, spacy |
| Topic models | topicmodels | sklearn, gensim |
| Spatial data | sf | geopandas |
| Spatial statistics | spdep | pysal |
| Time series basics | stats, tseries | statsmodels |
| ARIMA | forecast | statsmodels, pmdarima |
| VAR | vars | statsmodels |
| Network analysis | igraph, ggraph | networkx |
| Web scraping | rvest | beautifulsoup4, requests |
| APIs | httr, jsonlite | requests |
| Dynamic scraping | RSelenium | selenium |
Common Pitfalls
Pitfall 1: Ignoring Missing Data in Time Series
Many time series functions fail silently with missing values or produce incorrect results.
How to avoid: Always check for and handle missing values explicitly. Use interpolation or indicator methods as appropriate.
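A minimal sketch of checking and interpolating missing values in a pandas series (simulated data):

```python
import numpy as np
import pandas as pd

# Monthly series with two missing observations (simulated)
y = pd.Series(np.arange(12, dtype=float),
              index=pd.date_range("2020-01", periods=12, freq="MS"))
y.iloc[[3, 7]] = np.nan

print(y.isna().sum())                     # always count missing values first

y_interp = y.interpolate(method="time")   # linear interpolation over time
y_dropped = y.dropna()                    # or drop, if gaps are rare and unproblematic
```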
Pitfall 2: Wrong CRS in Spatial Data
Mixing coordinate reference systems produces incorrect spatial operations.
How to avoid: Always check the CRS with st_crs() (R) or .crs (Python). Transform to a common CRS before joining or overlaying.
Pitfall 3: Over-Interpreting Topic Models
LDA produces topics regardless of whether coherent topics exist.
How to avoid: Use coherence metrics. Validate topics substantively. Try different numbers of topics.
Environment Management for Reproducibility
Package versions matter. Code that runs today may fail next year when dependencies update. Environment management tools lock package versions so analyses remain reproducible.
R with renv
The renv.lock file records exact package versions. Commit it to version control so collaborators (and future you) can recreate the environment.
Python with conda
Python with venv + pip
Box: Which Environment Tool?
| Tool | Language | Best for |
| --- | --- | --- |
| renv | R | Most R projects; plays well with RStudio |
| conda | Python (+ R) | Data science; handles non-Python dependencies (C libraries, GDAL) |
| venv + pip | Python | Lightweight Python-only projects |
| Docker | Any | Maximum reproducibility; bundles OS + dependencies (see Chapter 26) |
Key principle: Record your environment in a lockfile committed to version control. A README saying "install pandas" is not reproducible; requirements.txt or renv.lock is.
Implementation Checklist
Visualization:
Text analysis:
Spatial:
Time series:
Summary
Key takeaways:
ggplot2 (R) and matplotlib/seaborn (Python) handle most visualization needs; use plotly for interactivity.
Text analysis follows a standard pipeline: tokenize, clean, create document-term matrix, then analyze with topic models or sentiment.
Spatial analysis requires attention to coordinate reference systems; sf (R) and geopandas (Python) are the workhorses.
Time series analysis starts with stationarity testing; forecast (R) and statsmodels (Python) provide comprehensive ARIMA and VAR tools.
Network analysis with igraph (R) or NetworkX (Python) enables study of relational structures—centrality, communities, and network statistics.
Web data collection through APIs is preferred when available; web scraping with rvest (R) or BeautifulSoup (Python) works when APIs don't exist—but respect terms of service and rate limits.
Returning to the opening question: The tools in this chapter implement the methods from Chapters 5-7. The key is matching the tool to the task while maintaining reproducible practices from Chapter 4. Both R and Python ecosystems are mature enough for any common task; choose based on your broader workflow and team preferences.
Further Reading
Essential
Wickham, H. (2016). "ggplot2: Elegant Graphics for Data Analysis." Springer.
Lovelace, R., J. Nowosad, and J. Muenchow (2019). "Geocomputation with R." CRC Press. [Free online]
For Deeper Understanding
Silge, J. and D. Robinson (2017). "Text Mining with R." O'Reilly. [Free online]
Hyndman, R. and G. Athanasopoulos (2021). "Forecasting: Principles and Practice." 3rd ed. [Free online]
Advanced/Specialized
Rey, S. et al. "Geographic Data Science with Python." [pysal documentation]
VanderPlas, J. (2016). "Python Data Science Handbook." O'Reilly. [Free online]
Network Analysis
Kolaczyk, E. and G. Csárdi (2014). "Statistical Analysis of Network Data with R." Springer.
Newman, M. (2018). "Networks." 2nd ed. Oxford University Press.
Digital Humanities and Interdisciplinary Methods
The Programming Historian (https://programminghistorian.org/en/lessons/). Peer-reviewed tutorials on digital methods for humanities research, including:
OCR and digitizing historical documents
Named entity recognition and extracting geographic/economic information from text
Network analysis of historical correspondence and social networks
Web scraping for archival research
Machine translation and multilingual text analysis
Jurafsky, D. and J. Martin. "Speech and Language Processing." 3rd ed. [Free online draft] Comprehensive treatment of NLP.
Jockers, M. (2014). "Text Analysis with R for Students of Literature." Springer. Accessible introduction to computational text analysis.
Exercises
Conceptual
When would you choose interactive visualization over static? What are the tradeoffs?
Explain why coordinate reference systems matter for spatial analysis. What problems can arise from mixing CRS?
Applied
Take a corpus of text (academic abstracts, news articles, or tweets) and:
Preprocess the text
Fit a topic model with 5 topics
Visualize the top words per topic
Assess whether the topics are coherent
Using county-level data (unemployment, income, or another variable):
Create a choropleth map
Test for spatial autocorrelation
Identify clusters using Local Moran's I
Using a macroeconomic time series (GDP, unemployment, or inflation):
Test for stationarity
Fit an appropriate ARIMA model
Generate forecasts with confidence intervals
Assess forecast accuracy on held-out data
Discussion
Some argue that interactive visualizations are more engaging but harder to reproduce. How would you balance these concerns in a research project?