Chapter 6: Describing Patterns in Data
Opening Question
How do we discover structure in data and communicate what we find?
Chapter Overview
Before causation comes description. Before asking whether X causes Y, we need to understand what X and Y look like: their distributions, their relationships, their patterns across time and space. Good description is valuable in itself—documenting facts about the world is a core scientific activity—and essential preparation for causal analysis.
This chapter covers the tools of descriptive empirical work: exploratory data analysis, dimension reduction, text analysis, and spatial methods. These techniques help us see patterns, summarize complexity, and generate hypotheses. Description doesn't establish causation, but causation without description is blind.
What you will learn:
Principles of exploratory data analysis and visualization
Dimension reduction: PCA, factor analysis, clustering
Text as data: topic models, sentiment analysis, content analysis
Spatial patterns: mapping, spatial autocorrelation
The relationship between quantitative description and qualitative thick description
Prerequisites: Chapter 2 (Data), Chapter 3 (Statistical Foundations)
Historical Context: From Tables to Visualization
The history of data visualization parallels the history of empirical social science.
William Playfair (1759-1823) invented the line chart, bar chart, and pie chart. His 1786 Commercial and Political Atlas visualized England's trade data in forms we'd recognize today.
Florence Nightingale (1820-1910) used her "coxcomb" diagrams to show that more soldiers died from preventable disease than battle wounds—visualization as advocacy.
Charles Joseph Minard (1781-1870) created what Edward Tufte called "the best statistical graphic ever drawn": a map of Napoleon's Russian campaign showing army size, direction, location, temperature, and time simultaneously.
The 20th century brought statistical formalization: correlation (Pearson), factor analysis (Spearman, Thurstone), exploratory data analysis (Tukey's 1977 EDA), and eventually computational tools.
Today: The explosion of data and computing power has transformed description. We can visualize millions of observations, fit models with thousands of variables, and analyze text at unprecedented scale.
6.1 Exploratory Data Analysis
The EDA Philosophy
John Tukey's exploratory data analysis (EDA) emphasizes looking at data before formal modeling:
"The greatest value of a picture is when it forces us to notice what we never expected to see." — John Tukey
Principles:
Let the data speak before imposing models
Use visual displays rather than just summary statistics
Be open to surprise
Iterate between looking and questioning
Univariate Description
For continuous variables:
Central tendency: Mean, median, trimmed mean
Mean is efficient if data are symmetric; median is robust to outliers
Spread: Standard deviation, IQR, MAD (median absolute deviation)
SD is efficient for normal data; IQR/MAD more robust
Shape: Skewness, kurtosis
Skewness > 0: Right tail heavier (income distributions)
Kurtosis > 3: Heavier tails than normal
Visualization:
Histogram: See the shape
Kernel density plot: Smoothed shape
Box plot: Five-number summary (min, Q1, median, Q3, max)
Q-Q plot: Compare to theoretical distribution

For categorical variables:
Frequency table
Bar chart (not pie chart—humans are bad at comparing angles)
Mode and category proportions
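A minimal sketch of these univariate summaries and displays, using pandas, SciPy, and matplotlib on synthetic data (the column names and values here are hypothetical, not the chapter's dataset):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

# Synthetic stand-in data; column names and values are hypothetical
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "wage": np.exp(rng.normal(3.0, 0.5, 1_000)),           # right-skewed, like income
    "education": rng.choice(["HS", "BA", "Grad"], 1_000),
})
wage = df["wage"]

# Central tendency: mean, median, 10%-trimmed mean
print(wage.mean(), wage.median(), stats.trim_mean(wage, 0.10))

# Spread: standard deviation, IQR, median absolute deviation
iqr = wage.quantile(0.75) - wage.quantile(0.25)
print(wage.std(), iqr, stats.median_abs_deviation(wage))

# Shape: skewness and kurtosis (pandas reports excess kurtosis, i.e. relative to the normal)
print(wage.skew(), wage.kurtosis())

# Visual displays: histogram, kernel density, box plot, Q-Q plot
fig, axes = plt.subplots(2, 2, figsize=(10, 8))
axes[0, 0].hist(wage, bins=50)
wage.plot.kde(ax=axes[0, 1])
axes[1, 0].boxplot(wage, vert=False)
stats.probplot(wage, dist="norm", plot=axes[1, 1])
plt.tight_layout()
plt.show()

# Categorical variable: frequency table and bar chart
print(df["education"].value_counts(normalize=True))
df["education"].value_counts().plot.bar()
plt.show()
```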
Bivariate Description
Two continuous variables:
Scatter plot: The fundamental display
Correlation coefficient: r = Cov(X, Y) / (SD(X) · SD(Y))
Lowess/loess: Nonparametric smoothed relationship
Continuous and categorical:
Box plots by group
Kernel densities by group
Summary statistics by group
Two categorical variables:
Cross-tabulation (contingency table)
Chi-squared test of independence
Mosaic plot
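A minimal bivariate sketch in the same spirit, on synthetic data (all variable names are hypothetical); it computes a correlation, a loess smooth, group summaries, and a cross-tabulation with a chi-squared test:

```python
import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(1)
n = 2_000
df = pd.DataFrame({
    "education_years": rng.integers(10, 21, n),
    "union": rng.choice(["yes", "no"], n),
    "region": rng.choice(["north", "south"], n),
})
df["log_wage"] = 1.5 + 0.08 * df["education_years"] + rng.normal(0, 0.4, n)

# Two continuous variables: Pearson correlation and a loess smooth
r = df["log_wage"].corr(df["education_years"])
smooth = lowess(df["log_wage"], df["education_years"], frac=0.3)  # smoothed (x, y) pairs, ready to plot
print(f"Pearson r = {r:.2f}")

# Continuous by categorical: summary statistics by group
print(df.groupby("union")["log_wage"].describe())

# Two categorical variables: cross-tabulation and chi-squared test of independence
table = pd.crosstab(df["union"], df["region"])
chi2, p, dof, expected = stats.chi2_contingency(table)
print(table)
print(f"chi-squared = {chi2:.2f}, p = {p:.3f}")
```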
Worked Example: Wage Distribution
Data: 10,000 U.S. workers from the CPS
Univariate description of hourly wages:
Mean: $25.50
Median: $20.00
SD: $15.00
Skewness: 2.3 (right-skewed)
Min: $7.25, Max: $150.00
Visualization reveals: Spike at minimum wage, long right tail, possible outliers above $100
Log transformation: Log wages are approximately normal—common for income data
Bivariate: Wages by education
Less than HS: Median $12.50
HS diploma: Median $17.00
Some college: Median $20.00
Bachelor's: Median $30.00
Graduate: Median $40.00
The educational gradient is immediately visible in side-by-side box plots.
Multivariate Description
With many variables, visualization becomes harder. Strategies:
Small multiples: Array of simple plots (scatter plots for all variable pairs)
Pair plots: Matrix of bivariate scatter plots
Parallel coordinates: Each observation is a line across variable axes
Heat maps: Color-coded correlation matrices
Interactive visualization: Linked views, brushing, zooming
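A minimal sketch of two of these strategies, pair plots and a correlation heat map, using seaborn on synthetic data (the variable names are hypothetical):

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
df = pd.DataFrame(rng.normal(size=(500, 4)),
                  columns=["income", "education", "wealth", "health"])

# Pair plot: matrix of bivariate scatter plots with densities on the diagonal
sns.pairplot(df, diag_kind="kde")
plt.show()

# Heat map of the correlation matrix
sns.heatmap(df.corr(), annot=True, center=0, cmap="vlag")
plt.show()
```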
What Description Can and Cannot Do
Description can:
Characterize distributions
Identify patterns and anomalies
Suggest hypotheses
Communicate findings
Establish facts
Description cannot:
Establish causation
Control for confounders
Prove mechanisms
Project out of sample (without strong assumptions)
6.2 Dimension Reduction
The Curse of Dimensionality
With many variables, data become sparse. In high dimensions, all points are far from all other points; nearest-neighbor methods fail; models become unstable.
Dimension reduction projects high-dimensional data onto a lower-dimensional space that preserves important structure.
Principal Component Analysis (PCA)
Idea: Find directions of maximum variance in the data; project onto those directions.
Mathematics: For a centered data matrix X (n × p), find the eigenvectors of X′X (equivalently, of the sample covariance matrix).
First principal component: the direction of maximum variance, w_1 = argmax_{||w|| = 1} Var(Xw)
Subsequent components: Orthogonal to previous, maximum remaining variance
Explained variance: The proportion of total variance captured by the first k components is (∑_{j=1}^{k} λ_j) / (∑_{j=1}^{p} λ_j)
where λ_j is the j-th eigenvalue.
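A minimal PCA sketch with scikit-learn on synthetic data; standardizing first means the analysis works with the correlation matrix, so no variable dominates because of its units. The data-generating choices below are illustrative only:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: six indicators driven by one underlying dimension
rng = np.random.default_rng(3)
latent = rng.normal(size=(1_000, 1))
X = latent @ rng.normal(size=(1, 6)) + rng.normal(scale=0.5, size=(1_000, 6))

X_std = StandardScaler().fit_transform(X)   # standardize before PCA

pca = PCA()
scores = pca.fit_transform(X_std)           # component scores for each observation
print(pca.explained_variance_ratio_)        # share of variance per component
print(pca.components_[0])                   # loadings of the first component
```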
Worked Example: Socioeconomic Status
Variables (measured for 1,000 households):
Income, Education years, Occupation prestige, Home value, Net worth, Car value
PCA results:
PC1 explains 65% of variance; loadings all positive → "overall SES"
PC2 explains 15% of variance; contrasts education vs. wealth → "human vs. financial capital"
First 2 PCs explain 80% of variance in 6 variables
Visualization: Plot households in PC1-PC2 space; see clustering, outliers, patterns
Factor Analysis
Idea: Variables are driven by latent factors plus idiosyncratic error.
Model: X_i = Λ f_i + ε_i
where:
X_i = observed variables (p × 1)
f_i = latent factors (k × 1, with k << p)
Λ = factor loadings (p × k)
ε_i = unique factors
Difference from PCA:
PCA: Components are exact linear combinations of observed variables
Factor analysis: Factors are latent; model includes error structure
Box: PCA vs. Factor Analysis—A Deeper Comparison
These methods are often confused because they can produce similar-looking output. But they embody different philosophies.
Goal: PCA summarizes variance; factor analysis models latent variables.
Direction: PCA goes from data to components; factor analysis goes from latent factors to data.
Model: PCA imposes no statistical model; factor analysis specifies an explicit generative model.
Error: PCA is an exact decomposition with no error term; factor analysis includes unique factors ε_i.
Components/factors: PCA components are defined by the data; factors are assumed to exist.
Uniqueness: the PCA solution is unique; factor solutions require rotation.
Sample size: PCA works even with n < p; factor analysis typically needs n >> p.
Key intuition:
PCA asks: "What linear combinations of my variables capture the most variance?" The first principal component is the direction where your data is most spread out. There's no assumption about why—just finding the best summary.
Factor analysis asks: "What latent constructs might have generated these correlated variables?" It posits that unobserved factors (like "intelligence" or "SES") cause the observed correlations. The error term matters—it represents variation unique to each variable.
When each is appropriate:
PCA: Dimension reduction for prediction, visualization, dealing with multicollinearity, no theoretical commitment to latent constructs
Factor analysis: Psychometrics (measuring personality, ability), constructing validated scales, when you believe latent factors exist
Common mistake: Using PCA when you have a measurement model in mind. If you believe "socioeconomic status" is a real latent variable that causes income, education, and occupation prestige to be correlated, factor analysis is theoretically appropriate. PCA just finds variance-maximizing directions without causal interpretation.
Rotation: Factor solutions are not unique. Rotation (varimax, promax) seeks interpretable structure.
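A minimal factor-analysis sketch on synthetic data, using scikit-learn's FactorAnalysis (recent versions support a varimax rotation). The generated loadings and data are illustrative; dedicated psychometrics packages offer richer diagnostics:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.preprocessing import StandardScaler

# Synthetic data: six indicators generated by two latent factors plus noise
rng = np.random.default_rng(4)
factors = rng.normal(size=(1_000, 2))
loadings = rng.normal(size=(2, 6))
X = factors @ loadings + rng.normal(scale=0.7, size=(1_000, 6))

fa = FactorAnalysis(n_components=2, rotation="varimax")
factor_scores = fa.fit_transform(StandardScaler().fit_transform(X))
print(fa.components_)        # estimated loadings (after rotation)
print(fa.noise_variance_)    # estimated variances of the unique factors
```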
Clustering
Idea: Group observations into clusters with similar characteristics.
K-means clustering:
Choose k cluster centers
Assign each observation to nearest center
Update centers as cluster means
Iterate until convergence
Hierarchical clustering:
Build tree (dendrogram) of nested clusters
Agglomerative: Start with individual observations, merge
Divisive: Start with one cluster, split
Choosing k: Elbow method (variance explained), silhouette score, gap statistic
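A minimal k-means sketch on synthetic data, comparing candidate values of k with the within-cluster sum of squares (for the elbow method) and the silhouette score:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Three well-separated synthetic groups in two dimensions
rng = np.random.default_rng(5)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(200, 2)) for c in (0, 3, 6)])
X = StandardScaler().fit_transform(X)

for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sil = silhouette_score(X, km.labels_)
    print(f"k = {k}: within-cluster SS = {km.inertia_:.1f}, silhouette = {sil:.2f}")
```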
When to Use Which
To summarize variance: PCA
To posit latent constructs: factor analysis
To group similar observations: clustering
For prediction with many variables: PCA followed by regression
To construct indices: factor analysis or PCA
6.3 Text as Data
The Text Revolution
Text data—documents, speeches, social media, surveys—are abundant but unstructured. Quantitative text analysis transforms words into numbers.
Applications:
Measuring political ideology from speeches
Analyzing consumer sentiment from reviews
Tracking media coverage
Coding open-ended survey responses
Preprocessing
Before analysis, text requires preprocessing:
Tokenization: Split text into words (tokens)
Case normalization: Convert to lowercase
Stop word removal: Remove common words (the, and, is)
Stemming/lemmatization: Reduce words to stems (running → run)
N-grams: Create multi-word tokens (e.g., "New York" as single token)
Bag of Words and TF-IDF
Bag of words: Represent document as vector of word counts (ignoring order)
Document-term matrix: Rows are documents, columns are terms, cells are counts
TF-IDF (Term Frequency-Inverse Document Frequency): TF-IDF_ij = TF_ij × log(N / DF_j)
where:
TF_ij = frequency of term j in document i
DF_j = number of documents containing term j
N = total number of documents
TF-IDF upweights distinctive terms; downweights common terms.
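A minimal scikit-learn sketch that builds a document-term matrix with TF-IDF weights from a tiny hypothetical corpus. Note that scikit-learn's default IDF includes smoothing terms, so the weights differ slightly from the textbook formula above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the budget cuts taxes and spending",
    "the treaty strengthens the military alliance",
    "tax cuts create jobs and growth",
]  # hypothetical mini-corpus; real applications use full documents

# Lowercasing, tokenization, and English stop-word removal in one step
vectorizer = TfidfVectorizer(stop_words="english", ngram_range=(1, 2))
dtm = vectorizer.fit_transform(docs)        # sparse document-term matrix

print(vectorizer.get_feature_names_out())   # terms (unigrams and bigrams)
print(dtm.toarray().round(2))               # TF-IDF weight of each term in each document
```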
Topic Models
Idea: Documents are mixtures of topics; topics are distributions over words.
Latent Dirichlet Allocation (LDA):
Each document has a distribution over topics
Each topic has a distribution over words
Generative model inferred from observed word patterns
Output:
Topic-word distributions: What words characterize each topic?
Document-topic distributions: What topics are in each document?
Example: Apply LDA to Congressional speeches
Topic 1: Economy (tax, budget, spending, jobs)
Topic 2: Foreign policy (war, military, treaty, alliance)
Topic 3: Social issues (family, children, health, education)
Each speech is a mixture of these topics.
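A minimal LDA sketch with scikit-learn on a tiny hypothetical corpus (real applications use thousands of documents; LDA is usually fit on raw word counts rather than TF-IDF):

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "tax budget spending jobs growth economy",
    "war military treaty alliance defense",
    "family children health education schools",
    "budget deficit taxes jobs wages",
]  # hypothetical mini-corpus

vec = CountVectorizer(stop_words="english")
counts = vec.fit_transform(docs)
vocab = vec.get_feature_names_out()

lda = LatentDirichletAllocation(n_components=3, random_state=0).fit(counts)

# Topic-word distributions: the highest-weight words in each topic
for k, topic in enumerate(lda.components_):
    top_words = vocab[topic.argsort()[::-1][:4]]
    print(f"Topic {k}: {', '.join(top_words)}")

# Document-topic distributions: estimated topic shares for each document
print(lda.transform(counts).round(2))
```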
An influential application of this approach is Jensen, Kaplan, Naidu, and Wilse-Samson (2012), who analyze 130 years of Congressional speech to measure partisan language. They identify phrases that distinguish Republican from Democratic speakers and track how this partisan vocabulary diffuses through public discourse. The work demonstrates how text methods can transform qualitative historical records into quantitative measures of political polarization. A related strand examines whether economists' research itself exhibits political language patterns (Jelveh, Kogut, and Naidu 2024).
Sentiment Analysis
Idea: Measure positive/negative sentiment in text.
Dictionary methods:
Count positive words (excellent, wonderful) and negative words (terrible, awful)
Net sentiment = positive - negative
Machine learning methods:
Train classifier on labeled examples (positive/negative reviews)
Apply to new text
Challenges:
Negation ("not good")
Sarcasm
Domain specificity (terms mean different things in different contexts)
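A minimal dictionary-method sketch with tiny hypothetical word lists; the second example shows why negation trips up naive counting:

```python
# Tiny hypothetical sentiment dictionaries
POSITIVE = {"excellent", "wonderful", "good", "great"}
NEGATIVE = {"terrible", "awful", "bad", "poor"}

def net_sentiment(text: str) -> int:
    """Net sentiment = count of positive words minus count of negative words."""
    tokens = text.lower().split()
    return sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)

print(net_sentiment("the service was excellent but the food was awful"))  # 0
print(net_sentiment("not good"))  # scores +1: negation is ignored
```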
Text Analysis Workflow
Collect text corpus
Preprocess (tokenize, clean, transform)
Explore (word frequencies, co-occurrences)
Model (topic model, sentiment, classification)
Validate (human coding comparison)
Analyze (relate text measures to outcomes)
6.4 Spatial Patterns
Why Space Matters
"Everything is related to everything else, but near things are more related than distant things." — Waldo Tobler's First Law of Geography
Spatial patterns are pervasive in social data:
Housing prices depend on neighborhood
Crime clusters geographically
Economic activity agglomerates
Health outcomes vary by region
Mapping and Visualization
Choropleth maps: Color regions by variable value
Good for rates, densities, indices
Challenge: Area and population may not align
Point maps: Plot individual locations
Good for events, facilities, observations
Challenge: Overplotting in dense areas
Flow maps: Show movement between locations
Good for migration, trade, commuting
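A minimal choropleth sketch using geopandas; four unit squares stand in for real administrative boundaries, which you would normally read from a shapefile or GeoJSON file:

```python
import geopandas as gpd
import matplotlib.pyplot as plt
from shapely.geometry import box

# Four toy "regions" with an attribute to map (values are hypothetical)
gdf = gpd.GeoDataFrame(
    {"median_income": [32_000, 45_000, 61_000, 80_000]},
    geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1), box(0, 1, 1, 2), box(1, 1, 2, 2)],
)

# Choropleth: color each region by its value
gdf.plot(column="median_income", cmap="viridis", legend=True, edgecolor="black")
plt.axis("off")
plt.title("Median income by region (toy data)")
plt.show()
```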

Spatial Autocorrelation
Definition 6.1 (Spatial Autocorrelation): Spatial autocorrelation measures the degree to which values at nearby locations are similar (positive autocorrelation) or dissimilar (negative autocorrelation).
Moran's I: I = [n / (∑_i ∑_j w_ij)] × [∑_i ∑_j w_ij (y_i − ȳ)(y_j − ȳ)] / [∑_i (y_i − ȳ)²]
where w_ij is the spatial weight between locations i and j (e.g., 1 if i and j are neighbors, 0 otherwise).
Interpretation:
I>0: Positive autocorrelation (clustering)
I=0: Random spatial pattern
I<0: Negative autocorrelation (dispersion)
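A minimal numpy sketch that implements the Moran's I formula above directly, with a toy four-region weight matrix; clustered values give a positive I and alternating values give a negative I:

```python
import numpy as np

def morans_i(y, w):
    """Moran's I for values y and a (symmetric) spatial weight matrix w."""
    y = np.asarray(y, dtype=float)
    z = y - y.mean()                          # deviations from the mean
    num = (w * np.outer(z, z)).sum()          # sum_i sum_j w_ij * z_i * z_j
    return len(y) / w.sum() * num / (z ** 2).sum()

# Four regions on a line; w_ij = 1 if i and j are adjacent
w = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

print(morans_i([1.0, 2.0, 3.0, 4.0], w))   # clustered pattern: I > 0
print(morans_i([1.0, 4.0, 1.0, 4.0], w))   # alternating pattern: I < 0
```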
Local Indicators
Local Moran's I: Detect clusters and outliers at specific locations
Getis-Ord Gi*: Hot spot detection (clusters of high or low values)
Spatial Analysis Workflow
Map the data; look for patterns
Test for spatial autocorrelation (Moran's I)
Identify clusters and outliers (LISA)
Model spatial relationships if needed (Chapter 11 covers spatial regression)
Worked Example: Income and Geography
Data: Median household income by U.S. county
Choropleth map reveals:
High income corridors (Northeast, West Coast)
Low income regions (rural South, Appalachia)
Metropolitan clustering
Moran's I = 0.65 (p < 0.001): Strong positive spatial autocorrelation
LISA analysis identifies:
Hot spots: Counties with high income surrounded by high income
Cold spots: Counties with low income surrounded by low income
Spatial outliers: High-income counties in poor regions
Interpretation: Income is highly clustered; understanding geographic determinants requires examining local economic conditions, historical factors, and spillovers.
6.5 Network Analysis
Why Networks Matter
Many social and economic phenomena have network structure:
Social networks: Friendships, professional connections, social media
Economic networks: Supply chains, trade relationships, financial linkages
Information networks: Citations, hyperlinks, knowledge flows
Organizational networks: Firms in markets, interlocking directorates
Network structure shapes outcomes: diseases spread through social contacts, financial crises propagate through bank lending networks, ideas diffuse through professional connections.
Network Fundamentals
A network (or graph) consists of:
Nodes (vertices): Individuals, firms, countries, etc.
Edges (links): Connections between nodes
Weights (optional): Strength of connections
Types of networks:
Directed: Edges have direction (A follows B ≠ B follows A)
Undirected: Edges are symmetric (A is friends with B = B is friends with A)
Bipartite: Two types of nodes with edges only between types (workers and firms)
Centrality Measures
Who is important in a network? Different centrality measures capture different concepts:
Degree: the number of connections; identifies popular or active nodes.
Betweenness: the fraction of shortest paths that pass through a node; identifies brokers and gatekeepers.
Closeness: the inverse of the average distance to all other nodes; identifies nodes with efficient access to the network.
Eigenvector: connections to well-connected nodes; identifies influence (knowing important people).
PageRank: an iterative importance measure (used by Google); captures prestige from incoming links.
Network Structure
Global properties:
Density: Fraction of possible edges that exist
Diameter: Longest shortest path
Clustering coefficient: Probability that two neighbors of a node are themselves connected ("friends of friends are friends")
Average path length: Mean distance between nodes
Community detection: Finding groups of densely connected nodes
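A minimal sketch of these measures using the networkx library on a small hypothetical undirected network (Chapter 8 presents fuller network code):

```python
import networkx as nx

# Toy undirected network of five nodes
G = nx.Graph([("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E")])

# Centrality measures
print(nx.degree_centrality(G))       # share of possible connections each node has
print(nx.betweenness_centrality(G))  # brokerage: shortest paths through each node
print(nx.closeness_centrality(G))    # inverse average distance to all other nodes
print(nx.eigenvector_centrality(G))  # connections to well-connected nodes
print(nx.pagerank(G))                # iterative importance

# Global structure
print(nx.density(G))                       # fraction of possible edges present
print(nx.diameter(G))                      # longest shortest path
print(nx.average_clustering(G))            # friends-of-friends tendency
print(nx.average_shortest_path_length(G))  # mean distance between nodes
```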
Economic Applications
Labor markets: Job search through networks; referral hiring
Finding: Most jobs found through "weak ties" (Granovetter 1973)
Mechanism: Weak ties bridge different social circles, providing novel information
Financial contagion: Bank failures spreading through lending networks
Systemic risk depends on network structure
Highly connected "hub" banks are too-interconnected-to-fail
Trade networks: Patterns of international trade
Core-periphery structure: Rich countries trade with everyone; poor countries trade mainly with rich ones
Supply chain networks reveal economic dependencies
Development: Social networks and technology adoption
Agricultural innovations spread through farmer networks
Network structure predicts which interventions diffuse
Descriptive vs. Causal Network Analysis
Warning: Network position is typically endogenous.
If central individuals have better outcomes, is it because:
Centrality causes good outcomes? (information access, social support)
Good outcomes cause centrality? (successful people attract connections)
Third factors cause both? (ability drives both success and connections)
Descriptive network analysis characterizes structure. Causal claims require identification strategies (Chapter 9+).
See Chapter 8 for network analysis code in R and Python.
Box: Growth Accounting and the Limits of the Credibility Revolution
The credibility revolution (Chapter 1) transformed empirical economics by demanding clear identification strategies for causal claims. But some of the most important economic questions don't fit neatly into the RCT/IV/DiD framework.
The China puzzle: How did China achieve 9% annual growth for three decades? This is arguably the most consequential economic story of our era—800 million people lifted from poverty. Yet we cannot randomize economic systems, find instruments for "adopting market reforms," or construct a synthetic control China.
Growth accounting offers a descriptive decomposition:
GDP growth = α · capital growth + (1 − α) · labor growth + TFP growth
where α is capital's share of income.
For China (approximate):
GDP growth: ~9%/year
Capital contribution: ~4%/year
Labor contribution: ~1%/year
TFP (productivity) growth: ~4%/year
This tells us what happened (roughly half was capital accumulation, half was productivity improvement) but not why. TFP is famously "a measure of our ignorance."
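A minimal arithmetic sketch of the decomposition, treating TFP growth as the residual; the capital share α and the raw growth rates below are illustrative values chosen so the contributions match the approximate figures above:

```python
alpha = 0.5            # capital's share of income (illustrative)
gdp_growth = 0.09      # ~9% per year
capital_growth = 0.08  # chosen so alpha * capital_growth is roughly 4%
labor_growth = 0.02    # chosen so (1 - alpha) * labor_growth is roughly 1%

# TFP growth is what remains after the measured inputs are accounted for
tfp_growth = gdp_growth - alpha * capital_growth - (1 - alpha) * labor_growth
print(f"TFP growth = {tfp_growth:.1%}")   # about 4% per year
```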
The methodological point: Descriptive frameworks like growth accounting remain valuable precisely because they organize our understanding of phenomena too large and complex for credibility-revolution methods. They don't replace causal inference—they complement it by:
Establishing facts: Before asking "why," we need to know "what"
Guiding causal questions: TFP growth raises questions amenable to micro-evidence (What policies? Which reforms? What mechanisms?)
Providing context: Micro-estimates of specific interventions gain meaning within macro patterns
The most important economic questions often require multiple methods: descriptive decomposition to characterize the phenomenon, credibility-revolution methods to identify specific mechanisms, and qualitative analysis to understand institutions and context (Chapter 23).
6.6 Qualitative Bridge: Thick Description
The Concept
Anthropologist Clifford Geertz distinguished "thin description" (what happened) from "thick description" (what it meant):
"The thin description of a wink is the rapid contraction of an eyelid. The thick description includes the context, meaning, and cultural significance—was it a conspiratorial signal, a parody, or a twitch?"
Quantitative description captures the thin layer; qualitative description adds thickness.
Complementarity
Quantitative strengths:
Precision and comparability
Generalization across cases
Pattern detection at scale
Qualitative strengths:
Context and meaning
Depth of understanding
Discovery of unexpected connections
Example: Inequality Statistics
Quantitative description: The Gini coefficient for U.S. income rose from 0.35 in 1970 to 0.48 in 2020.
Thick description: Ethnographic research reveals what rising inequality means in lived experience—the anxieties of middle-class families, the strategies of the poor, the isolation of the wealthy, the spatial separation of classes.
The number and the narrative complement each other.
When to Combine
Before quantitative analysis: Qualitative research identifies relevant variables, meaningful categories, important distinctions
During analysis: Qualitative cases illustrate quantitative patterns
After analysis: Qualitative follow-up explains surprising findings
Practical Guidance
Choosing Visualization
Distribution of one variable: histogram (alternatives: density plot, box plot)
Two continuous variables: scatter plot (hexbin plot if dense)
Continuous × categorical: box plots by group (alternative: violin plots)
Time series: line chart (alternative: area chart)
Geographic data: choropleth map (alternative: dot-density map)
Many variables: pair plot or heat map (alternative: PCA-based visualization)
Common Pitfalls
Pitfall 1: Correlation isn't causation
A strong correlation in EDA doesn't establish a causal relationship.
How to avoid: Describe associations carefully; reserve causal language for methods from Part III.
Pitfall 2: Overinterpreting clusters
K-means will always find k clusters, even in random data.
How to avoid: Validate clusters; test stability; use criteria for optimal k.
Pitfall 3: P-hacking in EDA
Testing many relationships and reporting only significant ones.
How to avoid: Treat EDA as exploratory; confirm findings in separate data; report all analyses.
Pitfall 4: Chartjunk
Excessive decoration obscures data.
How to avoid: Follow Tufte's principles: maximize data-ink ratio; minimize chartjunk.
Visualization Principles
Show the data: Don't hide behind summaries
Facilitate comparison: Align scales; use small multiples
Minimize clutter: Remove unnecessary elements
Label clearly: Axis labels, titles, legends
Consider color blindness: Use colorblind-friendly palettes
Know your audience: Tailor complexity appropriately
Summary
Key takeaways:
Description precedes and complements causation: Before asking "why," understand "what."
Exploratory data analysis emphasizes visualization and openness to surprise. Look at the data before modeling.
Dimension reduction (PCA, factor analysis, clustering) makes high-dimensional data tractable.
Text as data transforms unstructured text into quantitative measures through preprocessing, topic models, and sentiment analysis.
Spatial patterns are pervasive in social data. Spatial autocorrelation is the norm, not the exception.
Thick description from qualitative research provides meaning and context that quantitative description lacks.
Returning to the opening question: We discover structure through visualization, dimension reduction, and pattern recognition. We communicate findings through clear, honest displays that let the data speak. Good description requires both quantitative precision and qualitative understanding—numbers and narrative together.
Further Reading
Essential
Tukey (1977), Exploratory Data Analysis - The foundational text
Tufte (2001), The Visual Display of Quantitative Information - Principles of visualization
For Deeper Understanding
Jolliffe (2002), Principal Component Analysis - PCA theory and practice
Grimmer, Roberts, and Stewart (2022), Text as Data - Comprehensive text analysis
Anselin (1988), Spatial Econometrics - Spatial analysis foundations
Advanced/Specialized
Blei, Ng, and Jordan (2003), "Latent Dirichlet Allocation" - Topic model foundation
Wickham (2010), "A Layered Grammar of Graphics" - ggplot2 foundations
Bivand, Pebesma, and Gómez-Rubio (2013), Applied Spatial Data Analysis with R
Applications
Jensen, Kaplan, Naidu, and Wilse-Samson (2012), "Political Polarization and the Dynamics of Political Language" - Congressional speech analysis
Jelveh, Kogut, and Naidu (2024), "Political Language in Economics" - Text analysis of economics research
Gentzkow and Shapiro (2010), "What Drives Media Slant?" - Text analysis in economics
Chetty et al. (2016), "The Effects of Exposure to Better Neighborhoods" - Geographic variation
Ash and Chen (2023), "Text Algorithms in Economics" - NLP applications
Exercises
Conceptual
Explain the difference between PCA and factor analysis. When would you prefer one over the other?
Why is spatial autocorrelation so common in social data? Give three examples and explain the mechanism.
A researcher fits a 10-topic LDA model to newspaper articles. How would you assess whether the topics are meaningful?
Applied
Using a dataset with multiple socioeconomic variables (income, education, occupation, etc.):
Conduct PCA and interpret the first two components
Cluster observations; characterize the clusters
Create a composite "socioeconomic status" index
Using county-level data:
Create a choropleth map of a variable of interest
Calculate Moran's I for spatial autocorrelation
Identify hot spots and cold spots
Using a text corpus (e.g., State of the Union addresses):
Preprocess the text
Create a document-term matrix
Fit a topic model with 5 topics
Interpret and label the topics
Discussion
"Data visualization is just making pretty pictures. The real work is in statistical analysis." Critique this view.