Chapter 6: Describing Patterns in Data

Opening Question

How do we discover structure in data and communicate what we find?


Chapter Overview

Before causation comes description. Before asking whether X causes Y, we need to understand what X and Y look like: their distributions, their relationships, their patterns across time and space. Good description is valuable in itself—documenting facts about the world is a core scientific activity—and essential preparation for causal analysis.

This chapter covers the tools of descriptive empirical work: exploratory data analysis, dimension reduction, text analysis, and spatial methods. These techniques help us see patterns, summarize complexity, and generate hypotheses. Description doesn't establish causation, but causation without description is blind.

What you will learn:

  • Principles of exploratory data analysis and visualization

  • Dimension reduction: PCA, factor analysis, clustering

  • Text as data: topic models, sentiment analysis, content analysis

  • Spatial patterns: mapping, spatial autocorrelation

  • The relationship between quantitative description and qualitative thick description

Prerequisites: Chapter 2 (Data), Chapter 3 (Statistical Foundations)


Historical Context: From Tables to Visualization

The history of data visualization parallels the history of empirical social science.

William Playfair (1759-1823) invented the line chart, bar chart, and pie chart. His 1786 Commercial and Political Atlas visualized England's trade data in forms we'd recognize today.

Florence Nightingale (1820-1910) used her "coxcomb" diagrams to show that more soldiers died from preventable disease than battle wounds—visualization as advocacy.

Charles Joseph Minard (1781-1870) created what Edward Tufte said "may well be the best statistical graphic ever drawn": a map of Napoleon's Russian campaign showing army size, direction, location, temperature, and time simultaneously.

The 20th century brought statistical formalization: correlation (Pearson), factor analysis (Spearman, Thurstone), exploratory data analysis (Tukey's 1977 EDA), and eventually computational tools.

Today: The explosion of data and computing power has transformed description. We can visualize millions of observations, fit models with thousands of variables, and analyze text at unprecedented scale.


6.1 Exploratory Data Analysis

The EDA Philosophy

John Tukey's exploratory data analysis (EDA) emphasizes looking at data before formal modeling:

"The greatest value of a picture is when it forces us to notice what we never expected to see." — John Tukey

Principles:

  • Let the data speak before imposing models

  • Use visual displays rather than just summary statistics

  • Be open to surprise

  • Iterate between looking and questioning

Univariate Description

For continuous variables:

Central tendency: Mean, median, trimmed mean

  • Mean is efficient if data are symmetric; median is robust to outliers

Spread: Standard deviation, IQR, MAD (median absolute deviation)

  • SD is efficient for normal data; IQR/MAD more robust

Shape: Skewness, kurtosis

  • Skewness > 0: Right tail heavier (income distributions)

  • Kurtosis > 3: Heavier tails than normal

Visualization:

  • Histogram: See the shape

  • Kernel density plot: Smoothed shape

  • Box plot: Five-number summary (min, Q1, median, Q3, max)

  • Q-Q plot: Compare to theoretical distribution

Figure 6.1: Four ways to visualize the same distribution. Histogram shows shape and frequency; density plot provides a smoothed view; box plot summarizes the five-number summary with outliers; Q-Q plot reveals departures from normality (the upward curve indicates right skew).
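A minimal Python sketch of these four displays, using simulated right-skewed data in place of a real variable (the `wages` vector below is illustrative):

```python
# Sketch: four views of one distribution (simulated, income-like data).
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(0)
wages = np.exp(rng.normal(3, 0.5, size=1_000))  # right-skewed, like hourly wages

fig, axes = plt.subplots(2, 2, figsize=(8, 6))

axes[0, 0].hist(wages, bins=30)                       # histogram: shape and frequency
axes[0, 0].set_title("Histogram")

grid = np.linspace(wages.min(), wages.max(), 200)     # kernel density: smoothed shape
axes[0, 1].plot(grid, stats.gaussian_kde(wages)(grid))
axes[0, 1].set_title("Kernel density")

axes[1, 0].boxplot(wages, vert=False)                 # box plot: five-number summary
axes[1, 0].set_title("Box plot")

stats.probplot(wages, dist="norm", plot=axes[1, 1])   # Q-Q plot against the normal
axes[1, 1].set_title("Q-Q plot")

fig.tight_layout()
plt.show()
```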

For categorical variables:

  • Frequency table

  • Bar chart (not pie chart—humans are bad at comparing angles)

  • Mode and category proportions

Bivariate Description

Two continuous variables:

  • Scatter plot: The fundamental display

  • Correlation coefficient: $r = \frac{\mathrm{Cov}(X,Y)}{\mathrm{SD}(X)\,\mathrm{SD}(Y)}$

  • Lowess/loess: Nonparametric smoothed relationship

Continuous and categorical:

  • Box plots by group

  • Kernel densities by group

  • Summary statistics by group

Two categorical variables:

  • Cross-tabulation (contingency table)

  • Chi-squared test of independence

  • Mosaic plot

Worked Example: Wage Distribution

Data: 10,000 U.S. workers from the CPS

Univariate description of hourly wages:

  • Mean: $25.50

  • Median: $20.00

  • SD: $15.00

  • Skewness: 2.3 (right-skewed)

  • Min: $7.25, Max: $150.00

Visualization reveals: Spike at minimum wage, long right tail, possible outliers above $100

Log transformation: Log wages are approximately normal—common for income data

Bivariate: Wages by education

  • Less than HS: Median $12.50

  • HS diploma: Median $17.00

  • Some college: Median $20.00

  • Bachelor's: Median $30.00

  • Graduate: Median $40.00

The educational gradient is immediately visible in side-by-side box plots.
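A sketch of this workflow in pandas; the file name and column names (`cps_extract.csv`, `wage`, `education`) are placeholders for whatever extract you use:

```python
# Sketch: univariate and grouped description of wages with pandas.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("cps_extract.csv")  # placeholder file with `wage` and `education` columns

# Univariate summary: center, spread, shape, range
print(df["wage"].agg(["mean", "median", "std", "skew", "min", "max"]))

# Log wages are often approximately normal for income data
df["log_wage"] = np.log(df["wage"])

# Bivariate: median wage by education, then side-by-side box plots
print(df.groupby("education")["wage"].median())
df.boxplot(column="wage", by="education")
plt.show()
```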

Multivariate Description

With many variables, visualization becomes harder. Strategies:

Small multiples: Repeat the same simple plot across subgroups or time periods, with shared scales for easy comparison

Pair plots: Matrix of bivariate scatter plots

Parallel coordinates: Each observation is a line across variable axes

Heat maps: Color-coded correlation matrices

Interactive visualization: Linked views, brushing, zooming
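A short sketch of two of these strategies with seaborn, assuming a numeric DataFrame `df`:

```python
# Sketch: pair plot and correlation heat map for a numeric DataFrame `df`.
import seaborn as sns
import matplotlib.pyplot as plt

sns.pairplot(df)  # matrix of bivariate scatter plots with marginal distributions
plt.show()

sns.heatmap(df.corr(), annot=True, center=0, cmap="vlag")  # color-coded correlation matrix
plt.show()
```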

What Description Can and Cannot Do

Description can:

  • Characterize distributions

  • Identify patterns and anomalies

  • Suggest hypotheses

  • Communicate findings

  • Establish facts

Description cannot:

  • Establish causation

  • Control for confounders

  • Prove mechanisms

  • Project out of sample (without strong assumptions)


6.2 Dimension Reduction

The Curse of Dimensionality

With many variables, data become sparse. In high dimensions, all points are far from all other points; nearest-neighbor methods fail; models become unstable.

Dimension reduction projects high-dimensional data onto a lower-dimensional space that preserves important structure.

Principal Component Analysis (PCA)

Idea: Find directions of maximum variance in the data; project onto those directions.

Mathematics: For a centered n × p data matrix $X$, the principal components are the eigenvectors of $X'X$.

First principal component: The direction of maximum variance, $w_1 = \arg\max_{\|w\|=1} \mathrm{Var}(Xw)$

Subsequent components: Orthogonal to previous, maximum remaining variance

Explained variance: The proportion of total variance captured by the first k components:

$$\frac{\sum_{j=1}^k \lambda_j}{\sum_{j=1}^p \lambda_j}$$

where $\lambda_j$ is the j-th eigenvalue.

Worked Example: Socioeconomic Status

Variables (measured for 1,000 households):

  • Income, Education years, Occupation prestige, Home value, Net worth, Car value

PCA results:

  • PC1 explains 65% of variance; loadings all positive → "overall SES"

  • PC2 explains 15% of variance; contrasts education vs. wealth → "human vs. financial capital"

  • First 2 PCs explain 80% of variance in 6 variables

Visualization: Plot households in PC1-PC2 space; see clustering, outliers, patterns
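A sketch of this analysis with scikit-learn; the DataFrame `df` and its column names are hypothetical stand-ins for the six household variables:

```python
# Sketch: PCA on standardized household SES variables (column names are placeholders).
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

cols = ["income", "education_years", "occupation_prestige",
        "home_value", "net_worth", "car_value"]
X = StandardScaler().fit_transform(df[cols])  # standardize so no variable dominates

pca = PCA()
scores = pca.fit_transform(X)

print(pca.explained_variance_ratio_)  # share of variance per component
print(pca.components_[:2])            # loadings of the first two components

# Plot households in PC1-PC2 space to look for clusters and outliers
plt.scatter(scores[:, 0], scores[:, 1], s=5)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()
```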

Factor Analysis

Idea: Variables are driven by latent factors plus idiosyncratic error.

Model: $X_i = \Lambda f_i + \varepsilon_i$

where:

  • $X_i$ = observed variables (p × 1)

  • $f_i$ = latent factors (k × 1, k << p)

  • $\Lambda$ = factor loadings (p × k)

  • $\varepsilon_i$ = unique factors

Difference from PCA:

  • PCA: Components are exact linear combinations of observed variables

  • Factor analysis: Factors are latent; model includes error structure

Box: PCA vs. Factor Analysis—A Deeper Comparison

These methods are often confused because they can produce similar-looking output. But they embody different philosophies.

| Aspect | PCA | Factor Analysis |
|---|---|---|
| Goal | Variance summarization | Latent variable modeling |
| Direction | Data → Components | Latent factors → Data |
| Model | No statistical model | Explicit generative model |
| Error | None (exact decomposition) | Unique factors $\varepsilon_i$ |
| Components/Factors | Defined by data | Assumed to exist |
| Uniqueness | Unique solution | Rotation required |
| Sample size | Works with n < p | Typically needs n >> p |

Key intuition:

PCA asks: "What linear combinations of my variables capture the most variance?" The first principal component is the direction where your data is most spread out. There's no assumption about why—just finding the best summary.

Factor analysis asks: "What latent constructs might have generated these correlated variables?" It posits that unobserved factors (like "intelligence" or "SES") cause the observed correlations. The error term matters—it represents variation unique to each variable.

When each is appropriate:

  • PCA: Dimension reduction for prediction, visualization, dealing with multicollinearity, no theoretical commitment to latent constructs

  • Factor analysis: Psychometrics (measuring personality, ability), constructing validated scales, when you believe latent factors exist

Common mistake: Using PCA when you have a measurement model in mind. If you believe "socioeconomic status" is a real latent variable that causes income, education, and occupation prestige to be correlated, factor analysis is theoretically appropriate. PCA just finds variance-maximizing directions without causal interpretation.

Rotation: Factor solutions are not unique. Rotation (varimax, promax) seeks interpretable structure.
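A minimal scikit-learn sketch of a rotated two-factor solution, assuming a standardized data matrix `X` (the `rotation` argument requires scikit-learn 0.24 or later):

```python
# Sketch: two-factor model with varimax rotation on a standardized matrix X.
from sklearn.decomposition import FactorAnalysis

fa = FactorAnalysis(n_components=2, rotation="varimax")
factor_scores = fa.fit_transform(X)  # n x 2 estimated factor scores
print(fa.components_.T)              # p x 2 loading matrix (Lambda)
print(fa.noise_variance_)            # unique variances (the epsilon_i)
```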

Clustering

Idea: Group observations into clusters with similar characteristics.

K-means clustering:

  1. Choose k cluster centers

  2. Assign each observation to nearest center

  3. Update centers as cluster means

  4. Iterate until convergence

Hierarchical clustering:

  • Build tree (dendrogram) of nested clusters

  • Agglomerative: Start with individual observations, merge

  • Divisive: Start with one cluster, split

Choosing k: Elbow method (variance explained), silhouette score, gap statistic
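A sketch of k-means with a silhouette-based choice of k, assuming a standardized feature matrix `X`:

```python
# Sketch: k-means with a silhouette-based choice of k on a feature matrix X.
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(silhouette_score(X, km.labels_), 3))

# Refit at the chosen k and characterize the clusters via their centers
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)
```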

When to Use Which

| Goal | Method |
|---|---|
| Summarize variance | PCA |
| Posit latent constructs | Factor analysis |
| Group similar observations | Clustering |
| Prediction with many variables | PCA → regression |
| Construct indices | Factor analysis or PCA |


6.3 Text as Data

The Text Revolution

Text data—documents, speeches, social media, surveys—are abundant but unstructured. Quantitative text analysis transforms words into numbers.

Applications:

  • Measuring political ideology from speeches

  • Analyzing consumer sentiment from reviews

  • Tracking media coverage

  • Coding open-ended survey responses

Preprocessing

Before analysis, text requires preprocessing:

Tokenization: Split text into words (tokens)

Case normalization: Convert to lowercase

Stop word removal: Remove common words (the, and, is)

Stemming/lemmatization: Reduce words to stems (running → run)

N-grams: Create multi-word tokens (e.g., "New York" as single token)

Bag of Words and TF-IDF

Bag of words: Represent document as vector of word counts (ignoring order)

Document-term matrix: Rows are documents, columns are terms, cells are counts

TF-IDF (Term Frequency-Inverse Document Frequency):

$$\text{TF-IDF}_{ij} = \text{TF}_{ij} \times \log\frac{N}{\text{DF}_j}$$

where:

  • $\text{TF}_{ij}$ = frequency of term j in document i

  • $\text{DF}_j$ = number of documents containing term j

  • $N$ = total number of documents

TF-IDF upweights distinctive terms; downweights common terms.
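A sketch with scikit-learn's vectorizers, assuming `docs` is a list of raw text strings; note that scikit-learn's IDF is a smoothed variant of the formula above:

```python
# Sketch: bag-of-words counts and TF-IDF weights for a list of strings `docs`.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

counts = CountVectorizer(lowercase=True, stop_words="english",
                         ngram_range=(1, 2)).fit_transform(docs)  # unigrams and bigrams
tfidf = TfidfVectorizer(lowercase=True, stop_words="english").fit_transform(docs)

print(counts.shape, tfidf.shape)  # sparse document-term matrices: documents x terms
```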

Topic Models

Idea: Documents are mixtures of topics; topics are distributions over words.

Latent Dirichlet Allocation (LDA):

  • Each document has a distribution over topics

  • Each topic has a distribution over words

  • Generative model inferred from observed word patterns

Output:

  • Topic-word distributions: What words characterize each topic?

  • Document-topic distributions: What topics are in each document?

Example: Apply LDA to Congressional speeches

  • Topic 1: Economy (tax, budget, spending, jobs)

  • Topic 2: Foreign policy (war, military, treaty, alliance)

  • Topic 3: Social issues (family, children, health, education)

Each speech is a mixture of these topics.
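A sketch of fitting and inspecting an LDA model with scikit-learn, assuming `docs` is a list of speech texts:

```python
# Sketch: LDA on a document-term matrix built from a corpus `docs`.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

vectorizer = CountVectorizer(stop_words="english", min_df=5)
dtm = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=3, random_state=0).fit(dtm)

# Topic-word distributions: the top words characterizing each topic
terms = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    print(f"Topic {k}:", ", ".join(terms[topic.argsort()[-8:][::-1]]))

# Document-topic distributions: each row gives one document's topic mixture
doc_topics = lda.transform(dtm)
```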

An influential application of this approach is Jensen, Kaplan, Naidu, and Wilse-Samson (2012), who analyze 130 years of Congressional speech to measure partisan language. They identify phrases that distinguish Republican from Democratic speakers and track how this partisan vocabulary diffuses through public discourse. The work demonstrates how text methods can transform qualitative historical records into quantitative measures of political polarization. A related strand examines whether economists' research itself exhibits political language patterns (Jelveh, Kogut, and Naidu 2024).

Sentiment Analysis

Idea: Measure positive/negative sentiment in text.

Dictionary methods:

  • Count positive words (excellent, wonderful) and negative words (terrible, awful)

  • Net sentiment = positive - negative
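A toy dictionary-count sketch (the word lists are illustrative; real analyses use validated lexicons such as LIWC or VADER):

```python
# Sketch: dictionary-based sentiment with toy word lists (illustrative only).
POSITIVE = {"excellent", "wonderful", "good", "great"}
NEGATIVE = {"terrible", "awful", "bad", "poor"}

def net_sentiment(text: str) -> int:
    tokens = text.lower().split()
    return sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)

print(net_sentiment("The food was excellent but the service was awful"))  # 0
```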

Machine learning methods:

  • Train classifier on labeled examples (positive/negative reviews)

  • Apply to new text

Challenges:

  • Negation ("not good")

  • Sarcasm

  • Domain specificity (terms mean different things in different contexts)

Text Analysis Workflow

  1. Collect text corpus

  2. Preprocess (tokenize, clean, transform)

  3. Explore (word frequencies, co-occurrences)

  4. Model (topic model, sentiment, classification)

  5. Validate (human coding comparison)

  6. Analyze (relate text measures to outcomes)


6.4 Spatial Patterns

Why Space Matters

"Everything is related to everything else, but near things are more related than distant things." — Waldo Tobler's First Law of Geography

Spatial patterns are pervasive in social data:

  • Housing prices depend on neighborhood

  • Crime clusters geographically

  • Economic activity agglomerates

  • Health outcomes vary by region

Mapping and Visualization

Choropleth maps: Color regions by variable value

  • Good for rates, densities, indices

  • Challenge: Area and population may not align

Point maps: Plot individual locations

  • Good for events, facilities, observations

  • Challenge: Overplotting in dense areas

Flow maps: Show movement between locations

  • Good for migration, trade, commuting

Figure 6.2: Three approaches to spatial visualization. (a) Choropleth colors regions by value intensity; (b) Dot density places points proportionally within each region; (c) Proportional symbols size markers by value, preserving geographic location.
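A minimal choropleth sketch with GeoPandas, assuming a county boundary file and a `median_income` column (both names are placeholders):

```python
# Sketch: county-level choropleth with GeoPandas (file and column names are placeholders).
import geopandas as gpd
import matplotlib.pyplot as plt

counties = gpd.read_file("counties.geojson")  # geometries plus a `median_income` column
counties.plot(column="median_income", cmap="viridis", legend=True)
plt.axis("off")
plt.show()
```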

Spatial Autocorrelation

Definition 6.1 (Spatial Autocorrelation): Spatial autocorrelation measures the degree to which values at nearby locations are similar (positive autocorrelation) or dissimilar (negative autocorrelation).

Moran's I:

$$I = \frac{n}{\sum_i \sum_j w_{ij}} \cdot \frac{\sum_i \sum_j w_{ij}(y_i - \bar{y})(y_j - \bar{y})}{\sum_i (y_i - \bar{y})^2}$$

where $w_{ij}$ is a spatial weight (e.g., 1 if i and j are neighbors, 0 otherwise).

Interpretation:

  • $I > 0$: Positive autocorrelation (clustering)

  • $I = 0$: Random spatial pattern

  • $I < 0$: Negative autocorrelation (dispersion)
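A from-scratch NumPy sketch of this formula; packages such as libpysal/esda provide tested implementations with permutation-based inference:

```python
# Sketch: Moran's I from scratch. `y` holds the values; `W` is an n x n binary
# neighbor matrix with zero diagonal (entry (i, j) equals 1 if i and j are neighbors).
import numpy as np

def morans_i(y, W):
    z = np.asarray(y, dtype=float) - np.mean(y)
    num = (W * np.outer(z, z)).sum()   # sum_i sum_j w_ij (y_i - ybar)(y_j - ybar)
    den = (z ** 2).sum()               # sum_i (y_i - ybar)^2
    return (len(z) / W.sum()) * (num / den)
```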

Local Indicators

Local Moran's I: Detect clusters and outliers at specific locations

Getis-Ord Gi*: Hot spot detection (clusters of high or low values)

Spatial Analysis Workflow

  1. Map the data; look for patterns

  2. Test for spatial autocorrelation (Moran's I)

  3. Identify clusters and outliers (LISA)

  4. Model spatial relationships if needed (Chapter 11 covers spatial regression)

Worked Example: Income and Geography

Data: Median household income by U.S. county

Choropleth map reveals:

  • High income corridors (Northeast, West Coast)

  • Low income regions (rural South, Appalachia)

  • Metropolitan clustering

Moran's I = 0.65 (p < 0.001): Strong positive spatial autocorrelation

LISA analysis identifies:

  • Hot spots: Counties with high income surrounded by high income

  • Cold spots: Counties with low income surrounded by low income

  • Spatial outliers: High-income counties in poor regions

Interpretation: Income is highly clustered; understanding geographic determinants requires examining local economic conditions, historical factors, and spillovers.


6.5 Network Analysis

Why Networks Matter

Many social and economic phenomena have network structure:

  • Social networks: Friendships, professional connections, social media

  • Economic networks: Supply chains, trade relationships, financial linkages

  • Information networks: Citations, hyperlinks, knowledge flows

  • Organizational networks: Firms in markets, interlocking directorates

Network structure shapes outcomes: diseases spread through social contacts, financial crises propagate through bank lending networks, ideas diffuse through professional connections.

Network Fundamentals

A network (or graph) consists of:

  • Nodes (vertices): Individuals, firms, countries, etc.

  • Edges (links): Connections between nodes

  • Weights (optional): Strength of connections

Types of networks:

  • Directed: Edges have direction (A follows B ≠ B follows A)

  • Undirected: Edges are symmetric (A is friends with B = B is friends with A)

  • Bipartite: Two types of nodes with edges only between types (workers and firms)

Centrality Measures

Who is important in a network? Different centrality measures capture different concepts:

| Measure | Definition | Interpretation |
|---|---|---|
| Degree | Number of connections | Popular/active |
| Betweenness | Fraction of shortest paths through node | Broker/gatekeeper |
| Closeness | Inverse average distance to all nodes | Efficient access to network |
| Eigenvector | Connections to well-connected nodes | Influence (knows important people) |
| PageRank | Iterative importance (Google) | Prestige from incoming links |
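A quick preview of these measures in NetworkX (Chapter 8 develops network code in full); the karate club graph is a standard toy example:

```python
# Sketch: centrality measures in NetworkX on a standard toy network.
import networkx as nx

G = nx.karate_club_graph()  # small undirected friendship network

degree = nx.degree_centrality(G)
betweenness = nx.betweenness_centrality(G)
closeness = nx.closeness_centrality(G)
eigenvector = nx.eigenvector_centrality(G)
pagerank = nx.pagerank(G)

# The two faction leaders (nodes 0 and 33) rank at the top by degree
print(sorted(degree, key=degree.get, reverse=True)[:3])
```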

Network Structure

Global properties:

  • Density: Fraction of possible edges that exist

  • Diameter: Longest shortest path

  • Clustering coefficient: Probability that two neighbors of a node are themselves connected ("friends of friends are friends")

  • Average path length: Mean distance between nodes

Community detection: Finding groups of densely connected nodes

Economic Applications

Labor markets: Job search through networks; referral hiring

  • Finding: Most jobs found through "weak ties" (Granovetter 1973)

  • Mechanism: Weak ties bridge different social circles, providing novel information

Financial contagion: Bank failures spreading through lending networks

  • Systemic risk depends on network structure

  • Highly connected "hub" banks are too-interconnected-to-fail

Trade networks: Patterns of international trade

  • Core-periphery structure: Rich countries trade with everyone; poor countries trade mainly with rich ones

  • Supply chain networks reveal economic dependencies

Development: Social networks and technology adoption

  • Agricultural innovations spread through farmer networks

  • Network structure predicts which interventions diffuse

Descriptive vs. Causal Network Analysis

Warning: Network position is typically endogenous.

If central individuals have better outcomes, is it because:

  • Centrality causes good outcomes? (information access, social support)

  • Good outcomes cause centrality? (successful people attract connections)

  • Third factors cause both? (ability drives both success and connections)

Descriptive network analysis characterizes structure. Causal claims require identification strategies (Chapter 9+).

See Chapter 8 for network analysis code in R and Python.


Box: Growth Accounting and the Limits of the Credibility Revolution

The credibility revolution (Chapter 1) transformed empirical economics by demanding clear identification strategies for causal claims. But some of the most important economic questions don't fit neatly into the RCT/IV/DiD framework.

The China puzzle: How did China achieve 9% annual growth for three decades? This is arguably the most consequential economic story of our era—800 million people lifted from poverty. Yet we cannot randomize economic systems, find instruments for "adopting market reforms," or construct a synthetic control China.

Growth accounting offers a descriptive decomposition:

$$\text{GDP growth} = \alpha \cdot \text{Capital growth} + (1-\alpha) \cdot \text{Labor growth} + \text{TFP growth}$$

where $\alpha$ is capital's share of income.

For China (approximate):

  • GDP growth: ~9%/year

  • Capital contribution: ~4%/year

  • Labor contribution: ~1%/year

  • TFP (productivity) growth: ~4%/year

This tells us what happened (roughly half was capital accumulation, half was productivity improvement) but not why. TFP is famously "a measure of our ignorance."
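A back-of-the-envelope sketch of the decomposition; the input growth rates and the capital share of 0.5 are illustrative numbers chosen to reproduce the rough contributions above:

```python
# Sketch: TFP growth as the residual of the growth-accounting identity.
# The growth rates and the capital share (alpha = 0.5) are illustrative.
alpha = 0.5                 # capital's share of income
gdp_growth = 0.09
capital_growth = 0.08       # contributes alpha * 0.08 = 0.04
labor_growth = 0.02         # contributes (1 - alpha) * 0.02 = 0.01
tfp_growth = gdp_growth - alpha * capital_growth - (1 - alpha) * labor_growth
print(tfp_growth)           # ~0.04, i.e., about 4 percentage points per year
```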

The methodological point: Descriptive frameworks like growth accounting remain valuable precisely because they organize our understanding of phenomena too large and complex for credibility-revolution methods. They don't replace causal inference—they complement it by:

  1. Establishing facts: Before asking "why," we need to know "what"

  2. Guiding causal questions: TFP growth raises questions amenable to micro-evidence (What policies? Which reforms? What mechanisms?)

  3. Providing context: Micro-estimates of specific interventions gain meaning within macro patterns

The most important economic questions often require multiple methods: descriptive decomposition to characterize the phenomenon, credibility-revolution methods to identify specific mechanisms, and qualitative analysis to understand institutions and context (Chapter 23).


6.6 Qualitative Bridge: Thick Description

The Concept

Anthropologist Clifford Geertz distinguished "thin description" (what happened) from "thick description" (what it meant):

"The thin description of a wink is the rapid contraction of an eyelid. The thick description includes the context, meaning, and cultural significance—was it a conspiratorial signal, a parody, or a twitch?"

Quantitative description captures the thin layer; qualitative description adds thickness.

Complementarity

Quantitative strengths:

  • Precision and comparability

  • Generalization across cases

  • Pattern detection at scale

Qualitative strengths:

  • Context and meaning

  • Depth of understanding

  • Discovery of unexpected connections

Example: Inequality Statistics

Quantitative description: The Gini coefficient for U.S. income rose from 0.35 in 1970 to 0.48 in 2020.

Thick description: Ethnographic research reveals what rising inequality means in lived experience—the anxieties of middle-class families, the strategies of the poor, the isolation of the wealthy, the spatial separation of classes.

The number and the narrative complement each other.

When to Combine

Before quantitative analysis: Qualitative research identifies relevant variables, meaningful categories, important distinctions

During analysis: Qualitative cases illustrate quantitative patterns

After analysis: Qualitative follow-up explains surprising findings


Practical Guidance

Choosing Visualization

| Data Type | Basic Choice | Alternative |
|---|---|---|
| Distribution | Histogram | Density, box plot |
| Two continuous | Scatter plot | Hexbin if dense |
| Continuous × categorical | Box plots by group | Violin plots |
| Time series | Line chart | Area chart |
| Geographic | Choropleth | Dot density |
| Many variables | Pair plot, heat map | PCA visualization |

Common Pitfalls

Pitfall 1: Correlation isn't causation A strong correlation in EDA doesn't establish a causal relationship.

How to avoid: Describe associations carefully; reserve causal language for methods from Part III.

Pitfall 2: Overinterpreting clusters K-means will always find k clusters, even in random data.

How to avoid: Validate clusters; test stability; use criteria for optimal k.

Pitfall 3: P-hacking in EDA Testing many relationships and reporting only significant ones.

How to avoid: Treat EDA as exploratory; confirm findings in separate data; report all analyses.

Pitfall 4: Chartjunk Excessive decoration obscures data.

How to avoid: Follow Tufte's principles: maximize data-ink ratio; minimize chartjunk.

Visualization Principles

  1. Show the data: Don't hide behind summaries

  2. Facilitate comparison: Align scales; use small multiples

  3. Minimize clutter: Remove unnecessary elements

  4. Label clearly: Axis labels, titles, legends

  5. Consider color blindness: Use colorblind-friendly palettes

  6. Know your audience: Tailor complexity appropriately


Summary

Key takeaways:

  1. Description precedes and complements causation: Before asking "why," understand "what."

  2. Exploratory data analysis emphasizes visualization and openness to surprise. Look at the data before modeling.

  3. Dimension reduction (PCA, factor analysis, clustering) makes high-dimensional data tractable.

  4. Text as data transforms unstructured text into quantitative measures through preprocessing, topic models, and sentiment analysis.

  5. Spatial patterns are pervasive in social data. Spatial autocorrelation is the norm, not the exception.

  6. Thick description from qualitative research provides meaning and context that quantitative description lacks.

Returning to the opening question: We discover structure through visualization, dimension reduction, and pattern recognition. We communicate findings through clear, honest displays that let the data speak. Good description requires both quantitative precision and qualitative understanding—numbers and narrative together.


Further Reading

Essential

  • Tukey (1977), Exploratory Data Analysis - The foundational text

  • Tufte (2001), The Visual Display of Quantitative Information - Principles of visualization

For Deeper Understanding

  • Jolliffe (2002), Principal Component Analysis - PCA theory and practice

  • Grimmer, Roberts, and Stewart (2022), Text as Data - Comprehensive text analysis

  • Anselin (1988), Spatial Econometrics - Spatial analysis foundations

Advanced/Specialized

  • Blei, Ng, and Jordan (2003), "Latent Dirichlet Allocation" - Topic model foundation

  • Wickham (2010), "A Layered Grammar of Graphics" - ggplot2 foundations

  • Bivand, Pebesma, and Gómez-Rubio (2013), Applied Spatial Data Analysis with R

Applications

  • Jensen, Kaplan, Naidu, and Wilse-Samson (2012), "Political Polarization and the Dynamics of Political Language" - Congressional speech analysis

  • Jelveh, Kogut, and Naidu (2024), "Political Language in Economics" - Text analysis of economics research

  • Gentzkow and Shapiro (2010), "What Drives Media Slant?" - Text analysis in economics

  • Chetty et al. (2016), "The Effects of Exposure to Better Neighborhoods" - Geographic variation

  • Ash and Chen (2023), "Text Algorithms in Economics" - NLP applications


Exercises

Conceptual

  1. Explain the difference between PCA and factor analysis. When would you prefer one over the other?

  2. Why is spatial autocorrelation so common in social data? Give three examples and explain the mechanism.

  3. A researcher fits a 10-topic LDA model to newspaper articles. How would you assess whether the topics are meaningful?

Applied

  1. Using a dataset with multiple socioeconomic variables (income, education, occupation, etc.):

    • Conduct PCA and interpret the first two components

    • Cluster observations; characterize the clusters

    • Create a composite "socioeconomic status" index

  2. Using county-level data:

    • Create a choropleth map of a variable of interest

    • Calculate Moran's I for spatial autocorrelation

    • Identify hot spots and cold spots

  3. Using a text corpus (e.g., State of the Union addresses):

    • Preprocess the text

    • Create a document-term matrix

    • Fit a topic model with 5 topics

    • Interpret and label the topics

Discussion

  1. "Data visualization is just making pretty pictures. The real work is in statistical analysis." Critique this view.
