Chapter 2: Data—The Raw Material

Opening Question

How do we obtain reliable information about the social world, and what can go wrong along the way?


Chapter Overview

Empirical research begins with data. But data don't arrive pristine. They are collected by institutions with their own purposes, recorded in systems designed for administration rather than research, measured with instruments that may distort what we seek to understand, and reported by people who may misremember, deceive, or decline to respond. Understanding where data come from, how they're measured, and what can go wrong is essential for drawing valid conclusions.

This chapter surveys social science data sources: the major types, the challenges each presents, and the strategies for assessing and improving data quality. The goal is not to make you a data collection expert, but a sophisticated consumer of data, able to recognize quality issues and understand their implications for analysis.

What you will learn:

  • The major types of data sources and their relative strengths

  • How to assess validity and reliability of measures

  • Common problems: missing data, measurement error, selection

  • Strategies for linking datasets and constructing analysis samples

  • How qualitative data complements quantitative sources

Prerequisites: None—this is a foundational chapter


2.1 Types of Data Sources

Administrative Data

Administrative data are collected by governments and organizations for operational purposes, then repurposed for research.

Examples:

  • Tax records (income, employment)

  • Social Security earnings histories

  • Medicare/Medicaid claims (healthcare utilization)

  • School enrollment and test scores

  • Criminal justice records

  • Business registries

Strengths:

  • Coverage: Often universal or near-universal within the relevant population

  • No recall bias: Records are contemporaneous, not dependent on memory

  • Large samples: Often the entire population of interest

  • Long time spans: Administrative systems persist for decades

  • Low cost: Data already exist; marginal cost of research access is low

Weaknesses:

  • Limited variables: Only what the system collects; often lacks covariates researchers want

  • Measurement tied to administrative categories: "Income" means what the tax code says, not what economists mean

  • Gaming and manipulation: Actors respond strategically (tax avoidance, gaming education metrics)

  • Access restrictions: Privacy concerns limit access; approval processes are lengthy

  • Changes over time: Administrative definitions change, creating discontinuities

Example: U.S. Social Security earnings data

Strengths: Universal coverage of formal employment, linked over lifetimes, exact earnings (up to taxable maximum)

Weaknesses: Self-employment earnings may be underreported; earnings above the taxable maximum are censored (top-coded); no information on hours, occupation, or job characteristics

Survey Data

Surveys ask people directly about their characteristics, behaviors, and attitudes.

Types of surveys:

  • Cross-sectional: One-time snapshot (e.g., General Social Survey)

  • Repeated cross-section: Same questions, different samples over time (e.g., CPS monthly)

  • Panel/longitudinal: Same individuals tracked over time (e.g., PSID, NLSY)

Strengths:

  • Researcher control: Can ask exactly what you want to know

  • Subjective measures: Attitudes, beliefs, well-being—things administrative data can't capture

  • Standardization: Designed for research, with documentation

  • Contextual information: Rich covariates, household structure, history

Weaknesses:

  • Response bias: People may not report truthfully (social desirability, sensitive questions)

  • Recall error: Memory is imperfect, especially for dates and amounts

  • Nonresponse: Not everyone agrees to participate; nonresponse may be selective

  • Cost: Surveys are expensive; sample sizes limited by budget

  • Attrition: In panels, people drop out over time

Experimental Data

Experiments generate data through controlled intervention (Chapter 10 covers experimental design in depth).

Strengths:

  • Internal validity: Randomization ensures treatment-control comparability

  • Designed for causal inference: Variables and timing chosen for research purpose

  • Controlled measurement: Can standardize data collection across groups

Weaknesses:

  • External validity: Experimental populations and settings may not generalize

  • Hawthorne effects: Being studied may change behavior

  • Ethical constraints: Some interventions can't be randomized

  • Cost and logistics: Experiments are expensive and complex

Observational/Found Data

Observational data are collected for purposes other than research or generated naturally by human activity.

Examples:

  • Historical records (censuses, trade statistics, price lists)

  • Newspaper archives

  • Corporate financial data

  • Geographic information

Strengths:

  • Covers questions surveys can't: Historical, rare, or sensitive topics

  • May be the only option: For historical analysis or when surveys are infeasible

  • Rich context: Documents provide qualitative alongside quantitative information

Weaknesses:

  • Selection: What survived or was recorded may not be representative

  • Measurement inconsistency: Categories and definitions change over time

  • Requires expertise: Understanding context is essential for valid interpretation

Digital Trace Data

Digital systems generate massive amounts of data as byproducts of online activity.

Examples:

  • Social media posts and interactions

  • Web browsing and search histories

  • Mobile phone location data

  • E-commerce transactions

  • Sensor and IoT data

Strengths:

  • Scale: Billions of observations

  • Granularity: Fine-grained temporal and behavioral detail

  • Real behavior: What people do, not just what they say

  • Real-time: Captures dynamics as they unfold

Weaknesses:

  • Selection: Not everyone uses digital platforms equally

  • Construct validity: What does a "like" or "share" actually measure?

  • Platform changes: Data collection depends on platform policies that change

  • Privacy and ethics: Consent is murky; potential for harm

  • Noise: Much data is uninformative; signal extraction is hard

Unstructured Data

A growing share of empirical work uses data that don't fit neatly into rows and columns: text, images, audio, and video. These require specialized methods to convert into analyzable form.

Text data:

  • Company earnings calls, congressional speeches, news articles

  • Social media posts, product reviews, open-ended survey responses

  • Historical documents, legal filings, patent applications

  • Methods: Sentiment analysis, topic modeling, word embeddings, LLMs (see Chapter 8)

Image data:

  • Satellite imagery (nighttime lights as economic activity, deforestation)

  • Street View images (neighborhood characteristics, property values)

  • Historical photographs (infrastructure, urban change)

  • Medical imaging, product images

  • Methods: Computer vision, convolutional neural networks

Audio and video:

  • Recorded interviews, speeches, debates

  • Earnings call audio (tone, emotion beyond transcript)

  • Surveillance and body camera footage

  • Methods: Speech recognition, acoustic analysis, video understanding

Strengths:

  • Rich information not captured in structured data

  • Often available at scale (millions of documents, global satellite coverage)

  • Captures nuance and context

Weaknesses:

  • Requires ML/NLP expertise or specialized tools

  • Measurement validity is harder to assess (what does a "sentiment score" really mean?)

  • Computationally intensive

  • Training data may embed biases

Box: From Unstructured to Structured

The practical workflow involves converting unstructured data to analyzable variables:

| Raw Data | Extracted Feature | Use in Analysis |
| --- | --- | --- |
| Earnings call transcript | Sentiment score, topic proportions | Predict stock returns |
| Satellite nighttime lights | Pixel intensity by region-year | Proxy for GDP in data-poor countries |
| News articles | Named entity counts, event indicators | Measure policy uncertainty |
| Product reviews | Star rating, aspect-level sentiment | Study consumer preferences |

The feature extraction step is where most methodological challenges arise. See Chapter 8 for implementation.
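
To make the extraction step concrete, here is a minimal sketch (standard-library Python; the keyword lists are hypothetical, not a validated dictionary) that turns one raw news article into a crude event-indicator feature, in the spirit of the third row of the table:

```python
import re

# Hypothetical keyword lists for a crude policy-uncertainty indicator;
# real keyword-based indices rely on curated term sets and human audits (see Chapter 8).
UNCERTAINTY_TERMS = {"uncertain", "uncertainty"}
POLICY_TERMS = {"regulation", "congress", "legislation", "deficit", "tariff"}

def article_features(article_id: str, text: str) -> dict:
    """Convert one unstructured article into a row of structured features."""
    tokens = re.findall(r"[a-z]+", text.lower())
    token_set = set(tokens)
    return {
        "article_id": article_id,
        "n_tokens": len(tokens),
        # Event indicator: mentions both an uncertainty term and a policy term
        "policy_uncertainty": int(bool(UNCERTAINTY_TERMS & token_set)
                                  and bool(POLICY_TERMS & token_set)),
    }

sample = "Congress delayed the tariff bill, leaving firms uncertain about costs."
print(article_features("demo-001", sample))   # policy_uncertainty = 1 for this article
```

The resulting rows can then be aggregated (for example, the monthly share of articles flagged) and merged with conventional datasets for analysis.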

Major Data Repositories and Resources

Empirical researchers benefit from knowing the major repositories of publicly available data. Here are essential resources organized by type:

General-purpose archives:

  • ICPSR (Inter-university Consortium for Political and Social Research): The largest social science data archive, hosting thousands of datasets with documentation

  • Harvard Dataverse: Open repository for research data across disciplines

  • UK Data Service: British equivalent to ICPSR, excellent for UK and comparative data

Harmonized microdata:

  • IPUMS (Integrated Public Use Microdata Series): Harmonized census and survey microdata from the U.S. and internationally—essential for historical and cross-national research

  • Luxembourg Income Study (LIS): Harmonized income and wealth microdata from 50+ countries

  • Comparative Political Data Set: Cross-national data on political and economic indicators

Economic and financial data:

  • FRED (Federal Reserve Economic Data): Macroeconomic time series, easily accessible via API

  • World Bank Open Data: Development indicators for all countries

  • Penn World Table: Internationally comparable GDP, capital, and productivity measures

  • WRDS (Wharton Research Data Services): Financial and accounting data (institutional subscription required)

Health and demographic data:

  • NHANES (National Health and Nutrition Examination Survey): Physical exams and health measures

  • HRS (Health and Retirement Study): Longitudinal data on aging Americans

  • DHS (Demographic and Health Surveys): Standardized surveys in developing countries

Linked and administrative data:

  • Census Longitudinal Infrastructure: Links Census surveys over time

  • LEHD (Longitudinal Employer-Household Dynamics): Employer-employee matched data

  • Many countries now offer researcher access to linked administrative records through statistical agencies or secure data centers

Getting started: For most topics, begin with ICPSR or IPUMS. Search for existing datasets before collecting new data—someone may have already collected what you need.
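
Many of these sources can also be queried programmatically. As one illustration, a minimal sketch of pulling a macroeconomic series from FRED, assuming the pandas_datareader package is installed and network access is available:

```python
import datetime as dt

import pandas_datareader.data as web  # assumes pandas_datareader is installed

# "UNRATE" is the FRED code for the civilian unemployment rate (monthly).
start, end = dt.datetime(2000, 1, 1), dt.datetime(2020, 12, 31)
unrate = web.DataReader("UNRATE", "fred", start, end)

print(unrate.head())                  # DataFrame indexed by date, one column: UNRATE
print(unrate["UNRATE"].describe())    # quick summary before any analysis
```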


2.2 Measurement: Validity and Reliability

The Concept of Measurement

Measurement connects abstract concepts to observable indicators. We want to measure "income," "education," "health," or "political ideology"—but we observe only proxies: tax filings, years of schooling, survey responses about symptoms, or voting behavior.

Definition 2.1 (Measurement): Measurement is the assignment of numbers (or categories) to units according to rules, intended to represent the magnitude of a property.

Validity

Definition 2.2 (Validity): A measure is valid if it captures the concept it's intended to measure.

Types of validity:

Construct validity: Does the measure capture the underlying construct?

  • Does a standardized test measure "intelligence" or test-taking skill?

  • Does self-reported happiness measure well-being?

Box: Goodhart's Law and Campbell's Law—When Measures Become Targets

Two related principles warn that measurement validity can deteriorate when measures are used for high-stakes decisions:

Goodhart's Law (1975): "When a measure becomes a target, it ceases to be a good measure."

Campbell's Law (1979): "The more any quantitative social indicator is used for social decision-making, the more subject it will be to corruption pressures and the more apt it will be to distort and corrupt the social processes it is intended to monitor."

Examples:

  • Schools: When test scores determine funding, teachers "teach to the test"—scores rise but learning may not

  • Hospitals: When mortality rates affect rankings, hospitals avoid high-risk patients or reclassify deaths

  • Police: When arrest quotas are targets, officers make more arrests but do not necessarily reduce crime

  • Academia: When citation counts determine careers, gaming citations becomes rational

The mechanism: A measure initially correlates with the construct because behavior optimizes the construct, not the measure. Once the measure becomes a target, behavior optimizes the measure directly, breaking the correlation.

Implications for research:

  1. Be cautious using high-stakes administrative measures as outcomes

  2. Prefer measures that are difficult to manipulate

  3. Use multiple measures to triangulate

  4. Consider whether your measurement itself might change behavior

Content validity: Does the measure cover all relevant aspects of the construct?

  • Does a measure of "socioeconomic status" capture wealth, income, education, and occupation?

Criterion validity: Does the measure correlate with other measures it should correlate with?

  • Predictive: Does it predict future outcomes? (SAT predicting college GPA)

  • Concurrent: Does it correlate with other measures of the same thing?

Face validity: Does the measure obviously relate to the concept?

  • Least rigorous; necessary but not sufficient

Reliability

Definition 2.3 (Reliability): A measure is reliable if it produces consistent results under consistent conditions.

Types of reliability:

Test-retest reliability: Does the measure give the same result when repeated?

  • Ask the same question twice; do answers agree?

Inter-rater reliability: Do different measurers agree?

  • Two coders categorizing text; do they assign the same codes?

Internal consistency: Do multiple items measuring the same construct correlate?

  • Cronbach's alpha for survey scales
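
As a concrete illustration of internal consistency, a minimal sketch (synthetic data) that computes Cronbach's alpha for a multi-item scale:

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for an (n_respondents, k_items) matrix of item scores."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)        # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)    # variance of the summed scale
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Synthetic 5-item scale: items share a common latent factor plus independent noise,
# so the items correlate and alpha should be comfortably above zero.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 1))
items = latent + rng.normal(size=(500, 5))
print(round(cronbach_alpha(items), 2))   # roughly 0.8 with these settings
```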

The Validity-Reliability Relationship

A measure can be:

  • Reliable but not valid: Always gives the same wrong answer

  • Valid but not reliable: On average correct but noisy

  • Neither: Random and wrong

  • Both: The goal

Figure 2.1: Validity and reliability illustrated with target diagrams. Each dot represents one measurement; the center cross marks the true value. A valid measure hits the truth on average; a reliable measure produces consistent results. The first panel shows the ideal: both valid and reliable. The second shows high reliability (tight clustering) but bias (off-center). The third shows validity (centered on truth) but low reliability (high variance). The fourth shows neither property.

Figure 2.1 uses target diagrams to illustrate these concepts. Think of each measurement as throwing a dart at a target, where the bullseye represents the true value. Reliability means your throws cluster tightly together. Validity means they cluster around the bullseye. You can be consistently wrong (reliable but invalid—panel 2), on-average right but noisy (valid but unreliable—panel 3), or both wrong and inconsistent (panel 4). Only panel 1 achieves the goal of both.

Reliability is necessary but not sufficient for validity.

Implications for Analysis

Measurement error in outcome variables (Y measured with error):

  • Increases noise, reduces precision

  • Doesn't bias coefficients if error is classical (uncorrelated with X)

Measurement error in treatment or explanatory variables (X or D measured with error):

  • In bivariate regression, classical measurement error biases the coefficient toward zero (attenuation bias)

  • In multivariate regression, the situation is more complex:

    • The mismeasured variable's coefficient is still attenuated

    • But coefficients on other covariates can be biased in any direction (up or down), depending on correlations between regressors

    • This makes the direction of overall bias often indeterminate

  • Non-classical error (correlated with true value) can bias in any direction regardless of model complexity

Warning: The simple "attenuation bias" intuition from introductory econometrics applies only to bivariate regression. In applied work with multiple controls, measurement error effects are generally unpredictable without strong assumptions about the covariance structure.
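
The bivariate attenuation result is easy to verify by simulation. A minimal sketch (synthetic data, numpy only) in which classical error in X shrinks the OLS slope by the reliability ratio Var(true X) / (Var(true X) + Var(error)):

```python
import numpy as np

rng = np.random.default_rng(42)
n, beta = 100_000, 2.0

x_true = rng.normal(size=n)                  # true regressor, variance 1
y = beta * x_true + rng.normal(size=n)       # outcome depends on the true X

sigma_u = 1.0                                # measurement-error standard deviation
x_obs = x_true + rng.normal(scale=sigma_u, size=n)   # observed, error-ridden X

def ols_slope(x, y):
    """Bivariate OLS slope: cov(x, y) / var(x)."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

reliability = 1.0 / (1.0 + sigma_u**2)       # Var(true X) / (Var(true X) + Var(error)) = 0.5
print("slope with true X: ", round(ols_slope(x_true, y), 2))   # about 2.0
print("slope with noisy X:", round(ols_slope(x_obs, y), 2))    # about 2.0 * 0.5 = 1.0
print("predicted slope:   ", round(beta * reliability, 2))
```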

Proxy variables: Using an observable proxy for an unobserved concept:

  • May introduce bias if proxy imperfectly captures concept

  • Bias direction depends on the relationship between proxy and true variable


2.3 Missing Data

Types of Missing Data

Definition 2.4 (Missing Data Mechanisms):

  • MCAR (Missing Completely at Random): Probability of missing is independent of observed and unobserved values

  • MAR (Missing at Random): Probability of missing depends on observed but not unobserved values

  • MNAR (Missing Not at Random): Probability of missing depends on the unobserved value itself

MCAR example: Survey responses lost due to random computer failure.

MAR example: Higher-income respondents are less likely to report income, but among people with the same observed education and occupation, missingness is random.

MNAR example: People with the highest incomes refuse to report income because it's high.

Implications

MCAR: Complete-case analysis is unbiased (though inefficient). Simple imputation is valid.

MAR: Complete-case analysis may be biased. Imputation conditional on observed variables is valid.

MNAR: No general solution. Requires modeling the selection process or sensitivity analysis.

Box: MNAR and Partial Identification (Manski Bounds)

When data are MNAR, point identification of population parameters is generally impossible without untestable assumptions about the selection process. However, we can often obtain bounds—a range of values consistent with the data and minimal assumptions.

The key insight (Manski, 1989, 2003): Without assumptions about missing values, we can still bound parameters using worst-case reasoning. If we want to estimate E[Y] and some Y values are missing:

$$E[Y]^{\text{lower}} = E[Y \mid \text{observed}] \cdot P(\text{observed}) + Y_{\min} \cdot P(\text{missing})$$

$$E[Y]^{\text{upper}} = E[Y \mid \text{observed}] \cdot P(\text{observed}) + Y_{\max} \cdot P(\text{missing})$$

Example: A survey asks about income but 20% refuse to answer. Observed mean income is $60,000. If income is bounded between $0 and $500,000:

  • Lower bound: 0.8 × $60,000 + 0.2 × $0 = $48,000

  • Upper bound: 0.8 × $60,000 + 0.2 × $500,000 = $148,000

These worst-case bounds are often wide but are honest—they reflect our genuine uncertainty under MNAR.
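
A minimal sketch that computes these worst-case bounds, using the numbers from the example above:

```python
def manski_bounds(mean_observed, p_missing, y_min, y_max):
    """Worst-case bounds on E[Y]: fill missing values with y_min (lower) or y_max (upper)."""
    p_obs = 1.0 - p_missing
    lower = p_obs * mean_observed + p_missing * y_min
    upper = p_obs * mean_observed + p_missing * y_max
    return lower, upper

# Income example: 20% nonresponse, observed mean $60,000, income assumed in [$0, $500,000]
lo, hi = manski_bounds(mean_observed=60_000, p_missing=0.20, y_min=0, y_max=500_000)
print(lo, hi)   # 48000.0 148000.0
```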

Tightening bounds: Additional assumptions narrow the bounds:

  • Monotone selection: If we believe high-income people are more likely to refuse, we can rule out lower values for missing observations

  • Instrumental variables: Variables affecting selection but not outcomes can help (Chapter 12)

  • Exclusion restrictions: Prior knowledge about the selection process

Connection to sensitivity analysis: Rather than choosing one imputation model, bounds show the range of conclusions consistent with different assumptions. This is more honest than pretending we know the selection process.

See Chapter 17 for full treatment of partial identification and bounds.

Strategies

Complete-case analysis: Analyze only observations with no missing values.

  • Simple but wasteful; biased if not MCAR

Mean/mode imputation: Replace missing values with sample mean.

  • Preserves means but distorts variances and correlations

Regression imputation: Predict missing values from observed variables.

  • Better than mean imputation; still understates uncertainty

Multiple imputation: Generate multiple completed datasets, analyze each, combine results.

  • Accounts for imputation uncertainty

  • Requires MAR or explicit selection model

Maximum likelihood: Estimate parameters using all available information.

  • Efficient under MAR

  • Requires correctly specified model
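
To see concretely why the mechanism matters for complete-case analysis, a minimal simulation sketch (synthetic income data, numpy only) comparing MCAR and MNAR nonresponse:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
income = rng.lognormal(mean=10.5, sigma=0.8, size=n)   # synthetic "true" incomes

# MCAR: every respondent refuses with the same 20% probability
mcar_mask = rng.random(n) < 0.20

# MNAR: refusal probability jumps for the top 20% of earners
refuse_prob = 0.05 + 0.50 * (income > np.quantile(income, 0.80))
mnar_mask = rng.random(n) < refuse_prob

print(f"true mean:           {income.mean():,.0f}")
print(f"complete-case, MCAR: {income[~mcar_mask].mean():,.0f}")   # close to the truth
print(f"complete-case, MNAR: {income[~mnar_mask].mean():,.0f}")   # biased downward
```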

Sample Selection

Missing data due to sample selection is particularly problematic:

  • Attrition in panels: People drop out of longitudinal studies

  • Survey nonresponse: Some populations hard to reach or refuse

  • Administrative truncation: Only see people who interact with the system

Selection models (Heckman correction) attempt to address this but require strong assumptions (Chapter 11 discusses selection on observables; Chapter 17 discusses bounds).


2.4 Data Quality Assessment

Red Flags

Impossibilities: Values outside logical range (negative ages, future dates, percentages over 100)

Implausibilities: Values that are logically possible but extremely unlikely (claiming 168 hours worked per week)

Heaping: Excessive clustering at round numbers (ages reported as 30, 40, 50)

Inconsistencies: Internal contradictions (unmarried but married last year; child older than parent)

Outliers: Extreme values that may be errors or genuine but influential observations
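
A minimal sketch (pandas assumed; the column names and demo values are hypothetical) of screening a person-level file for several of these red flags:

```python
import pandas as pd

def red_flag_report(df: pd.DataFrame) -> dict:
    """Count basic red flags. Columns age, hours_per_week, parent_age are hypothetical."""
    return {
        # Impossibilities: values outside the logical range
        "negative_age": int((df["age"] < 0).sum()),
        "hours_over_168": int((df["hours_per_week"] > 168).sum()),
        # Heaping: share of ages reported at exact multiples of 10
        "age_heaping_share": float((df["age"] % 10 == 0).mean()),
        # Inconsistencies: internal contradictions across fields
        "child_older_than_parent": int((df["age"] >= df["parent_age"]).sum()),
    }

demo = pd.DataFrame({
    "age": [30, 40, -2, 35, 50],
    "hours_per_week": [40, 200, 38, 45, 40],
    "parent_age": [55, 70, 30, 60, 49],
})
print(red_flag_report(demo))   # flagged cases are candidates for investigation, not deletion
```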

Validation Strategies

Cross-source validation: Compare measure to alternative source

  • Survey-reported income vs. administrative records

  • Self-reported health vs. medical claims

Predictive validity: Does the measure predict outcomes it should predict?

  • Educational attainment should predict earnings

Known-group validity: Does the measure differentiate groups it should differentiate?

  • A depression scale should show higher scores among diagnosed patients

Text-matching and creative proxies: When direct measurement is impossible, creative approaches can reveal information about unobserved quantities.

Example: Measuring Quality of Unfunded Research (Li 2017)

How do you measure the quality of research that was never funded and therefore never conducted? Li (2017) faced this problem when studying NIH peer review. She wanted to know whether grant reviewers could identify high-quality proposals, but couldn't observe what unfunded proposals would have produced.

Her solution: text-matching. She measured the textual similarity between unfunded proposals and subsequently published research. If an unfunded proposal closely matched later publications, that suggested the rejected idea was actually good: someone else pursued it. This proxy allowed her to assess whether "near-miss" applications (just below the funding threshold) contained valuable ideas that reviewers failed to recognize.

This exemplifies a broader principle: when the variable you want is unobservable, creative proxy construction can make the invisible visible. The key is validating that your proxy actually captures what you claim.
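
A minimal sketch of the text-matching idea (toy documents, scikit-learn assumed; this is not Li's actual pipeline): represent the proposal and later publications as TF-IDF vectors and score their similarity.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-ins for an unfunded proposal and two later publication abstracts
proposal = "gene editing approach to correct mutations causing inherited retinal disease"
publications = [
    "crispr based correction of retinal gene mutations in inherited blindness",
    "survey evidence on consumer credit and household balance sheets",
]

vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform([proposal] + publications)

# Cosine similarity of the proposal (row 0) to each publication (remaining rows);
# a high score suggests the rejected idea resembles work that was later published
scores = cosine_similarity(matrix[0], matrix[1:])[0]
print(scores.round(2))   # first publication should score far higher than the second
```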

Sensitivity checks: How much do results change with different treatments of potential errors?

Documentation

Good data come with good documentation:

Codebook: Variable definitions, coding schemes, valid values

Technical documentation: Sampling procedures, weighting, questionnaire

Data dictionary: File structure, variable names, formats

User guide: How to use the data properly

Without documentation, data are nearly useless—or worse, actively misleading.


2.5 Linking and Constructing Data

Record Linkage

Linking records across datasets multiplies analytic possibilities but introduces new challenges.

Exact matching: Use unique identifiers (SSN, ID numbers)

  • Best case; rare outside administrative data

Probabilistic matching: Use multiple fields (name, birthdate, address) to find likely matches

  • Generates false matches and misses true matches

  • Requires tuning and validation

Linkage error: Both false positives (wrong matches) and false negatives (missed matches) can bias analysis
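
A minimal sketch of probabilistic matching (standard-library Python; the field names and the 0.85 threshold are illustrative assumptions, not a recommended rule):

```python
from difflib import SequenceMatcher

def name_similarity(a: str, b: str) -> float:
    """Crude string similarity in [0, 1]; real linkage uses Jaro-Winkler, blocking, etc."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_score(rec_a: dict, rec_b: dict) -> float:
    """Combine evidence across fields into a single linkage score."""
    score = name_similarity(rec_a["name"], rec_b["name"])
    if rec_a["birth_year"] == rec_b["birth_year"]:   # exact agreement on birth year adds weight
        score += 0.5
    return score

a = {"name": "Jonathon Smith", "birth_year": 1962}
b = {"name": "Jonathan Smyth", "birth_year": 1962}
print(match_score(a, b) > 0.85)   # likely match despite spelling differences; tune and validate
```

Both the similarity measure and the threshold generate false positives and false negatives, which is why linkage error itself should be treated as a data quality issue.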

Panel Construction

Longitudinal data track units over time, but:

Attrition: Units drop out

Refreshment: New units added (may not be comparable)

Inconsistency: Definitions change over time

Gaps: Observations missing for some periods

Sample Definition

Defining the analysis sample requires choices:

Population definition: Who is "in" the study population?

  • Working-age adults? Registered voters? Firms with >100 employees?

Temporal boundaries: What time period?

  • Calendar years? Cohorts? Event time?

Exclusions: Who is excluded and why?

  • Missing key variables? Outliers? Specific subgroups?

Each choice affects interpretation. Document and justify.


2.6 Data Collection: Practical Considerations

Primary vs. Secondary Data

Primary data: Collected by you for your research purpose

  • Full control over design

  • Expensive and time-consuming

Secondary data: Collected by others for other purposes

  • Cheaper and faster

  • Limited to what others chose to collect

Most academic research uses secondary data; understanding their origins matters for interpreting results.

Data Access

Public-use data: Freely available (often with registration)

  • Census microdata, IPUMS, many survey datasets

Restricted-access data: Requires application, approval, security protocols

  • Tax records, health records, confidential business data

  • May require working in secure facilities

Commercial data: Purchased from vendors

  • Financial data, consumer behavior, proprietary surveys

Web scraping: Collecting data from websites

  • Legal and ethical gray areas

  • Data quality uncertain

Ethics

Data about people raises ethical obligations:

Informed consent: Did people agree to be studied?

  • Not always possible (administrative data, historical records)

  • IRB review assesses risks and protections

Privacy: Can individuals be identified?

  • De-identification may not prevent re-identification

  • Cell sizes, rare characteristics, linkage attacks

Harm: Could research results harm subjects?

  • Stigmatization of groups

  • Policy changes affecting vulnerable populations

Data security: Are data stored and transmitted safely?


2.7 Qualitative Bridge: Documents, Interviews, Observation

Qualitative Data Sources

Quantitative data aren't the only source of knowledge. Qualitative approaches provide:

Documents: Letters, reports, meeting minutes, newspapers, speeches

  • Rich contextual information

  • Reveal reasoning and decision processes

  • Subject to availability and selection

Interviews: Structured or unstructured conversations

  • Access to subjective experience

  • Can probe unexpected directions

  • Interviewer effects, social desirability concerns

Observation: Direct witnessing of behavior

  • What people do, not just what they say

  • Hawthorne effects, observer presence changes behavior

  • Limited to observable phenomena

Complementarity with Quantitative Data

Quantitative strengths: Generalization, precision, testing, comparison

Qualitative strengths: Depth, context, discovery, explanation

Combined approaches:

  • Use qualitative research to identify variables for quantitative study

  • Use quantitative patterns to select cases for qualitative investigation

  • Triangulate findings across methods (Chapter 23)

Example: Understanding Survey Responses

A survey measures "job satisfaction" on a 1-5 scale. But what do respondents mean when they answer?

Quantitative alone: Correlate satisfaction with wages, hours, tenure

Qualitative complement: Interview workers about what they consider when rating satisfaction

The qualitative work reveals that "job satisfaction" means different things to different people—some emphasize pay, others autonomy, others relationships. This informs interpretation of quantitative results.


2.8 Running Example: China's Growth Data

The Challenge

Measuring China's post-1978 economic growth involves all the data challenges discussed in this chapter:

Administrative data quality: Chinese official statistics are produced by agencies with incentives to overreport growth. Provincial GDP numbers often don't aggregate to national totals. Researchers debate whether official statistics can be trusted.

Measurement issues:

  • What price deflators to use when relative prices change dramatically?

  • How to measure output in sectors transitioning from plan to market?

  • Service sector notoriously hard to measure

Missing data: Pre-reform data are incomplete; some series were not collected or published; war and political disruption created gaps.

Alternative sources: Researchers have used satellite nighttime lights (proxy for economic activity), electricity consumption, trade partner data (Chinese exports reported by importers), and physical output measures to validate official statistics.

Assessment Strategies

Cross-validation: Compare official GDP with electricity consumption—the relationship was stable in other countries at similar development stages. In China, it's anomalous, suggesting measurement issues.

Partner country data: China's reported exports to Hong Kong can be compared to Hong Kong's reported imports from China. Discrepancies reveal under- or over-invoicing.

Physical output: Agricultural yields, industrial production (tons of steel, cement) provide cross-checks on value measures.

Expert assessment: Economists have developed adjusted series (Young 2003, Holz 2014) that attempt to correct known biases.

Implications

The uncertainty in Chinese data affects all subsequent analysis:

  • Growth rates could be overstated by 1-2 percentage points annually

  • TFP growth estimates depend heavily on assumptions

  • Policy conclusions must acknowledge data limitations

This illustrates a general principle: sophisticated analysis cannot overcome poor data. Understanding data quality is the foundation of credible empirical work.


Practical Guidance

Choosing Data Sources

| Need | Recommended Source |
| --- | --- |
| Large sample, long panel | Administrative data |
| Subjective measures, attitudes | Survey data |
| Causal identification | Experimental data |
| Historical questions | Archival data |
| Behavioral detail | Digital trace data |
| Context and meaning | Qualitative data |

Common Pitfalls

Pitfall 1: Taking data at face value Assuming data accurately represent what they claim without investigating measurement.

How to avoid: Read documentation; understand data collection; validate against other sources.

Pitfall 2: Ignoring missing data Dropping observations with missing values without considering selection.

How to avoid: Assess missing data mechanism; use appropriate imputation; report sensitivity.

Pitfall 3: Mechanical data cleaning Dropping outliers or "impossible" values without understanding why they occur.

How to avoid: Investigate outliers; they may be errors, but may also be informative; report decisions.

Pitfall 4: Definition drift Using variables whose definitions changed over time without adjustment.

How to avoid: Read documentation carefully; harmonize definitions; test for discontinuities.

Data Quality Checklist

Before analysis, confirm that you have:

  • Read the codebook and technical documentation

  • Screened for impossibilities, implausibilities, heaping, and inconsistencies

  • Assessed the extent and likely mechanism of missing data

  • Checked that variable definitions are consistent over the study period

  • Cross-validated key measures against an alternative source where possible

  • Documented sample restrictions and cleaning decisions


Summary

Key takeaways:

  1. Data come in many forms: Administrative, survey, experimental, observational, and digital trace data each have strengths and weaknesses.

  2. Validity and reliability are distinct: A measure can be consistently wrong (reliable but not valid) or on-average right but noisy (valid but not reliable).

  3. Missing data mechanisms matter: MCAR, MAR, and MNAR require different approaches; ignoring missing data can bias results.

  4. Data quality requires active assessment: Don't trust; verify. Cross-validate, check documentation, investigate anomalies.

  5. Data construction involves choices: Sample definition, variable construction, and handling of problems all affect results. Document and justify.

  6. Qualitative data complement quantitative: Documents, interviews, and observation provide context, meaning, and validation.

Returning to the opening question: Reliable information about the social world comes from understanding where data originate, how they're measured, and what can go wrong. No data are perfect; the goal is to understand limitations and their implications for analysis. Sophisticated methods cannot compensate for poor data, but careful attention to data quality enables credible inference.


Further Reading

Essential

  • Groves et al. (2009), Survey Methodology - Comprehensive treatment of survey data

  • Einav and Levin (2014), "Economics in the Age of Big Data" - Administrative and digital data

For Deeper Understanding

  • Little and Rubin (2019), Statistical Analysis with Missing Data - Missing data methods

  • Bound, Brown, and Mathiowetz (2001), "Measurement Error in Survey Data" - Handbook chapter

  • Herzog, Scheuren, and Winkler (2007), Data Quality and Record Linkage Techniques - Linkage methods

Advanced/Specialized

  • Christen (2012), Data Matching - Probabilistic linkage methods

  • Salganik (2018), Bit by Bit: Social Research in the Digital Age - Digital data for social science

  • van der Laan and Rose (2018), Targeted Learning in Data Science - Missing data and causal inference

Applications

  • Chetty et al. (2016), "The Effects of Exposure to Better Neighborhoods on Children" - Administrative data exemplar

  • Holz (2014), "The Quality of China's GDP Statistics" - Data quality assessment

  • Meyer, Mok, and Sullivan (2015), "Household Surveys in Crisis" - Survey data quality trends

  • Li (2017), "Expertise versus Bias in Evaluation: Evidence from the NIH" - Innovative text-matching approach to measure quality of unobserved counterfactuals


Exercises

Conceptual

  1. Explain the difference between validity and reliability using the example of a bathroom scale. What would it mean for the scale to be (a) reliable but not valid, (b) valid but not reliable?

  2. Why is missing not at random (MNAR) particularly problematic? Give an example from survey data where MNAR is likely.

  3. What are the tradeoffs between administrative data and survey data for measuring income inequality?

Applied

  1. Choose a publicly available dataset (e.g., from ICPSR or a government statistical agency):

    • Locate and read the documentation

    • Identify potential measurement issues

    • Assess the extent of missing data

    • Write a brief (1 page) data quality assessment

  2. Using a dataset with income measured by both administrative records and survey self-report:

    • Compare the distributions

    • Calculate the correlation

    • Identify patterns in discrepancies (who under/over-reports?)

Discussion

  1. A colleague argues: "Administrative data are always better than survey data because they're not subject to response bias." Critique this claim.
