Chapter 5: Survey Methods

Opening Question

How do we learn about millions of people by talking to only a few thousand?


Chapter Overview

Surveys are the workhorses of social science data collection. Want to know unemployment rates, political opinions, health behaviors, or consumer confidence? You probably need a survey. Yet surveys are also fragile instruments: small changes in question wording produce large changes in answers, response rates have plummeted, and the people who respond may differ systematically from those who don't.

This chapter covers survey methodology—the science of asking questions and generalizing from samples to populations. We'll examine sampling design (how to select respondents), questionnaire design (how to ask questions), and survey operations (how to conduct fieldwork). The goal is to make you a sophisticated consumer and critic of survey data, able to assess quality and understand limitations.

What you will learn:

  • How probability sampling enables inference from sample to population

  • The effects of stratification, clustering, and complex designs on precision

  • Principles of questionnaire design and common question pitfalls

  • Sources of survey error: nonresponse, measurement error, mode effects

  • How to assess survey quality and when to trust survey estimates

Prerequisites: Chapter 3 (Statistical Foundations)

Surveys Beyond Academia

While this chapter focuses on surveys for research, surveys pervade modern life far beyond academic inquiry:

Market research: Companies constantly survey consumers about product preferences, brand awareness, and satisfaction. A new product launch might involve extensive survey research to understand demand, optimize pricing, and identify target demographics. Market research firms maintain large respondent panels, typically recruited through opt-in mechanisms.

Political polling: Election forecasts, approval ratings, and issue polling shape political strategy and media coverage. Polling organizations face intense scrutiny of their methods, especially after high-profile misses.

Litigation: Surveys serve as evidence in trademark disputes (consumer confusion), discrimination cases (attitudes and experiences), and class actions (calculating damages). Courts apply specific standards for survey admissibility (see Diamond, Reference Guide on Survey Research).

Price indices: The Consumer Price Index relies on surveys of what consumers purchase and where they shop. These expenditure surveys determine CPI weights—which matter enormously for inflation-indexed contracts, Social Security adjustments, and monetary policy.

Quality measurement: Patient satisfaction (HCAHPS), customer experience (NPS), and employee engagement surveys drive organizational decisions and sometimes determine reimbursement rates.

Implementation: Modern surveys typically use platforms like Qualtrics, SurveyMonkey, or Typeform for design and administration. Respondents may come from panel companies (e.g., Dynata, Prolific), probability-based panels (e.g., NORC's AmeriSpeak), or convenience samples. The choice of panel profoundly affects data quality and generalizability (see Section 5.3).


Historical Context: The Birth of Scientific Sampling

Modern survey sampling emerged from a remarkable intellectual shift in the early 20th century.

Before probability sampling (pre-1930s): Surveys used "quota sampling" or "purposive selection"—interviewers chose respondents to match known population characteristics. The Literary Digest famously predicted Landon would defeat Roosevelt in 1936 based on 2.4 million responses. They were catastrophically wrong; their sample (from telephone directories and car registrations) overrepresented wealthy Republicans.

Jerzy Neyman (1934) established the mathematical foundations of probability sampling, proving that random selection plus appropriate weighting yields unbiased population estimates with quantifiable uncertainty.

The U.S. Census Bureau and Gallup Organization pioneered practical implementation in the 1930s-40s. George Gallup correctly predicted Roosevelt's victory in 1936 using a far smaller but more carefully designed quota sample, demonstrating that sample design matters more than sheer size.

The survey revolution: By mid-century, probability sampling became standard for government statistics (Current Population Survey, 1940s), market research, and academic social science.

Contemporary challenges: Response rates have fallen from 70%+ in the 1970s to under 10% for many surveys today. The rise of cell phones disrupted telephone sampling. Online panels offer convenience but raise selection concerns. The field continues to evolve.

Major U.S. Government Surveys

Understanding the major federal statistical surveys is essential for empirical researchers. These surveys differ in purpose, design, and relationship to the decennial Census.

The Decennial Census (every 10 years) attempts to count every person in the United States. It provides the "gold standard" population counts and serves as the sampling frame for other surveys. The Census also yields detailed demographic data for small geographic areas. Its limitation is frequency—by year 5, the data may be outdated.

The American Community Survey (ACS) replaced the Census long form in 2005. It samples about 3.5 million households annually, providing continuous measurement of demographic, economic, housing, and social characteristics. Unlike the decennial Census, the ACS is a sample (about 1% of the population), so estimates for small areas have larger margins of error. The ACS bridges Census years, providing 1-year estimates for areas with 65,000+ population and 5-year estimates for smaller areas.

The Current Population Survey (CPS) is the source of monthly unemployment statistics. It surveys about 60,000 households monthly, focusing on labor force status. Annual supplements (March ASEC, school enrollment, voting, etc.) provide additional detail. The CPS pioneered many modern survey techniques and remains the benchmark for labor market data.

The Employment Situation combines CPS household data with the Current Employment Statistics (CES) establishment survey, which samples about 670,000 worksites. This dual approach—asking households about employment and asking businesses about payrolls—provides cross-validation. Discrepancies between the two series (e.g., during economic turning points) generate extensive analysis.

Other major surveys:

  • Survey of Income and Program Participation (SIPP): Panel survey tracking income, program participation, and family dynamics

  • National Health Interview Survey (NHIS): Annual health status and healthcare access

  • American Housing Survey (AHS): Housing conditions and costs

  • Consumer Expenditure Survey (CE): Spending patterns, used for CPI weights

These surveys share common infrastructure. Most use the Census as a sampling frame, employ multi-stage stratified cluster designs, and benefit from decades of methodological refinement. The Bureau of Labor Statistics, Census Bureau, and National Center for Health Statistics maintain quality standards that academic surveys often cannot match.


5.1 Sampling Design and Inference

The Target Population

Every survey begins with a target population—the group about which we want to draw conclusions.

Examples:

  • U.S. adults age 18+ (political polls)

  • Civilian non-institutional population age 16+ (labor force surveys)

  • Households in a city (local surveys)

  • Businesses with 20+ employees (establishment surveys)

Frame population: The list from which we actually sample. Ideally equals the target population; in practice, there are gaps.

Coverage error: Difference between target and frame populations. Examples:

  • Phone surveys miss people without phones

  • Address-based sampling misses the homeless

  • Business registries exclude informal enterprises

Simple Random Sampling (SRS)

Definition 5.1 (Simple Random Sample): A sample of size $n$ is a simple random sample if every subset of $n$ units from the population has equal probability of selection.

Under SRS, the sample mean $\bar{y}$ is an unbiased estimator of the population mean $\mu$: $E[\bar{y}] = \mu$

The variance of the sample mean is: $Var(\bar{y}) = \frac{\sigma^2}{n} \cdot \frac{N-n}{N-1}$

where $\sigma^2$ is the population variance and $\frac{N-n}{N-1}$ is the finite population correction (approximately 1 when $n \ll N$).

Confidence interval for population mean: $\bar{y} \pm z_{\alpha/2} \cdot \sqrt{\frac{s^2}{n}}$
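
A quick numerical sketch of these estimators; the population below is simulated and all numbers are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(42)
N = 10_000                                   # hypothetical population size
population = rng.normal(50, 12, size=N)      # simulated population values

n = 400
sample = rng.choice(population, size=n, replace=False)   # simple random sample

y_bar = sample.mean()                        # unbiased estimator of the population mean
s2 = sample.var(ddof=1)                      # sample variance
fpc = (N - n) / (N - 1)                      # finite population correction
se = np.sqrt(s2 / n * fpc)                   # standard error of the sample mean

z = 1.96                                     # z_{alpha/2} for a 95% interval
ci = (y_bar - z * se, y_bar + z * se)
print(f"mean = {y_bar:.2f}, SE = {se:.2f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```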

Stratified Sampling

Idea: Divide the population into non-overlapping strata (groups); sample independently within each stratum.

Why stratify?

  1. Precision: If strata are homogeneous internally, stratification reduces variance

  2. Guaranteed representation: Ensure all subgroups are represented

  3. Administrative convenience: Different sampling procedures in different areas

Stratified estimator of population mean: $\bar{y}_{st} = \sum_{h=1}^{H} W_h \bar{y}_h$

where $W_h = N_h/N$ is stratum $h$'s population share and $\bar{y}_h$ is the stratum sample mean.

Variance (assuming SRS within each stratum and ignoring finite population corrections): $Var(\bar{y}_{st}) = \sum_{h=1}^{H} W_h^2 \cdot \frac{S_h^2}{n_h}$

Stratification never increases variance (compared to SRS of same total size) and often substantially reduces it.

Worked Example: Stratified vs. SRS

Population: 10,000 employees at a company. We want to estimate mean salary.

  • 8,000 non-managers: mean = 50,000 USD, SD = 10,000

  • 2,000 managers: mean = 100,000 USD, SD = 25,000

True population mean:

$0.8 \times 50000 + 0.2 \times 100000 = 60000 \text{ USD}$

SRS with n=400:

  • Overall SD ≈ 24,600 (combining within-stratum and between-stratum variation)

  • SE(mean) ≈ 24,600 / √400 ≈ 1,230

Stratified with n=400 (proportional allocation: 320 non-managers, 80 managers):

$SE(\bar{y}) = \sqrt{0.8^2 \times \frac{10000^2}{320} + 0.2^2 \times \frac{25000^2}{80}} = \sqrt{200000 + 312500} \approx 716$

Gain from stratification: the SE falls by roughly 40% (from about 1,230 to 716). Stratification is particularly valuable when strata differ substantially in means.
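
The arithmetic above can be reproduced in a few lines. This sketch uses the stated stratum parameters and ignores finite population corrections:

```python
import numpy as np

# SRS vs. proportionally allocated stratified sampling for the salary example.
W = np.array([0.8, 0.2])            # stratum population shares (non-managers, managers)
mu = np.array([50_000, 100_000])    # stratum means
S = np.array([10_000, 25_000])      # stratum standard deviations
n_total = 400

# Overall population variance = within-stratum + between-stratum components
pop_mean = np.sum(W * mu)                                    # 60,000
pop_var = np.sum(W * S**2) + np.sum(W * (mu - pop_mean)**2)  # 605,000,000
se_srs = np.sqrt(pop_var / n_total)                          # ~1,230

# Proportional allocation: n_h = W_h * n
n_h = W * n_total                                            # 320, 80
se_strat = np.sqrt(np.sum(W**2 * S**2 / n_h))                # ~716

print(f"SRS SE ≈ {se_srs:.0f}, stratified SE ≈ {se_strat:.0f}, "
      f"reduction ≈ {100 * (1 - se_strat / se_srs):.0f}%")
```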

Cluster Sampling

Idea: Sample groups (clusters) of units rather than individual units; measure all or some units within selected clusters.

Why cluster?

  1. Cost efficiency: Traveling to interview dispersed individuals is expensive; interviewing everyone in selected neighborhoods is cheaper

  2. No individual frame: May not have a list of individuals, but have a list of schools, villages, or households

Example: To survey U.S. adults, we might:

  1. Sample 100 counties (primary sampling units)

  2. Within selected counties, sample census tracts

  3. Within selected tracts, sample households

  4. Interview adults in selected households

Stratification vs. Clustering: A Crucial Distinction

Stratification and clustering both involve dividing the population into groups, but they work in opposite directions:

Stratification:

  • Divide population into groups (strata)

  • Sample from every stratum

  • Works best when strata are internally homogeneous (similar within) but externally heterogeneous (different between)

  • Effect: Reduces variance

Clustering:

  • Divide population into groups (clusters)

  • Sample some clusters, then sample units within selected clusters

  • Problem: clusters are typically internally homogeneous (geographic neighbors resemble each other), so each additional unit from the same cluster adds less independent information than a unit from a new cluster

  • Effect: Increases variance

The key intuition: stratification guarantees representation of all groups, which stabilizes estimates. Clustering means you might miss entire clusters, introducing extra variability. If the clusters you happen to select are atypical, your estimate will be off.

When both are used together (as in most national surveys), the stratification typically reduces variance while the clustering increases it. The net effect depends on the specific design.

The Design Effect

The design effect measures how the complex design affects precision compared to simple random sampling:

Definition 5.2 (Design Effect, DEFF): The design effect is the ratio of the variance under the actual design to the variance under SRS of the same size:

$\text{DEFF} = \frac{Var(\bar{y}_{\text{complex}})}{Var(\bar{y}_{\text{SRS}})}$

Effective sample size: $n_{\text{eff}} = n / \text{DEFF}$

Typical values:

  • DEFF < 1: Stratification dominates (good)

  • DEFF > 1: Clustering dominates (common in household surveys)

  • DEFF = 2-4 typical for national household surveys with geographic clustering

Worked Example: Intraclass Correlation and Clustering

The problem: Suppose you survey 1,000 students about test anxiety. You could sample 1,000 students from across the country (expensive, logistically difficult) or sample 50 schools and interview 20 students per school (much cheaper). Both give you n = 1,000. Are they equally informative?

No. Students in the same school share experiences—same teachers, same curriculum, similar peer groups. If one student in a school reports high anxiety, others in that school are more likely to as well. This within-cluster similarity means that each additional student from the same school provides less new information than a student from a different school.

The intraclass correlation ($\rho$) quantifies this similarity: the proportion of total variance that lies between clusters (rather than within). The design effect for cluster sampling is approximately:

$\text{DEFF} \approx 1 + (m-1)\rho$

where $m$ is the number of units per cluster.

Scenario: Survey of students, sampling 50 schools with 20 students each (n = 1,000).

If $\rho = 0.05$ (students in the same school are only slightly similar; 5% of the variance is between schools):

$\text{DEFF} = 1 + (20-1) \times 0.05 = 1.95$

Effective sample size: $1000 / 1.95 \approx 513$

The clustering cuts effective precision nearly in half! Your 1,000 respondents provide only as much information as 513 would under simple random sampling.
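
A minimal sketch of this calculation, assuming the approximate DEFF formula above and equal cluster sizes:

```python
def design_effect(m, rho):
    """Approximate DEFF for equal-sized clusters: 1 + (m - 1) * rho."""
    return 1 + (m - 1) * rho

n = 1000          # total respondents
m = 20            # students per school
rho = 0.05        # intraclass correlation

deff = design_effect(m, rho)      # 1.95
n_eff = n / deff                  # ~513
print(f"DEFF = {deff:.2f}, effective sample size ≈ {n_eff:.0f}")

# Effective size of a 1,000-person sample under different clustering scenarios
for icc in (0.01, 0.05, 0.10):
    for cluster_size in (10, 20, 50):
        print(f"rho={icc}, m={cluster_size}: "
              f"n_eff ≈ {n / design_effect(cluster_size, icc):.0f}")
```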

Practical implications:

  • When $\rho$ is larger (more similarity within clusters), the penalty is worse

  • When clusters are larger (more students per school), the penalty is worse

  • Even small ICC values matter when cluster sizes are large

Complex Survey Designs

Real surveys combine stratification, clustering, and unequal selection probabilities:

Multi-stage sampling: Sample PSUs (counties), then secondary units (tracts), then ultimate units (households)

Unequal probabilities: Oversample minorities to ensure adequate subgroup sample sizes; weight to restore representativeness

Survey weights: The inverse of the selection probability, adjusted for nonresponse and post-stratification: $w_i = \frac{1}{\pi_i} \times \text{(nonresponse adjustment)} \times \text{(post-stratification)}$

Weighted estimator: $\bar{y}_w = \frac{\sum_i w_i y_i}{\sum_i w_i}$
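
A minimal sketch of the weighted estimator; the selection probabilities and outcomes are hypothetical, and nonresponse and post-stratification adjustments are omitted:

```python
import numpy as np

pi = np.array([0.01, 0.01, 0.05, 0.05, 0.05])   # hypothetical selection probabilities
y  = np.array([42.0, 38.0, 55.0, 61.0, 58.0])   # observed outcomes

w = 1 / pi                                       # base design weights
y_bar_w = np.sum(w * y) / np.sum(w)              # weighted estimator of the population mean

print(f"unweighted mean = {y.mean():.1f}, weighted mean = {y_bar_w:.1f}")
```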


5.2 Questionnaire Design

The Art and Science of Asking Questions

Question wording matters—a lot. Small changes produce large differences in responses.

Classic example (Schuman and Presser):

  • "Do you think the United States should allow public speeches against democracy?"

  • "Do you think the United States should forbid public speeches against democracy?"

"Allow" and "forbid" are logical opposites, but ~20% of respondents give inconsistent answers between the two framings. "Forbid" sounds harsh; people avoid it.

Question Types

Closed-ended: Respondent chooses from provided options

  • Easier to analyze

  • May miss important response categories

  • Susceptible to order effects

Open-ended: Respondent provides own answer

  • Richer information

  • Expensive to code

  • Higher respondent burden

Rating scales: Likert scales (agree-disagree), semantic differentials, numerical ratings

  • Number of points (5 vs. 7 vs. 11)

  • Labeled vs. unlabeled endpoints

  • Presence of midpoint

Principles of Good Question Design

1. Use simple, familiar language

  • Bad: "What is your current labor force participation status?"

  • Good: "Last week, did you do any work for pay?"

2. Avoid double-barreled questions

  • Bad: "Do you think taxes are too high and government spending should be cut?"

  • Good: Ask separately about taxes and spending

3. Avoid leading questions

  • Bad: "Don't you agree that the economy is doing poorly?"

  • Good: "How would you rate current economic conditions?"

4. Avoid loaded or emotional language

  • Bad: "Do you support killing innocent babies through abortion?"

  • Good: "Do you think abortion should be legal in all cases, legal in some cases, or illegal in all cases?"

5. Provide balanced response options

  • Bad: "Rate your satisfaction: Very satisfied / Satisfied / Somewhat satisfied"

  • Good: "Rate your satisfaction: Very satisfied / Somewhat satisfied / Neither / Somewhat dissatisfied / Very dissatisfied"

6. Ensure questions are answerable

  • Bad: "How many hours did you watch television last month?"

  • Good: "Yesterday, about how many hours did you watch television?"

Response Effects

Acquiescence: Tendency to agree with statements regardless of content

  • Solution: Mix positively and negatively worded items

Social desirability: Tendency to give socially acceptable answers

  • Solution: Indirect questions, list experiments, self-administered modes

Primacy/recency effects: First or last response options may be chosen more often

  • Solution: Randomize option order

Context effects: Earlier questions influence later responses

  • Solution: Careful ordering; separate sensitive topics

Cognitive Testing

Before fielding a survey, test questions through:

Cognitive interviews: Ask respondents to "think aloud" while answering; probe for understanding

Focus groups: Group discussions about question meaning and difficulty

Pilot testing: Small-scale field test; analyze response patterns for problems


5.3 Survey Operations

Data Collection Modes

Face-to-face: Interviewer visits respondent

  • Highest response rates

  • Most expensive

  • Best for complex questionnaires

  • Interviewer effects possible

Telephone: Random digit dialing (RDD) or list-based

  • Lower cost than F2F

  • Response rates declining (screening, cell phones)

  • Cannot show visual materials

Mail/paper: Self-administered questionnaire sent by post

  • Low cost

  • Low response rates

  • Limited questionnaire complexity

  • No interviewer effects

Web/online: Internet-based surveys

  • Very low cost

  • Coverage concerns (not everyone online)

  • Probability panels expensive; convenience samples biased

  • Complex questionnaires possible (branching, multimedia)

Mixed-mode: Combine modes to balance cost, coverage, response

  • Example: Mail invitation with web option, phone follow-up for nonrespondents

Probability Panels vs. Opt-In Panels

The rise of online surveys has created a fundamental distinction in data quality:

Probability-based panels recruit members through probability sampling (e.g., random address-based sampling), then provide internet access if needed. Examples include NORC's AmeriSpeak, Pew's American Trends Panel, and the RAND American Life Panel. These panels support population inference because every member had a known probability of selection.

Opt-in panels recruit volunteers who sign up to take surveys, often for monetary incentives. Examples include Amazon Mechanical Turk, Prolific, and commercial panels from Dynata, Lucid, and others. These samples are convenient and cheap but cannot support population inference—we don't know who chose not to join.

Key differences:

| Feature | Probability Panel | Opt-In Panel |
| --- | --- | --- |
| Recruitment | Random selection | Self-selection |
| Coverage | Near-complete | Biased toward internet-active |
| Population inference | Valid | Invalid |
| Cost per complete | $20-50+ | $1-5 |
| Speed | Moderate | Fast |
| Sample size | Limited by panel | Large |

When to use which:

  • Policy-relevant prevalence estimates: Probability panel required

  • Causal experiments with internal validity focus: Opt-in often acceptable

  • Exploratory or pilot research: Opt-in can be useful

  • Sensitive topics with stigma: Opt-in may get more honest responses (less social pressure)

Online Survey Methodology

Online surveys present distinct methodological challenges:

Attention and engagement: Unlike interviewer-administered surveys, no one ensures respondents are paying attention. Common problems include "speeders" (completing implausibly fast), "straightliners" (giving same response to all items), and bots (automated responses). Researchers increasingly include attention checks, CAPTCHAs, and timing filters.

Device effects: Respondents use phones, tablets, and computers with different screen sizes and interfaces. Mobile respondents may give shorter open-ended responses and show different dropout patterns. Surveys should be mobile-optimized.

Satisficing and survey fatigue: Without interviewer encouragement, respondents take cognitive shortcuts. This manifests as selecting the first reasonable option, ignoring instructions, and abandoning lengthy surveys. Best practice: keep surveys short, vary question formats, and consider incentive design.

Non-naïve respondents: Frequent survey-takers learn what researchers want and may game responses. On platforms like MTurk, the same respondents appear across many studies, raising concerns about treatment effect heterogeneity and cross-study contamination.

Best practices for online surveys:

  1. Pilot extensively with target population

  2. Include attention checks but don't over-filter

  3. Keep surveys under 15 minutes

  4. Use forced-response sparingly (people skip for reasons)

  5. Monitor completion times and flag outliers

  6. Validate against external benchmarks when possible

Mode Effects

The same question asked different ways yields different answers:

Privacy: Sensitive behaviors (drug use, sexual behavior) more reported in self-administered modes

Interviewer presence: Socially desirable responses higher in interviewer modes

Visual vs. aural: Response scale presentation differs between modes

Satisficing: Web respondents may speed through; F2F interviewers maintain engagement

Nonresponse

Nonresponse is the dominant quality concern in modern surveys.

Types:

  • Unit nonresponse: Sampled person doesn't participate at all

  • Item nonresponse: Respondent skips particular questions

Response rate calculation (AAPOR standards): $RR = \frac{\text{Completes}}{\text{Completes} + \text{Partials} + \text{Refusals} + \text{Noncontacts} + \text{Unknown eligibility}}$
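
A small sketch of this calculation with hypothetical field counts (AAPOR defines several response-rate variants; this mirrors the formula above):

```python
def aapor_response_rate(completes, partials, refusals, noncontacts, unknown_eligibility):
    """Response rate in the spirit of the formula above: completes over all
    eligible or possibly eligible cases."""
    denominator = completes + partials + refusals + noncontacts + unknown_eligibility
    return completes / denominator

# Hypothetical field outcomes for a 10,000-case sample
print(f"{aapor_response_rate(900, 50, 2000, 5500, 1550):.1%}")   # 9.0%
```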

Trends: Response rates have fallen dramatically:

  • 1970s telephone surveys: 70-80%

  • 1990s: 50-60%

  • Today: Often under 10%

When Nonresponse Causes Bias

Theorem 5.1 (Nonresponse Bias): The bias in the respondent mean equals:

$\text{Bias}(\bar{y}_{\text{r}}) = (1 - RR) \times (\bar{y}_{\text{r}} - \bar{y}_{\text{nr}})$

where $\bar{y}_{\text{r}}$ is the respondent mean and $\bar{y}_{\text{nr}}$ is the nonrespondent mean.

This formula reveals two key insights. First, bias depends on both the response rate and the difference between respondents and nonrespondents. A 50% response rate with no difference produces no bias; a 90% response rate with a large difference produces substantial bias. Second, we can never directly observe $\bar{y}_{\text{nr}}$; that is the fundamental problem. We must rely on auxiliary information, external benchmarks, or assumptions.

Implications:

  • Low response rate alone doesn't guarantee bias

  • Bias depends on differential nonresponse

  • A 10% response rate with random nonresponse has no bias

  • A 70% response rate with selective nonresponse can have large bias
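
These implications can be made concrete with a short sketch; the response rates and group means below are hypothetical:

```python
def nonresponse_bias(response_rate, mean_respondents, mean_nonrespondents):
    """Bias of the respondent mean, per Theorem 5.1."""
    return (1 - response_rate) * (mean_respondents - mean_nonrespondents)

# Hypothetical outcome: proportion supporting a policy.
# High response rate but selective nonresponse:
print(nonresponse_bias(0.70, mean_respondents=0.55, mean_nonrespondents=0.35))  # 0.06
# Low response rate but (unknowably) random nonresponse:
print(nonresponse_bias(0.10, mean_respondents=0.55, mean_nonrespondents=0.55))  # 0.0
```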

Weighting and Adjustment

Nonresponse adjustment: Reweight respondents to compensate for differential nonresponse

  • Model response propensity; weight by inverse propensity

  • Requires auxiliary data available for respondents and nonrespondents

Post-stratification: Adjust weights so weighted sample matches known population totals

  • Example: Ensure sample matches Census age × gender × race × education distribution

  • Reduces bias if auxiliary variables predict both nonresponse and outcomes

Raking (iterative proportional fitting): Adjust to match multiple marginal distributions simultaneously
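
A minimal raking sketch with made-up respondents and margins; production raking would add weight trimming, convergence diagnostics, and many more margins:

```python
import numpy as np

def rake(weights, margins, targets, n_iter=100, tol=1e-10):
    """Iterative proportional fitting.
    margins: dict var -> array of category codes per respondent
    targets: dict var -> dict category -> known population total"""
    w = np.asarray(weights, dtype=float).copy()
    for _ in range(n_iter):
        max_adj = 0.0
        for var, target in targets.items():
            codes = margins[var]
            for category, pop_total in target.items():
                mask = codes == category
                current = w[mask].sum()
                if current > 0:
                    factor = pop_total / current
                    w[mask] *= factor
                    max_adj = max(max_adj, abs(factor - 1))
        if max_adj < tol:
            break
    return w

# Six respondents with base weight 1; the population (total 100) is 50/50 on
# gender and 60/40 on age. These margins are fabricated for illustration.
gender = np.array(["f", "f", "f", "f", "m", "m"])
age    = np.array(["young", "young", "old", "old", "young", "old"])
w = rake(np.ones(6),
         margins={"gender": gender, "age": age},
         targets={"gender": {"f": 50, "m": 50}, "age": {"young": 60, "old": 40}})
print(w.round(1), w.sum())   # weighted margins now match both targets
```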

Multilevel Regression and Poststratification (MRP)

Traditional post-stratification runs into the "curse of dimensionality": crossing age × gender × race × education × state creates thousands of cells, many with few or no observations. MRP (Gelman and Little 1997; Park, Gelman, and Bafumi 2004) solves this through multilevel modeling.

The MRP approach:

  1. Model the response: Fit a multilevel (hierarchical) regression predicting the outcome from demographics and geography: $Y_i = \alpha_{state[i]} + \beta_{age[i]} + \gamma_{race[i]} + \dots + \varepsilon_i$, with random effects for states, age groups, etc.

  2. Predict for all cells: Use the model to predict $\hat{Y}$ for every demographic × geographic cell, even cells with no respondents.

  3. Poststratify: Weight predictions by known population counts in each cell: $\hat{Y}_{pop} = \sum_j N_j \hat{Y}_j / \sum_j N_j$

Why it works: The multilevel model "borrows strength" across cells. A sparse cell (e.g., Hispanic men aged 65+ in Wyoming) is informed by other cells with similar demographics or geography. This enables adjustment on many more dimensions than traditional weighting.
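
A deliberately simplified sketch of the three steps: simple shrinkage toward the grand mean stands in for a true multilevel model, and the cells, counts, and outcomes are fabricated:

```python
import numpy as np

rng = np.random.default_rng(0)

cells = ["young_stateA", "old_stateA", "young_stateB", "old_stateB"]
pop_counts = {"young_stateA": 400, "old_stateA": 600,      # known population cell counts
              "young_stateB": 300, "old_stateB": 700}
sample = {"young_stateA": rng.normal(0.60, 0.1, 50),        # simulated respondents per cell
          "old_stateA":   rng.normal(0.40, 0.1, 10),
          "young_stateB": rng.normal(0.55, 0.1, 3),         # sparse cell
          "old_stateB":   rng.normal(0.35, 0.1, 40)}

# Step 1 (crudely): shrink each cell mean toward the grand mean; small cells
# are pulled harder, mimicking the partial pooling of a hierarchical model.
grand_mean = np.mean(np.concatenate(list(sample.values())))
k = 10  # shrinkage constant (an arbitrary tuning choice for this sketch)

# Step 2: a prediction for every cell, including sparse ones
cell_pred = {}
for c in cells:
    y = sample[c]
    shrink = len(y) / (len(y) + k)
    cell_pred[c] = shrink * y.mean() + (1 - shrink) * grand_mean

# Step 3: poststratify predictions by known population counts
num = sum(pop_counts[c] * cell_pred[c] for c in cells)
den = sum(pop_counts.values())
print(f"poststratified estimate = {num / den:.3f}")
```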

Applications:

  • State-level opinion: Estimate public opinion for all 50 states from a national sample

  • Subgroup analysis: Estimate outcomes for small demographic groups

  • Non-probability samples: Make opt-in panels more representative by aggressive MRP adjustment

MRP for Opt-In Samples

A controversial but increasingly common use of MRP is to salvage non-probability (opt-in) samples. The idea: if selection into the sample depends on observable characteristics, and if the outcome model is correct, MRP can adjust for selection bias.

Conditions for success:

  • Selection depends on variables included in the model

  • The outcome-covariate relationship is correctly specified

  • Population cell counts are accurately known

Limitations: MRP cannot fix samples with fundamentally different populations (opt-in panelists may differ in unobservable ways from non-joiners). It also requires substantial modeling expertise and can give false confidence when the model is wrong.

Current evidence: Kennedy et al. (2020) show that MRP-adjusted opt-in samples can match probability sample accuracy for some outcomes, but fail badly for others—particularly those related to political engagement (since opt-in panelists are unusually engaged).


5.4 Total Survey Error Framework

The Error Sources

The Total Survey Error (TSE) framework organizes all sources of error:

Representation errors (who is measured):

  • Coverage error: Target ≠ Frame population

  • Sampling error: Frame ≠ Sample

  • Nonresponse error: Sample ≠ Respondents

Measurement errors (how they're measured):

  • Validity: Construct ≠ Operationalization

  • Measurement error: True value ≠ Measured value

Figure 5.1: The Total Survey Error framework organizes error sources into two dimensions: Representation (errors in who is measured, from target population through respondents) and Measurement (errors in how constructs are measured, from abstract concept through recorded response). The final survey statistic reflects the cumulative impact of all error sources.

Figure 5.1 illustrates how errors accumulate at each stage. On the representation side, we start with the target population we want to learn about, but we can only sample from the frame population (coverage error), we only observe a sample (sampling error), and not everyone in the sample responds (nonresponse error). On the measurement side, we start with an abstract construct (e.g., "job satisfaction"), operationalize it as survey questions (validity), record responses that may not reflect true values (measurement error), and process responses into data (processing error). Total survey error is the cumulative impact of all these sources.

Tradeoffs

Survey design involves tradeoffs:

  • Cost vs. quality: Higher response rates are expensive

  • Timeliness vs. thoroughness: Quick surveys sacrifice depth

  • Standardization vs. flexibility: Rigid protocols ensure consistency; adaptive designs improve efficiency

  • Coverage vs. convenience: Probability samples are expensive; convenience samples are cheap but biased

Assessing Survey Quality

Questions to ask about any survey:

  1. What is the target population? Is the frame complete?

  2. What sampling method was used? What is the design effect?

  3. What is the response rate? How was it calculated?

  4. What mode(s) of data collection?

  5. What weighting or adjustments were applied?

  6. Are questionnaire instruments documented and validated?


5.5 Survey Experiments

Survey experiments embed randomized treatments within surveys, combining experimental control with survey-based measurement. They're increasingly common in political science, sociology, and economics.

Types of Survey Experiments

Question wording experiments: Randomly vary how questions are asked

Example: Half of respondents see "Do you favor allowing women to have abortions?" while half see "Do you favor prohibiting abortions?" Different wording, same policy—different support levels reveal framing effects.

Vignette experiments: Present hypothetical scenarios with randomly varied attributes

Example: "A company has 100 employees. [It made $1 million / $100,000 in profits last year.] The CEO proposes [a 10% / 2% wage cut]. How fair is this?" Randomizing profits and cut size identifies their independent effects on fairness perceptions.

Conjoint experiments: Present multi-attribute profiles for comparison

Example: Respondents see two hypothetical job candidates varying on education, experience, gender, and race. "Which would you hire?" Averaging over many choices identifies the effect of each attribute on selection probability.

List experiments (item count technique): Measure sensitive attitudes indirectly

Example: Control group counts how many of 4 innocuous statements they agree with. Treatment group gets the same 4 plus a sensitive item. The difference in counts reveals the prevalence of the sensitive attitude without anyone admitting it directly.
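
A minimal sketch of the difference-in-means analysis for a list experiment; the response counts below are simulated rather than real data:

```python
import numpy as np

rng = np.random.default_rng(1)
control = rng.integers(0, 5, size=500)      # counts over 4 innocuous items (0-4)
treatment = rng.integers(0, 6, size=500)    # counts over 4 innocuous + 1 sensitive item (0-5)

# Estimated prevalence of the sensitive attitude = difference in mean counts
prevalence = treatment.mean() - control.mean()
se = np.sqrt(treatment.var(ddof=1) / len(treatment) +
             control.var(ddof=1) / len(control))
print(f"estimated prevalence = {prevalence:.2f} (SE = {se:.2f})")
```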

Advantages of Survey Experiments

  • Causal identification: Random assignment enables clean causal inference

  • Mechanism isolation: Can vary single factors to identify their effects

  • Sensitive topics: Techniques like list experiments reduce social desirability bias

  • Scale: Can collect thousands of experimental observations cheaply

  • Factorial designs: Can test many factors simultaneously

Limitations

  • Hypothetical scenarios: People may respond differently to vignettes than real situations

  • Demand effects: Respondents may guess the hypothesis

  • External validity: Lab/survey responses may not predict real behavior

  • Sample issues: Online samples may not represent target populations

  • Attention: Respondents may not read scenarios carefully

Box: Conjoint Analysis for Causal Attribution

Conjoint experiments are particularly powerful for studying discrimination and preferences because they:

  1. Randomize multiple attributes simultaneously—avoiding omitted variable bias

  2. Force tradeoffs—revealing relative importance of attributes

  3. Mask purpose—reducing social desirability bias

Example: Hainmueller and Hopkins (2015) study immigration preferences using conjoint. Respondents see pairs of hypothetical immigrants varying on country of origin, education, language, occupation, and other attributes. The experimental design identifies the causal effect of each attribute on admission support—impossible with observational data where attributes correlate.

Implementation: The cjoint package in R implements analysis; Python users can use statsmodels with clustered standard errors by respondent.
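
A sketch of the Python route mentioned above: a linear probability model of choice on randomized attributes with respondent-clustered standard errors. The data, attribute names, and effect sizes are simulated, and a real forced-choice design would constrain choices within each pair:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n_resp, n_tasks = 300, 5
n_rows = n_resp * n_tasks * 2                 # two profiles per task
df = pd.DataFrame({
    "resp_id": np.repeat(np.arange(n_resp), n_tasks * 2),
    "educ": rng.choice(["high_school", "college", "graduate"], size=n_rows),
    "experience": rng.choice(["low", "high"], size=n_rows),
})
# Simulated choices favoring more education and high experience
utility = (df["educ"] != "high_school") * 0.5 + (df["experience"] == "high") * 0.3
df["chosen"] = (utility + rng.normal(0, 1, n_rows) > 0.4).astype(int)

# Coefficients approximate average marginal component effects (AMCEs);
# clustering by respondent accounts for repeated choices per person.
model = smf.ols("chosen ~ C(educ, Treatment('high_school')) "
                "+ C(experience, Treatment('low'))",
                data=df).fit(cov_type="cluster", cov_kwds={"groups": df["resp_id"]})
print(model.params)
```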

Design Considerations

  • Power: Survey experiments often require large samples (1000+ respondents)

  • Balance: Check that randomization succeeded across conditions

  • Attention checks: Include questions to verify respondents are reading carefully

  • Pre-registration: Specify hypotheses and analysis before collecting data

  • Manipulation checks: Verify respondents perceived the treatment as intended


5.6 Running Example: The Current Population Survey

Background

The Current Population Survey (CPS) is the U.S. government's primary source of labor force statistics. Monthly unemployment rates, labor force participation, and earnings data all come from the CPS.

Design

Target population: Civilian non-institutional population age 16+

Sample design:

  • Multi-stage stratified cluster sample

  • ~60,000 households per month

  • Stratified by state and urban/rural

  • PSUs are counties or groups of counties

  • Ultimate sampling unit is address

Rotation: 4-8-4 pattern

  • Household in sample for 4 consecutive months

  • Out 8 months

  • Back in for 4 months

  • Enables monthly change estimates

Mode: Computer-assisted personal interviewing (CAPI) for first and fifth month; computer-assisted telephone interviewing (CATI) for others

Key Questions

Employment status (previous week):

  1. Did you do any work for pay or profit?

  2. Did you have a job from which you were temporarily absent?

  3. Did you do any unpaid work in a family business?

  4. Were you looking for work?

  5. Were you available for work?

These questions operationalize the ILO definition of unemployment:

  • Without work

  • Currently available for work

  • Actively seeking work

Quality Considerations

Strengths:

  • Large sample, precise national estimates

  • Consistent methodology over decades

  • Professional interviewers, quality control

  • Monthly frequency captures dynamics

Limitations:

  • Proxy responses (household member answers for others)

  • Recall period (previous week may not be typical)

  • Definition sensitivity (marginal employment status)

  • Nonresponse adjustment may mask changes

Measurement Issues

The measured unemployment rate is sensitive to:

Reference period: "Last week" anchors responses but may not be typical

Classification rules: Someone who looked for work 5 weeks ago is "not in labor force," not "unemployed"

Proxy reporting: About half of responses are from another household member

Social desirability: Stigma may reduce unemployment reporting

These issues don't necessarily bias trends (they're consistent over time) but affect interpretation of levels.


Practical Guidance

When Surveys Are the Right Tool

| Research Need | Survey Appropriate? | Notes |
| --- | --- | --- |
| Population prevalence | Yes | Core survey strength |
| Attitudes/opinions | Yes | Only surveys measure subjective states |
| Sensitive behaviors | Maybe | Mode matters; consider alternatives |
| Historical behavior | Caution | Recall error increases with time |
| Administrative outcomes | Often no | Administrative data usually better |
| Small populations | Challenging | May need complete enumeration |

Common Pitfalls

Pitfall 1: Ignoring weights. Analyzing survey data without weights yields biased estimates if the sample design involves unequal probabilities or nonresponse adjustment.

How to avoid: Always use provided weights; use survey-aware software (svyset in Stata, survey package in R).

Pitfall 2: Treating convenience samples as representative. Amazon Mechanical Turk, social media polls, and opt-in panels are not probability samples.

How to avoid: Be explicit about sample limitations; don't claim population inference.

Pitfall 3: Ignoring design effects. Using formulas for simple random sampling when the design is clustered understates standard errors.

How to avoid: Use survey-aware procedures that account for stratification and clustering.

Pitfall 4: Over-interpreting small changes. Month-to-month changes in surveys are noisy; apparent movements may reflect sampling error rather than real change.

How to avoid: Report confidence intervals; focus on trends rather than single-month changes.

Survey Quality Checklist


Qualitative Bridge

Surveys and Qualitative Methods

Surveys excel at measuring what and how many. They're less good at explaining why or capturing nuance. Qualitative methods complement surveys:

Before the survey (formative research):

  • Focus groups to understand how respondents think about topics

  • Cognitive interviews to test question comprehension

  • Ethnography to identify relevant constructs

After the survey (interpretive research):

  • In-depth interviews to explain survey patterns

  • Case studies to understand mechanisms behind correlations

  • Observation to validate self-reports

Example: Health Surveys

The National Health Interview Survey finds that 15% of adults report frequent anxiety.

Survey strengths: Prevalence estimate, demographic patterns, trends

Survey limitations: What does "frequent anxiety" mean to respondents? How does it affect daily life? What coping strategies do people use?

Qualitative complement: In-depth interviews with anxious individuals reveal the lived experience, coping mechanisms, and healthcare-seeking that explain survey patterns.


Integration Note

Connections to Other Methods

| Method | Relationship | See Chapter |
| --- | --- | --- |
| Data Quality | Survey quality is a special case | Ch. 2 |
| RCTs | Survey outcomes in experiments | Ch. 10 |
| Selection | Survey nonresponse is selection | Ch. 11 |
| Heterogeneity | Subgroup estimates require appropriate design | Ch. 20 |

Triangulation Strategies

Survey estimates gain credibility when:

  1. Administrative validation: Survey reports match administrative records

  2. Cross-survey consistency: Different surveys yield similar estimates

  3. Qualitative alignment: In-depth research explains survey findings

  4. Trend plausibility: Changes are consistent with known events

  5. Benchmark alignment: Totals match Census or other benchmarks


Summary

Key takeaways:

  1. Probability sampling enables inference from sample to population with quantifiable uncertainty. Non-probability samples cannot support population claims.

  2. Stratification increases precision; clustering decreases it. The design effect summarizes the net impact.

  3. Question wording matters enormously. Small changes produce large differences in responses.

  4. Nonresponse is the dominant quality concern in modern surveys. Low response rates don't necessarily cause bias, but differential nonresponse does.

  5. The Total Survey Error framework organizes error sources. Multiple types of error can compound or offset.

  6. Always use weights and survey-aware analysis when working with complex survey data.

Returning to the opening question: We learn about millions from thousands through probability sampling—random selection that gives every population member a known chance of inclusion. Combined with proper weighting and careful design, this enables valid inference. But surveys are fragile: question wording, nonresponse, mode effects, and coverage gaps all threaten validity. Sophisticated users understand these limitations and interpret accordingly.


Further Reading

Essential

  • Groves et al. (2009), Survey Methodology - The definitive textbook

  • Tourangeau, Rips, and Rasinski (2000), The Psychology of Survey Response - Cognitive foundations

For Deeper Understanding

  • Lohr (2010), Sampling: Design and Analysis - Technical sampling theory

  • Dillman, Smyth, and Christian (2014), Internet, Phone, Mail, and Mixed-Mode Surveys - Practical design guide

  • Biemer and Lyberg (2003), Introduction to Survey Quality - Error framework

  • Stantcheva (2023), "How to Run Surveys: A Guide to Creating Your Own Identifying Variation and Revealing the Invisible" - Annual Review of Economics, comprehensive guide for economists conducting original surveys

Historical/Methodological

  • Converse (1987), Survey Research in the United States - History of the field

  • Krosnick (1999), "Survey Research" - Annual Review overview

  • AAPOR (American Association for Public Opinion Research) - Standards and best practices

Special Topics

  • Diamond (2011), "Reference Guide on Survey Research" in Reference Manual on Scientific Evidence (National Academies Press) - Legal standards for survey evidence; essential reading for litigation applications

  • Bureau of Labor Statistics CPS documentation - Example of high-quality government survey

  • Pew Research Center methodology reports - Accessible quality discussions

  • Meyer, Mok, and Sullivan (2015), "Household Surveys in Crisis" - Quality trends


Exercises

Conceptual

  1. Explain why stratification always (weakly) improves precision compared to SRS, while clustering always (weakly) reduces it. Under what conditions would stratification have no effect? Under what conditions would clustering have no effect?

  2. A telephone survey has a 5% response rate. A critic argues the results are worthless. How would you respond? What additional information would you want?

  3. "The question asked was: 'Don't you agree that government should do more to help the poor?' 75% agreed." What problems do you see with this question and interpretation?

Applied

  1. Find documentation for a major survey (CPS, General Social Survey, Health and Retirement Study, etc.):

    • What is the target population?

    • What is the sampling design?

    • What is the response rate?

    • How are weights constructed?

    • Write a brief (1 page) quality assessment.

  2. Design a brief survey (10 questions) to measure student satisfaction with their university. Apply the principles from this chapter. Have 3 people complete your survey using think-aloud protocols. What problems did you discover?

Discussion

  1. Response rates have fallen from 70% to under 10% over the past few decades, yet many surveys continue to produce estimates that match external benchmarks. How is this possible? What does it tell us about the relationship between response rates and bias?


Technical Appendix

A. Variance Estimation for Complex Surveys

Linearization (Taylor series): For nonlinear statistics (ratios, regression coefficients), linearize and apply design-based variance formulas.

Replication methods:

  • Jackknife: Delete one PSU at a time; estimate variance from variation across replicates (see the sketch after this list)

  • Bootstrap: Resample PSUs with replacement; estimate variance from bootstrap distribution

  • Balanced repeated replication (BRR): Systematic replicate construction for stratified designs
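
A minimal delete-one-PSU (JK1) jackknife for a weighted mean, ignoring stratification; the data are simulated and weights are set equal for simplicity:

```python
import numpy as np

rng = np.random.default_rng(3)
n_psu, per_psu = 30, 25
psu = np.repeat(np.arange(n_psu), per_psu)
y = rng.normal(0, 1, n_psu * per_psu) + rng.normal(0, 0.5, n_psu)[psu]  # cluster effects
w = np.ones_like(y)                                                      # equal weights

def wmean(y, w):
    return np.sum(w * y) / np.sum(w)

theta_hat = wmean(y, w)
# Recompute the estimate with each PSU deleted in turn
replicates = np.array([wmean(y[psu != g], w[psu != g]) for g in range(n_psu)])
var_jk = (n_psu - 1) / n_psu * np.sum((replicates - theta_hat) ** 2)
print(f"estimate = {theta_hat:.3f}, jackknife SE = {np.sqrt(var_jk):.3f}")
```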

B. Optimal Allocation in Stratified Sampling

Neyman allocation: Allocate sample to strata proportional to stratum size times stratum SD:

$n_h \propto N_h S_h$

This minimizes variance for fixed total sample size. The intuition: sample more heavily from strata that are large (more weight in population mean) and variable (more uncertainty to reduce).

Cost-weighted allocation: When per-interview costs vary by stratum (e.g., face-to-face in rural areas vs. telephone in urban areas):

$n_h \propto \frac{N_h S_h}{\sqrt{c_h}}$

where $c_h$ is the per-interview cost in stratum $h$. This minimizes variance for a fixed total budget. We sample less from expensive strata, but not proportionally less; the square root reflects the tradeoff between cost savings and precision loss.
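
A small sketch applying both allocation rules to the two-stratum salary example from Section 5.1; the per-interview costs are hypothetical:

```python
import numpy as np

N = np.array([8_000, 2_000])       # stratum sizes (non-managers, managers)
S = np.array([10_000, 25_000])     # stratum standard deviations
n_total = 400

# Neyman allocation: n_h proportional to N_h * S_h
neyman = N * S / np.sum(N * S) * n_total
print(neyman.round())              # ≈ [246, 154]

# Cost-weighted allocation with hypothetical per-interview costs,
# scaled to 400 interviews here just to compare relative shares
c = np.array([20.0, 80.0])         # assumed cheaper to interview non-managers
shares = (N * S / np.sqrt(c)) / np.sum(N * S / np.sqrt(c))
print((shares * n_total).round())  # samples less from the expensive stratum
```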
