Chapter 5: Survey Methods
Opening Question
How do we learn about millions of people by talking to only a few thousand?
Chapter Overview
Surveys are the workhorses of social science data collection. Want to know unemployment rates, political opinions, health behaviors, or consumer confidence? You probably need a survey. Yet surveys are also fragile instruments: small changes in question wording produce large changes in answers, response rates have plummeted, and the people who respond may differ systematically from those who don't.
This chapter covers survey methodology—the science of asking questions and generalizing from samples to populations. We'll examine sampling design (how to select respondents), questionnaire design (how to ask questions), and survey operations (how to conduct fieldwork). The goal is to make you a sophisticated consumer and critic of survey data, able to assess quality and understand limitations.
What you will learn:
How probability sampling enables inference from sample to population
The effects of stratification, clustering, and complex designs on precision
Principles of questionnaire design and common question pitfalls
Sources of survey error: nonresponse, measurement error, mode effects
How to assess survey quality and when to trust survey estimates
Prerequisites: Chapter 3 (Statistical Foundations)
Surveys Beyond Academia
While this chapter focuses on surveys for research, surveys pervade modern life far beyond academic inquiry:
Market research: Companies constantly survey consumers about product preferences, brand awareness, and satisfaction. A new product launch might involve extensive survey research to understand demand, optimize pricing, and identify target demographics. Market research firms maintain large respondent panels, typically recruited through opt-in mechanisms.
Political polling: Election forecasts, approval ratings, and issue polling shape political strategy and media coverage. Polling organizations face intense scrutiny of their methods, especially after high-profile misses.
Litigation: Surveys serve as evidence in trademark disputes (consumer confusion), discrimination cases (attitudes and experiences), and class actions (calculating damages). Courts apply specific standards for survey admissibility (see Diamond, Reference Guide on Survey Research).
Price indices: The Consumer Price Index relies on surveys of what consumers purchase and where they shop. These expenditure surveys determine CPI weights—which matter enormously for inflation-indexed contracts, Social Security adjustments, and monetary policy.
Quality measurement: Patient satisfaction (HCAHPS), customer experience (NPS), and employee engagement surveys drive organizational decisions and sometimes determine reimbursement rates.
Implementation: Modern surveys typically use platforms like Qualtrics, SurveyMonkey, or Typeform for design and administration. Respondents may come from panel companies (e.g., Dynata, Prolific), probability-based panels (e.g., NORC's AmeriSpeak), or convenience samples. The choice of panel profoundly affects data quality and generalizability (see Section 5.3).
Historical Context: The Birth of Scientific Sampling
Modern survey sampling emerged from a remarkable intellectual shift in the early 20th century.
Before probability sampling (pre-1930s): Surveys used "quota sampling" or "purposive selection"—interviewers chose respondents to match known population characteristics. The Literary Digest famously predicted Landon would defeat Roosevelt in 1936 based on 2.4 million responses. They were catastrophically wrong; their sample (from telephone directories and car registrations) overrepresented wealthy Republicans.
Jerzy Neyman (1934) established the mathematical foundations of probability sampling, proving that random selection plus appropriate weighting yields unbiased population estimates with quantifiable uncertainty.
The U.S. Census Bureau and Gallup Organization pioneered practical implementation in the 1930s-40s. George Gallup correctly predicted Roosevelt's victory in 1936 using a much smaller but properly designed probability sample.
The survey revolution: By mid-century, probability sampling became standard for government statistics (Current Population Survey, 1940s), market research, and academic social science.
Contemporary challenges: Response rates have fallen from 70%+ in the 1970s to under 10% for many surveys today. The rise of cell phones disrupted telephone sampling. Online panels offer convenience but raise selection concerns. The field continues to evolve.
Major U.S. Government Surveys
Understanding the major federal statistical surveys is essential for empirical researchers. These surveys differ in purpose, design, and relationship to the decennial Census.
The Decennial Census (every 10 years) attempts to count every person in the United States. It provides the "gold standard" population counts and serves as the sampling frame for other surveys. The Census also yields detailed demographic data for small geographic areas. Its limitation is frequency—by year 5, the data may be outdated.
The American Community Survey (ACS) replaced the Census long form in 2005. It samples about 3.5 million households annually, providing continuous measurement of demographic, economic, housing, and social characteristics. Unlike the decennial Census, the ACS is a sample (about 1% of the population), so estimates for small areas have larger margins of error. The ACS bridges Census years, providing 1-year estimates for areas with 65,000+ population and 5-year estimates for smaller areas.
The Current Population Survey (CPS) is the source of monthly unemployment statistics. It surveys about 60,000 households monthly, focusing on labor force status. Annual supplements (March ASEC, school enrollment, voting, etc.) provide additional detail. The CPS pioneered many modern survey techniques and remains the benchmark for labor market data.
The Employment Situation combines CPS household data with the Current Employment Statistics (CES) establishment survey, which samples about 670,000 worksites. This dual approach—asking households about employment and asking businesses about payrolls—provides cross-validation. Discrepancies between the two series (e.g., during economic turning points) generate extensive analysis.
Other major surveys:
Survey of Income and Program Participation (SIPP): Panel survey tracking income, program participation, and family dynamics
National Health Interview Survey (NHIS): Annual health status and healthcare access
American Housing Survey (AHS): Housing conditions and costs
Consumer Expenditure Survey (CE): Spending patterns, used for CPI weights
These surveys share common infrastructure. Most use the Census as a sampling frame, employ multi-stage stratified cluster designs, and benefit from decades of methodological refinement. The Bureau of Labor Statistics, Census Bureau, and National Center for Health Statistics maintain quality standards that academic surveys often cannot match.
5.1 Sampling Design and Inference
The Target Population
Every survey begins with a target population—the group about which we want to draw conclusions.
Examples:
U.S. adults age 18+ (political polls)
Civilian non-institutional population age 16+ (labor force surveys)
Households in a city (local surveys)
Businesses with 20+ employees (establishment surveys)
Frame population: The list from which we actually sample. Ideally equals the target population; in practice, there are gaps.
Coverage error: Difference between target and frame populations. Examples:
Phone surveys miss people without phones
Address-based sampling misses the homeless
Business registries exclude informal enterprises
Simple Random Sampling (SRS)
Definition 5.1 (Simple Random Sample): A sample of size n is a simple random sample if every subset of n units from the population has equal probability of selection.
Under SRS, the sample mean $\bar{y}$ is an unbiased estimator of the population mean $\mu$: $E[\bar{y}] = \mu$
The variance of the sample mean is: $\mathrm{Var}(\bar{y}) = \frac{\sigma^2}{n} \cdot \frac{N - n}{N - 1}$
where $\sigma^2$ is the population variance and $\frac{N - n}{N - 1}$ is the finite population correction (approximately 1 when $n \ll N$).
Confidence interval for the population mean: $\bar{y} \pm z_{\alpha/2} \cdot \sqrt{s^2 / n}$
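A minimal sketch of these SRS calculations in Python (the sample values and population size below are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

N = 10_000                          # assumed population size
y = rng.normal(50, 10, size=400)    # hypothetical sample of n = 400 measurements
n = len(y)

ybar = y.mean()                     # estimate of the population mean
s2 = y.var(ddof=1)                  # sample variance (estimates sigma^2)

fpc = (N - n) / (N - 1)             # finite population correction, ~1 when n << N
se = np.sqrt(s2 / n * fpc)          # estimated standard error of the sample mean

z = 1.96                            # z_{alpha/2} for a 95% confidence interval
print(f"mean = {ybar:.2f}, SE = {se:.2f}, "
      f"95% CI = ({ybar - z*se:.2f}, {ybar + z*se:.2f})")
```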
Stratified Sampling
Idea: Divide the population into non-overlapping strata (groups); sample independently within each stratum.
Why stratify?
Precision: If strata are homogeneous internally, stratification reduces variance
Guaranteed representation: Ensure all subgroups are represented
Administrative convenience: Different sampling procedures in different areas
Stratified estimator of the population mean: $\bar{y}_{st} = \sum_{h=1}^{H} W_h \bar{y}_h$
where $W_h = N_h / N$ is stratum $h$'s population share and $\bar{y}_h$ is the stratum sample mean.
Variance (independent SRS within each stratum, ignoring finite population corrections): $\mathrm{Var}(\bar{y}_{st}) = \sum_{h=1}^{H} W_h^2 \cdot \frac{S_h^2}{n_h}$
Stratification never increases variance (compared to SRS of same total size) and often substantially reduces it.
Worked Example: Stratified vs. SRS
Population: 10,000 employees at a company. We want to estimate mean salary.
8,000 non-managers: mean = 50,000 USD, SD = 10,000
2,000 managers: mean = 100,000 USD, SD = 25,000
True population mean:
$0.8 \times 50{,}000 + 0.2 \times 100{,}000 = 60{,}000$ USD
SRS with n = 400:
Overall SD ≈ 24,000 (including between-group variation)
SE(mean) ≈ 24,000 / √400 = 1,200
Stratified with n=400 (proportional allocation: 320 non-managers, 80 managers):
$\mathrm{SE}(\bar{y}_{st}) = \sqrt{0.8^2 \times \frac{10{,}000^2}{320} + 0.2^2 \times \frac{25{,}000^2}{80}} = \sqrt{200{,}000 + 312{,}500} \approx 716$
Gain from stratification: SE reduced by 40%! Stratification is particularly valuable when strata differ substantially in means.
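The arithmetic in this worked example is easy to verify in Python; the numbers below simply restate the scenario:

```python
import numpy as np

W = np.array([0.8, 0.2])            # population shares: non-managers, managers
S = np.array([10_000, 25_000])      # within-stratum standard deviations
n_h = np.array([320, 80])           # proportional allocation of n = 400

# Stratified SE: sqrt( sum_h W_h^2 * S_h^2 / n_h )
se_strat = np.sqrt(np.sum(W**2 * S**2 / n_h))

# SRS comparison, using the overall SD of roughly 24,000
se_srs = 24_000 / np.sqrt(400)

print(f"SE stratified ~ {se_strat:.0f}")          # ~716
print(f"SE SRS        ~ {se_srs:.0f}")            # 1,200
print(f"reduction     ~ {1 - se_strat/se_srs:.0%}")
```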
Cluster Sampling
Idea: Sample groups (clusters) of units rather than individual units; measure all or some units within selected clusters.
Why cluster?
Cost efficiency: Traveling to interview dispersed individuals is expensive; interviewing everyone in selected neighborhoods is cheaper
No individual frame: May not have a list of individuals, but have a list of schools, villages, or households
Example: To survey U.S. adults, we might:
Sample 100 counties (primary sampling units)
Within selected counties, sample census tracts
Within selected tracts, sample households
Interview adults in selected households
Stratification vs. Clustering: A Crucial Distinction
Stratification and clustering both involve dividing the population into groups, but they work in opposite directions:
Stratification:
Divide population into groups (strata)
Sample from every stratum
Works best when strata are internally homogeneous (similar within) but externally heterogeneous (different between)
Effect: Reduces variance
Clustering:
Divide population into groups (clusters)
Sample some clusters, then sample units within selected clusters
Problem: clusters are typically internally heterogeneous but externally similar (geographic neighbors resemble each other)
Effect: Increases variance
The key intuition: stratification guarantees representation of all groups, which stabilizes estimates. Clustering means you might miss entire clusters, introducing extra variability. If the clusters you happen to select are atypical, your estimate will be off.
When both are used together (as in most national surveys), the stratification typically reduces variance while the clustering increases it. The net effect depends on the specific design.
The Design Effect
The design effect measures how the complex design affects precision compared to simple random sampling:
Definition 5.2 (Design Effect, DEFF): The design effect is the ratio of the variance under the actual design to the variance under SRS of the same size:
$\mathrm{DEFF} = \frac{\mathrm{Var}(\bar{y}_{\text{complex}})}{\mathrm{Var}(\bar{y}_{\text{SRS}})}$
Effective sample size: $n_{\text{eff}} = n / \mathrm{DEFF}$
Typical values:
DEFF < 1: Stratification dominates (good)
DEFF > 1: Clustering dominates (common in household surveys)
DEFF = 2-4 typical for national household surveys with geographic clustering
Worked Example: Intraclass Correlation and Clustering
The problem: Suppose you survey 1,000 students about test anxiety. You could sample 1,000 students from across the country (expensive, logistically difficult) or sample 50 schools and interview 20 students per school (much cheaper). Both give you n = 1,000. Are they equally informative?
No. Students in the same school share experiences—same teachers, same curriculum, similar peer groups. If one student in a school reports high anxiety, others in that school are more likely to as well. This within-cluster similarity means that each additional student from the same school provides less new information than a student from a different school.
The intraclass correlation (ρ) quantifies this similarity—the proportion of total variance that exists between clusters (rather than within). The design effect for cluster sampling is approximately:
$\mathrm{DEFF} \approx 1 + (m - 1)\rho$
where m is the number of units per cluster.
Scenario: Survey of students, sampling 50 schools with 20 students each (n = 1,000).
If $\rho = 0.05$ (students in the same school only slightly similar—5% of variance is between-school):
$\mathrm{DEFF} = 1 + (20 - 1) \times 0.05 = 1.95$
Effective sample size: $1000 / 1.95 \approx 513$
The clustering cuts effective precision nearly in half! Your 1,000 respondents provide only as much information as 513 would under simple random sampling.
Practical implications:
When ρ is larger (more similarity within clusters), the penalty is worse
When clusters are larger (more students per school), the penalty is worse
Even small ICC values matter when cluster sizes are large
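A quick sketch of this design-effect arithmetic, using the values from the scenario above:

```python
def design_effect(m: int, rho: float) -> float:
    """Approximate design effect for equal-sized clusters: 1 + (m - 1) * rho."""
    return 1 + (m - 1) * rho

n_clusters, m, rho = 50, 20, 0.05
n = n_clusters * m                  # 1,000 students in total

deff = design_effect(m, rho)        # 1.95
n_eff = n / deff                    # ~513

print(f"DEFF = {deff:.2f}, effective sample size = {n_eff:.0f}")
```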
Complex Survey Designs
Real surveys combine stratification, clustering, and unequal selection probabilities:
Multi-stage sampling: Sample PSUs (counties), then secondary units (tracts), then ultimate units (households)
Unequal probabilities: Oversample minorities to ensure adequate subgroup sample sizes; weight to restore representativeness
Survey weights: The inverse of selection probability, adjusted for nonresponse: $w_i = \frac{1}{\pi_i} \times (\text{nonresponse adjustment}) \times (\text{post-stratification})$
Weighted estimator: $\bar{y}_w = \frac{\sum_i w_i y_i}{\sum_i w_i}$
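As a sketch, the weighted point estimate is a one-liner; the outcomes and weights below are invented. Correct standard errors additionally require the design information (strata and PSUs), which is why survey-aware software matters:

```python
import numpy as np

y = np.array([1, 0, 1, 1, 0, 1])                             # e.g., 1 = employed, 0 = not
w = np.array([850.0, 1200.0, 600.0, 950.0, 1100.0, 700.0])   # final survey weights

ybar_w = np.sum(w * y) / np.sum(w)                            # weighted estimator of the mean
print(f"weighted mean = {ybar_w:.3f}")
```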
5.2 Questionnaire Design
The Art and Science of Asking Questions
Question wording matters—a lot. Small changes produce large differences in responses.
Classic example (Schuman and Presser):
"Do you think the United States should allow public speeches against democracy?"
"Do you think the United States should forbid public speeches against democracy?"
"Allow" and "forbid" are logical opposites, yet the two framings yield answers that differ by roughly 20 percentage points in split-ballot experiments. "Forbid" sounds harsh; people avoid it.
Question Types
Closed-ended: Respondent chooses from provided options
Easier to analyze
May miss important response categories
Susceptible to order effects
Open-ended: Respondent provides own answer
Richer information
Expensive to code
Higher respondent burden
Rating scales: Likert scales (agree-disagree), semantic differentials, numerical ratings
Number of points (5 vs. 7 vs. 11)
Labeled vs. unlabeled endpoints
Presence of midpoint
Principles of Good Question Design
1. Use simple, familiar language
Bad: "What is your current labor force participation status?"
Good: "Last week, did you do any work for pay?"
2. Avoid double-barreled questions
Bad: "Do you think taxes are too high and government spending should be cut?"
Good: Ask separately about taxes and spending
3. Avoid leading questions
Bad: "Don't you agree that the economy is doing poorly?"
Good: "How would you rate current economic conditions?"
4. Avoid loaded or emotional language
Bad: "Do you support killing innocent babies through abortion?"
Good: "Do you think abortion should be legal in all cases, legal in some cases, or illegal in all cases?"
5. Provide balanced response options
Bad: "Rate your satisfaction: Very satisfied / Satisfied / Somewhat satisfied"
Good: "Rate your satisfaction: Very satisfied / Somewhat satisfied / Neither / Somewhat dissatisfied / Very dissatisfied"
6. Ensure questions are answerable
Bad: "How many hours did you watch television last month?"
Good: "Yesterday, about how many hours did you watch television?"
Response Effects
Acquiescence: Tendency to agree with statements regardless of content
Solution: Mix positively and negatively worded items
Social desirability: Tendency to give socially acceptable answers
Solution: Indirect questions, list experiments, self-administered modes
Primacy/recency effects: First or last response options may be chosen more often
Solution: Randomize option order
Context effects: Earlier questions influence later responses
Solution: Careful ordering; separate sensitive topics
Cognitive Testing
Before fielding a survey, test questions through:
Cognitive interviews: Ask respondents to "think aloud" while answering; probe for understanding
Focus groups: Group discussions about question meaning and difficulty
Pilot testing: Small-scale field test; analyze response patterns for problems
5.3 Survey Operations
Data Collection Modes
Face-to-face: Interviewer visits respondent
Highest response rates
Most expensive
Best for complex questionnaires
Interviewer effects possible
Telephone: Random digit dialing (RDD) or list-based
Lower cost than F2F
Response rates declining (screening, cell phones)
Cannot show visual materials
Mail/paper: Self-administered questionnaire sent by post
Low cost
Low response rates
Limited questionnaire complexity
No interviewer effects
Web/online: Internet-based surveys
Very low cost
Coverage concerns (not everyone online)
Probability panels expensive; convenience samples biased
Complex questionnaires possible (branching, multimedia)
Mixed-mode: Combine modes to balance cost, coverage, response
Example: Mail invitation with web option, phone follow-up for nonrespondents
Probability Panels vs. Opt-In Panels
The rise of online surveys has created a fundamental distinction in data quality:
Probability-based panels recruit members through probability sampling (e.g., random address-based sampling), then provide internet access if needed. Examples include NORC's AmeriSpeak, Pew's American Trends Panel, and the RAND American Life Panel. These panels support population inference because every member had a known probability of selection.
Opt-in panels recruit volunteers who sign up to take surveys, often for monetary incentives. Examples include Amazon Mechanical Turk, Prolific, and commercial panels from Dynata, Lucid, and others. These samples are convenient and cheap but cannot support population inference—we don't know who chose not to join.
Key differences:
| Dimension | Probability-based panel | Opt-in panel |
| --- | --- | --- |
| Recruitment | Random selection | Self-selection |
| Coverage | Near-complete | Biased toward internet-active |
| Population inference | Valid | Invalid |
| Cost per complete | $20-50+ | $1-5 |
| Speed | Moderate | Fast |
| Sample size | Limited by panel | Large |
When to use which:
Policy-relevant prevalence estimates: Probability panel required
Causal experiments with internal validity focus: Opt-in often acceptable
Exploratory or pilot research: Opt-in can be useful
Sensitive topics with stigma: Opt-in may get more honest responses (less social pressure)
Online Survey Methodology
Online surveys present distinct methodological challenges:
Attention and engagement: Unlike interviewer-administered surveys, no one ensures respondents are paying attention. Common problems include "speeders" (completing implausibly fast), "straightliners" (giving same response to all items), and bots (automated responses). Researchers increasingly include attention checks, CAPTCHAs, and timing filters.
Device effects: Respondents use phones, tablets, and computers with different screen sizes and interfaces. Mobile respondents may give shorter open-ended responses and show different dropout patterns. Surveys should be mobile-optimized.
Satisficing and survey fatigue: Without interviewer encouragement, respondents take cognitive shortcuts. This manifests as selecting the first reasonable option, ignoring instructions, and abandoning lengthy surveys. Best practice: keep surveys short, vary question formats, and consider incentive design.
Non-naïve respondents: Frequent survey-takers learn what researchers want and may game responses. On platforms like MTurk, the same respondents appear across many studies, raising concerns about treatment effect heterogeneity and cross-study contamination.
Best practices for online surveys:
Pilot extensively with target population
Include attention checks but don't over-filter
Keep surveys under 15 minutes
Use forced-response sparingly (people skip for reasons)
Monitor completion times and flag outliers (see the sketch after this list)
Validate against external benchmarks when possible
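One way to act on the "monitor completion times and flag outliers" advice is sketched below; the thresholds and toy data are illustrative choices, not standards:

```python
import pandas as pd

# Hypothetical online responses: completion time plus a five-item rating grid
df = pd.DataFrame({
    "duration_sec": [610, 95, 430, 520, 80],
    "q1": [4, 5, 3, 2, 3], "q2": [5, 5, 2, 3, 3],
    "q3": [3, 5, 4, 2, 3], "q4": [4, 5, 3, 1, 3], "q5": [2, 5, 4, 2, 3],
})
items = ["q1", "q2", "q3", "q4", "q5"]

# Speeders: implausibly fast relative to the median completion time
df["speeder"] = df["duration_sec"] < 0.4 * df["duration_sec"].median()

# Straightliners: identical responses across all grid items
df["straightliner"] = df[items].nunique(axis=1) == 1

print(df[["duration_sec", "speeder", "straightliner"]])
```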
Mode Effects
The same question asked different ways yields different answers:
Privacy: Sensitive behaviors (drug use, sexual behavior) more reported in self-administered modes
Interviewer presence: Socially desirable responses higher in interviewer modes
Visual vs. aural: Response scale presentation differs between modes
Satisficing: Web respondents may speed through; F2F interviewers maintain engagement
Nonresponse
Nonresponse is the dominant quality concern in modern surveys.
Types:
Unit nonresponse: Sampled person doesn't participate at all
Item nonresponse: Respondent skips particular questions
Response rate calculation (AAPOR standards): $\mathrm{RR} = \frac{\text{Completes}}{\text{Completes} + \text{Partials} + \text{Refusals} + \text{Noncontacts} + \text{Unknown eligibility}}$
Trends: Response rates have fallen dramatically:
1970s telephone surveys: 70-80%
1990s: 50-60%
Today: Often under 10%
When Nonresponse Causes Bias
Theorem 5.1 (Nonresponse Bias): The bias in the respondent mean equals:
$\mathrm{Bias}(\bar{y}_r) = (1 - \mathrm{RR}) \times (\bar{y}_r - \bar{y}_{nr})$
where $\bar{y}_r$ is the respondent mean and $\bar{y}_{nr}$ is the nonrespondent mean.
This formula reveals two key insights. First, bias depends on both the response rate and the difference between respondents and nonrespondents: a 50% response rate with no respondent-nonrespondent difference produces no bias, while even a 90% response rate can yield meaningful bias if that difference is large. Second, we can never directly observe $\bar{y}_{nr}$—that's the fundamental problem. We must rely on auxiliary information, external benchmarks, or assumptions.
Implications:
Low response rate alone doesn't guarantee bias
Bias depends on differential nonresponse
A 10% response rate with random nonresponse has no bias
A 70% response rate with selective nonresponse can have large bias
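A small sketch that puts the response-rate formula and Theorem 5.1 together; all disposition counts and means are hypothetical:

```python
def response_rate(completes, partials, refusals, noncontacts, unknown):
    """Approximate AAPOR-style response rate from disposition counts."""
    return completes / (completes + partials + refusals + noncontacts + unknown)

def nonresponse_bias(rr, ybar_r, ybar_nr):
    """Bias of the respondent mean: (1 - RR) * (ybar_r - ybar_nr)."""
    return (1 - rr) * (ybar_r - ybar_nr)

rr = response_rate(completes=900, partials=50, refusals=3000,
                   noncontacts=5000, unknown=1050)
print(f"response rate = {rr:.1%}")                       # 9.0%

# Random nonresponse: respondents and nonrespondents identical -> no bias
print(nonresponse_bias(rr, ybar_r=0.55, ybar_nr=0.55))   # 0.0
# Selective nonresponse: a 10-point gap -> bias of ~0.09 despite the same RR
print(nonresponse_bias(rr, ybar_r=0.55, ybar_nr=0.45))
```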
Weighting and Adjustment
Nonresponse adjustment: Reweight respondents to compensate for differential nonresponse
Model response propensity; weight by inverse propensity
Requires auxiliary data available for respondents and nonrespondents
Post-stratification: Adjust weights so weighted sample matches known population totals
Example: Ensure sample matches Census age × gender × race × education distribution
Reduces bias if auxiliary variables predict both nonresponse and outcomes
Raking (iterative proportional fitting): Adjust to match multiple marginal distributions simultaneously
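A minimal raking sketch, assuming every adjustment cell is non-empty and ignoring the convergence diagnostics and weight trimming that production survey software would handle:

```python
import numpy as np

def rake(weights, margins, targets, iters=25):
    """Iteratively scale weights so each weighted margin matches its target shares.

    margins: list of arrays of group labels (one array per adjustment variable)
    targets: list of dicts mapping group label -> known population share
    """
    w = weights.astype(float).copy()
    for _ in range(iters):
        for groups, shares in zip(margins, targets):
            total = w.sum()
            factors = {g: share / (w[groups == g].sum() / total)
                       for g, share in shares.items()}
            for g, f in factors.items():
                w[groups == g] *= f
    return w

# Hypothetical respondents: sex (0/1) and age group (0/1/2)
sex = np.array([0, 0, 0, 1, 1, 0, 1, 0])
age = np.array([0, 1, 2, 0, 1, 1, 2, 0])

w = rake(np.ones(len(sex)),
         margins=[sex, age],
         targets=[{0: 0.49, 1: 0.51},                  # population sex shares
                  {0: 0.30, 1: 0.40, 2: 0.30}])        # population age shares
print(np.round(w, 2))
```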
Multilevel Regression and Poststratification (MRP)
Traditional post-stratification runs into the "curse of dimensionality": crossing age × gender × race × education × state creates thousands of cells, many with few or no observations. MRP (Gelman and Little 1997; Park, Gelman, and Bafumi 2004) solves this through multilevel modeling.
The MRP approach:
Model the response: Fit a multilevel (hierarchical) regression predicting the outcome from demographics and geography: $Y_i = \alpha_{\text{state}[i]} + \beta_{\text{age}[i]} + \gamma_{\text{race}[i]} + \dots + \varepsilon_i$, with random effects for states, age groups, etc.
Predict for all cells: Use the model to predict $\hat{Y}_j$ for every demographic × geographic cell—even cells with no respondents.
Poststratify: Weight predictions by known population counts in each cell: $\hat{Y}_{\text{pop}} = \sum_j N_j \hat{Y}_j \big/ \sum_j N_j$
Why it works: The multilevel model "borrows strength" across cells. A sparse cell (e.g., Hispanic men aged 65+ in Wyoming) is informed by other cells with similar demographics or geography. This enables adjustment on many more dimensions than traditional weighting.
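The multilevel model itself is usually fit with specialized tools (lme4 or brms in R, Stan, or PyMC), so the sketch below shows only the poststratification step, assuming cell-level predictions are already in hand; the cell counts and predictions are placeholders:

```python
import numpy as np
import pandas as pd

# Placeholder predictions for demographic x geographic cells, plus census counts
cells = pd.DataFrame({
    "state":   ["WY", "WY", "CA", "CA"],
    "age_grp": ["18-44", "45+", "18-44", "45+"],
    "N":       [200_000, 250_000, 15_000_000, 14_000_000],   # population counts
    "y_hat":   [0.42, 0.55, 0.61, 0.58],                      # model predictions
})

# National estimate: population-weighted average of cell predictions
y_pop = np.average(cells["y_hat"], weights=cells["N"])

# State-level estimates: the same weighting within each state
by_state = (cells.assign(Ny=cells["N"] * cells["y_hat"])
                 .groupby("state")[["Ny", "N"]].sum())
y_state = by_state["Ny"] / by_state["N"]

print(f"national estimate = {y_pop:.3f}")
print(y_state.round(3))
```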
Applications:
State-level opinion: Estimate public opinion for all 50 states from a national sample
Subgroup analysis: Estimate outcomes for small demographic groups
Non-probability samples: Make opt-in panels more representative by aggressive MRP adjustment
MRP for Opt-In Samples
A controversial but increasingly common use of MRP is to salvage non-probability (opt-in) samples. The idea: if selection into the sample depends on observable characteristics, and if the outcome model is correct, MRP can adjust for selection bias.
Conditions for success:
Selection depends on variables included in the model
The outcome-covariate relationship is correctly specified
Population cell counts are accurately known
Limitations: MRP cannot fix samples with fundamentally different populations (opt-in panelists may differ in unobservable ways from non-joiners). It also requires substantial modeling expertise and can give false confidence when the model is wrong.
Current evidence: Kennedy et al. (2020) show that MRP-adjusted opt-in samples can match probability sample accuracy for some outcomes, but fail badly for others—particularly those related to political engagement (since opt-in panelists are unusually engaged).
5.4 Total Survey Error Framework
The Error Sources
The Total Survey Error (TSE) framework organizes all sources of error:
Representation errors (who is measured):
Coverage error: Target ≠ Frame population
Sampling error: Frame ≠ Sample
Nonresponse error: Sample ≠ Respondents
Measurement errors (how they're measured):
Validity: Construct ≠ Operationalization
Measurement error: True value ≠ Measured value

Figure 5.1 illustrates how errors accumulate at each stage. On the representation side, we start with the target population we want to learn about, but we can only sample from the frame population (coverage error), we only observe a sample (sampling error), and not everyone in the sample responds (nonresponse error). On the measurement side, we start with an abstract construct (e.g., "job satisfaction"), operationalize it as survey questions (validity), record responses that may not reflect true values (measurement error), and process responses into data (processing error). Total survey error is the cumulative impact of all these sources.
Tradeoffs
Survey design involves tradeoffs:
Cost vs. quality: Higher response rates are expensive
Timeliness vs. thoroughness: Quick surveys sacrifice depth
Standardization vs. flexibility: Rigid protocols ensure consistency; adaptive designs improve efficiency
Coverage vs. convenience: Probability samples are expensive; convenience samples are cheap but biased
Assessing Survey Quality
Questions to ask about any survey:
What is the target population? Is the frame complete?
What sampling method was used? What is the design effect?
What is the response rate? How was it calculated?
What mode(s) of data collection?
What weighting or adjustments were applied?
Are questionnaire instruments documented and validated?
5.5 Survey Experiments
Survey experiments embed randomized treatments within surveys, combining experimental control with survey-based measurement. They're increasingly common in political science, sociology, and economics.
Types of Survey Experiments
Question wording experiments: Randomly vary how questions are asked
Example: Half of respondents see "Do you favor allowing women to have abortions?" while half see "Do you favor prohibiting abortions?" Different wording, same policy—different support levels reveal framing effects.
Vignette experiments: Present hypothetical scenarios with randomly varied attributes
Example: "A company has 100 employees. [It made $1 million / $100,000 in profits last year.] The CEO proposes [a 10% / 2% wage cut]. How fair is this?" Randomizing profits and cut size identifies their independent effects on fairness perceptions.
Conjoint experiments: Present multi-attribute profiles for comparison
Example: Respondents see two hypothetical job candidates varying on education, experience, gender, and race. "Which would you hire?" Averaging over many choices identifies the effect of each attribute on selection probability.
List experiments (item count technique): Measure sensitive attitudes indirectly
Example: Control group counts how many of 4 innocuous statements they agree with. Treatment group gets the same 4 plus a sensitive item. The difference in counts reveals the prevalence of the sensitive attitude without anyone admitting it directly.
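The list-experiment estimator is just a difference in mean counts, as the sketch below shows with made-up data:

```python
import numpy as np

# Hypothetical item counts: control sees 4 innocuous items,
# treatment sees the same 4 plus one sensitive item
control = np.array([2, 1, 3, 2, 2, 1, 3, 2, 2, 1])
treatment = np.array([3, 2, 3, 2, 3, 2, 4, 3, 2, 2])

prevalence = treatment.mean() - control.mean()      # estimated prevalence

# Standard error of a difference in means
se = np.sqrt(treatment.var(ddof=1) / len(treatment)
             + control.var(ddof=1) / len(control))

print(f"estimated prevalence = {prevalence:.2f} (SE ~ {se:.2f})")
```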
Advantages of Survey Experiments
Causal identification: Random assignment enables clean causal inference
Mechanism isolation: Can vary single factors to identify their effects
Sensitive topics: Techniques like list experiments reduce social desirability bias
Scale: Can collect thousands of experimental observations cheaply
Factorial designs: Can test many factors simultaneously
Limitations
Hypothetical scenarios: People may respond differently to vignettes than real situations
Demand effects: Respondents may guess the hypothesis
External validity: Lab/survey responses may not predict real behavior
Sample issues: Online samples may not represent target populations
Attention: Respondents may not read scenarios carefully
Box: Conjoint Analysis for Causal Attribution
Conjoint experiments are particularly powerful for studying discrimination and preferences because they:
Randomize multiple attributes simultaneously—avoiding omitted variable bias
Force tradeoffs—revealing relative importance of attributes
Mask purpose—reducing social desirability bias
Example: Hainmueller and Hopkins (2015) study immigration preferences using conjoint. Respondents see pairs of hypothetical immigrants varying on country of origin, education, language, occupation, and other attributes. The experimental design identifies the causal effect of each attribute on admission support—impossible with observational data where attributes correlate.
Implementation: The cjoint package in R implements the analysis; Python users can use statsmodels with standard errors clustered by respondent.
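As a sketch of the statsmodels route mentioned above, the code below simulates a toy conjoint dataset and estimates attribute effects with a linear probability model and respondent-clustered standard errors (the data, attribute names, and effect sizes are invented):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n_resp, tasks = 200, 5                       # respondents, choice tasks each
rows = n_resp * tasks * 2                    # two profiles per task

df = pd.DataFrame({
    "resp_id": np.repeat(np.arange(n_resp), tasks * 2),
    "educ": rng.choice(["HS", "BA", "MA"], size=rows),
    "exper": rng.choice(["low", "high"], size=rows),
})
# Simulated choices with modest effects of education and experience
p = 0.35 + 0.10 * (df["educ"] == "BA") + 0.15 * (df["exper"] == "high")
df["chosen"] = rng.binomial(1, p)

# Linear probability model of choice on randomized attributes,
# with standard errors clustered by respondent
fit = smf.ols("chosen ~ C(educ) + C(exper)", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["resp_id"]})
print(fit.summary().tables[1])
```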
Design Considerations
Power: Survey experiments often require large samples (1000+ respondents)
Balance: Check that randomization succeeded across conditions
Attention checks: Include questions to verify respondents are reading carefully
Pre-registration: Specify hypotheses and analysis before collecting data
Manipulation checks: Verify respondents perceived the treatment as intended
5.6 Running Example: The Current Population Survey
Background
The Current Population Survey (CPS) is the U.S. government's primary source of labor force statistics. Monthly unemployment rates, labor force participation, and earnings data all come from the CPS.
Design
Target population: Civilian non-institutional population age 16+
Sample design:
Multi-stage stratified cluster sample
~60,000 households per month
Stratified by state and urban/rural
PSUs are counties or groups of counties
Ultimate sampling unit is address
Rotation: 4-8-4 pattern
Household in sample for 4 consecutive months
Out 8 months
Back in for 4 months
Enables monthly change estimates
Mode: Computer-assisted personal interviewing (CAPI) for first and fifth month; computer-assisted telephone interviewing (CATI) for others
Key Questions
Employment status (previous week):
Did you do any work for pay or profit?
Did you have a job from which you were temporarily absent?
Did you do any unpaid work in a family business?
Were you looking for work?
Were you available for work?
These questions operationalize the ILO definition of unemployment:
Without work
Currently available for work
Actively seeking work
Quality Considerations
Strengths:
Large sample, precise national estimates
Consistent methodology over decades
Professional interviewers, quality control
Monthly frequency captures dynamics
Limitations:
Proxy responses (household member answers for others)
Recall period (previous week may not be typical)
Definition sensitivity (marginal employment status)
Nonresponse adjustment may mask changes
Measurement Issues
The measured unemployment rate is sensitive to:
Reference period: "Last week" anchors responses but may not be typical
Classification rules: Someone who looked for work 5 weeks ago is "not in labor force," not "unemployed"
Proxy reporting: About half of responses are from another household member
Social desirability: Stigma may reduce unemployment reporting
These issues don't necessarily bias trends (they're consistent over time) but affect interpretation of levels.
Practical Guidance
When Surveys Are the Right Tool
| Research goal | Suitable? | Notes |
| --- | --- | --- |
| Population prevalence | Yes | Core survey strength |
| Attitudes/opinions | Yes | Only surveys measure subjective states |
| Sensitive behaviors | Maybe | Mode matters; consider alternatives |
| Historical behavior | Caution | Recall error increases with time |
| Administrative outcomes | Often no | Administrative data usually better |
| Small populations | Challenging | May need complete enumeration |
Common Pitfalls
Pitfall 1: Ignoring weights
Analyzing survey data without weights yields biased estimates if the sample design involves unequal probabilities or nonresponse adjustment.
How to avoid: Always use provided weights; use survey-aware software (svyset in Stata, survey package in R).
Pitfall 2: Treating convenience samples as representative
Amazon Mechanical Turk, social media polls, and opt-in panels are not probability samples.
How to avoid: Be explicit about sample limitations; don't claim population inference.
Pitfall 3: Ignoring design effects
Using formulas for simple random sampling when the design is clustered understates standard errors.
How to avoid: Use survey-aware procedures that account for stratification and clustering.
Pitfall 4: Over-interpreting small changes
Month-to-month changes in surveys are noisy; true changes may be smaller than observed.
How to avoid: Report confidence intervals; focus on trends rather than single-month changes.
Survey Quality Checklist
Qualitative Bridge
Surveys and Qualitative Methods
Surveys excel at measuring what and how many. They're less good at explaining why or capturing nuance. Qualitative methods complement surveys:
Before the survey (formative research):
Focus groups to understand how respondents think about topics
Cognitive interviews to test question comprehension
Ethnography to identify relevant constructs
After the survey (interpretive research):
In-depth interviews to explain survey patterns
Case studies to understand mechanisms behind correlations
Observation to validate self-reports
Example: Health Surveys
The National Health Interview Survey finds that 15% of adults report frequent anxiety.
Survey strengths: Prevalence estimate, demographic patterns, trends
Survey limitations: What does "frequent anxiety" mean to respondents? How does it affect daily life? What coping strategies do people use?
Qualitative complement: In-depth interviews with anxious individuals reveal the lived experience, coping mechanisms, and healthcare-seeking that explain survey patterns.
Integration Note
Connections to Other Methods
| Topic | Connection | Chapter |
| --- | --- | --- |
| Data Quality | Survey quality is a special case | Ch. 2 |
| RCTs | Survey outcomes in experiments | Ch. 10 |
| Selection | Survey nonresponse is selection | Ch. 11 |
| Heterogeneity | Subgroup estimates require appropriate design | Ch. 20 |
Triangulation Strategies
Survey estimates gain credibility when:
Administrative validation: Survey reports match administrative records
Cross-survey consistency: Different surveys yield similar estimates
Qualitative alignment: In-depth research explains survey findings
Trend plausibility: Changes are consistent with known events
Benchmark alignment: Totals match Census or other benchmarks
Summary
Key takeaways:
Probability sampling enables inference from sample to population with quantifiable uncertainty. Non-probability samples cannot support population claims.
Stratification increases precision; clustering decreases it. The design effect summarizes the net impact.
Question wording matters enormously. Small changes produce large differences in responses.
Nonresponse is the dominant quality concern in modern surveys. Low response rates don't necessarily cause bias, but differential nonresponse does.
The Total Survey Error framework organizes error sources. Multiple types of error can compound or offset.
Always use weights and survey-aware analysis when working with complex survey data.
Returning to the opening question: We learn about millions from thousands through probability sampling—random selection that gives every population member a known chance of inclusion. Combined with proper weighting and careful design, this enables valid inference. But surveys are fragile: question wording, nonresponse, mode effects, and coverage gaps all threaten validity. Sophisticated users understand these limitations and interpret accordingly.
Further Reading
Essential
Groves et al. (2009), Survey Methodology - The definitive textbook
Tourangeau, Rips, and Rasinski (2000), The Psychology of Survey Response - Cognitive foundations
For Deeper Understanding
Lohr (2010), Sampling: Design and Analysis - Technical sampling theory
Dillman, Smyth, and Christian (2014), Internet, Phone, Mail, and Mixed-Mode Surveys - Practical design guide
Biemer and Lyberg (2003), Introduction to Survey Quality - Error framework
Stantcheva (2023), "How to Run Surveys: A Guide to Creating Your Own Identifying Variation and Revealing the Invisible" - Annual Review of Economics, comprehensive guide for economists conducting original surveys
Historical/Methodological
Converse (1987), Survey Research in the United States - History of the field
Krosnick (1999), "Survey Research" - Annual Review overview
AAPOR (American Association for Public Opinion Research) - Standards and best practices
Special Topics
Diamond (2011), "Reference Guide on Survey Research" in Reference Manual on Scientific Evidence (National Academies Press) - Legal standards for survey evidence; essential reading for litigation applications
Bureau of Labor Statistics CPS documentation - Example of high-quality government survey
Pew Research Center methodology reports - Accessible quality discussions
Meyer, Mok, and Sullivan (2015), "Household Surveys in Crisis" - Quality trends
Exercises
Conceptual
Explain why stratification always (weakly) improves precision compared to SRS, while clustering always (weakly) reduces it. Under what conditions would stratification have no effect? Clustering have no effect?
A telephone survey has a 5% response rate. A critic argues the results are worthless. How would you respond? What additional information would you want?
"The question asked was: 'Don't you agree that government should do more to help the poor?' 75% agreed." What problems do you see with this question and interpretation?
Applied
Find documentation for a major survey (CPS, General Social Survey, Health and Retirement Study, etc.):
What is the target population?
What is the sampling design?
What is the response rate?
How are weights constructed?
Write a brief (1 page) quality assessment.
Design a brief survey (10 questions) to measure student satisfaction with their university. Apply the principles from this chapter. Have 3 people complete your survey using think-aloud protocols. What problems did you discover?
Discussion
Response rates have fallen from 70% to under 10% over the past few decades, yet many surveys continue to produce estimates that match external benchmarks. How is this possible? What does it tell us about the relationship between response rates and bias?
Technical Appendix
A. Variance Estimation for Complex Surveys
Linearization (Taylor series): For nonlinear statistics (ratios, regression coefficients), linearize and apply design-based variance formulas.
Replication methods:
Jackknife: Delete one PSU at a time; estimate variance from variation across replicates
Bootstrap: Resample PSUs with replacement; estimate variance from bootstrap distribution
Balanced repeated replication (BRR): Systematic replicate construction for stratified designs
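A bare-bones sketch of the delete-one-PSU (JK1) jackknife for a sample mean, with made-up clustered data and no stratification:

```python
import numpy as np
import pandas as pd

# Hypothetical clustered sample: outcome y measured within PSUs
df = pd.DataFrame({
    "psu": [1, 1, 1, 2, 2, 3, 3, 3, 4, 4],
    "y":   [5.0, 6.0, 5.5, 8.0, 7.5, 4.0, 4.5, 5.0, 7.0, 6.5],
})

psus = df["psu"].unique()
K = len(psus)

theta_hat = df["y"].mean()                   # full-sample estimate

# Re-estimate with each PSU deleted in turn
theta_k = np.array([df.loc[df["psu"] != p, "y"].mean() for p in psus])

# JK1-style variance: (K - 1)/K times the sum of squared deviations
var_jk = (K - 1) / K * np.sum((theta_k - theta_hat) ** 2)
print(f"estimate = {theta_hat:.3f}, jackknife SE = {np.sqrt(var_jk):.3f}")
```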
B. Optimal Allocation in Stratified Sampling
Neyman allocation: Allocate sample to strata proportional to stratum size times stratum SD:
$n_h \propto N_h S_h$
This minimizes variance for fixed total sample size. The intuition: sample more heavily from strata that are large (more weight in population mean) and variable (more uncertainty to reduce).
Cost-weighted allocation: When per-interview costs vary by stratum (e.g., face-to-face in rural areas vs. telephone in urban areas):
$n_h \propto \frac{N_h S_h}{\sqrt{c_h}}$
where $c_h$ is the per-interview cost in stratum $h$. This minimizes variance for fixed total budget. We sample less from expensive strata, but not proportionally less—the square root reflects the tradeoff between cost savings and precision loss.
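The two allocation rules, computed for the stratified worked example from Section 5.1 (the per-interview costs are invented, and with a fixed budget rather than a fixed n the total sample size would also change):

```python
import numpy as np

N_h = np.array([8_000, 2_000])       # stratum sizes: non-managers, managers
S_h = np.array([10_000, 25_000])     # stratum standard deviations
n = 400                              # total sample size

# Neyman allocation: n_h proportional to N_h * S_h
neyman = n * (N_h * S_h) / np.sum(N_h * S_h)
print("Neyman allocation:", np.round(neyman))          # ~[246, 154]

# Cost-weighted allocation: n_h proportional to N_h * S_h / sqrt(c_h)
c_h = np.array([50.0, 200.0])        # hypothetical per-interview costs
shares = N_h * S_h / np.sqrt(c_h)
print("Cost-weighted allocation:", np.round(n * shares / shares.sum()))
```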