Bailey (2023) proposes a new paradigm for polling based on an algebraic identity for the error in a sample mean discussed by Meng (2018). I too appreciated the clear exposition of survey error in Meng’s article, and relied heavily on insights in Meng (2018) and Rao (2021) when writing the new chapter on nonprobability sampling in the latest edition of my sampling textbook (Lohr, 2022, chap. 15). Although it has long been known that nonresponse bias for estimating a population mean depends on the correlation between survey participation and the characteristic being measured, Meng discussed the implications of that correlation in the era of big data. Meng provided a framework for addressing questions such as: “Which one should I trust more: a 1% survey with 60% response rate or a self-reported administrative dataset covering 80% of the population?” (Meng, 2018, p. 685). One of the important conclusions in Meng’s article is the tremendous value of information from probability samples relative to other types of samples, and he gave an example in which the mean squared error (MSE) from a convenience sample of half of a population of one billion people (with expected correlation 0.05) is larger than the MSE from a simple random sample (SRS) of size 400.
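To see the arithmetic behind Meng's comparison, the sketch below computes both mean squared errors under the simplifying assumption that the participation–outcome correlation is fixed at 0.05; Meng treats the correlation as random, so the numbers here are only illustrative.

```python
# A minimal sketch of Meng's (2018) MSE comparison, assuming a FIXED data defect
# correlation of rho = 0.05. Meng's example treats rho as random, so these
# numbers are illustrative only.
N = 1_000_000_000          # population size (one billion)
n_big = N // 2             # convenience sample covering half the population
f = n_big / N              # sampling fraction of the convenience sample
rho = 0.05                 # correlation between participation and y
sigma2 = 1.0               # population variance of y (units cancel in the comparison)

# Meng's identity implies MSE of the convenience-sample mean ~ E[rho^2]*(1-f)/f*sigma^2.
mse_convenience = rho**2 * (1 - f) / f * sigma2

# MSE of the mean of a simple random sample of size 400 (finite population correction
# is negligible here).
n_srs = 400
mse_srs = (1 - n_srs / N) * sigma2 / n_srs

print(f"MSE, convenience sample of {n_big:,}: {mse_convenience:.6f}")
print(f"MSE, SRS of {n_srs}:                  {mse_srs:.6f}")
# With rho fixed at 0.05 the two MSEs are essentially equal (about 0.0025*sigma^2).
# When rho is random with expected value 0.05, E[rho^2] exceeds 0.0025, so the
# convenience sample's MSE is larger than that of the SRS of 400, as Meng notes.
```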
Bailey (2023), arguing that today’s low poll response rates mean that “random sampling is dead,” proposes replacing inference based on probability sampling with efforts to measure the correlations between survey participation and specific outcome variables. Many of his conclusions and recommendations, however, rest on strong implicit assumptions about participation mechanisms. In this discussion I explore some of those assumptions and examine their implications for inference.
The discussion is organized as follows. I begin by reviewing Meng’s equation and defining random variables that may be used for statistical inference. I then discuss the special characteristics of probability samples and argue that there are benefits to selecting the initial sample using probability methods even if the response rate is low. After considering the role of auxiliary information for reducing bias and examining assumptions underlying randomized response instrument models, I conclude with some suggestions for moving forward.
Let $R_i = 1$ if unit $i$ of the population participates in the survey and $R_i = 0$ otherwise, and let $y_i$ denote the value of the outcome variable for unit $i$, for $i = 1, \ldots, N$. The error of the participants' mean $\bar{y}_S$ as an estimate of the population mean $\bar{y}_U$ can be written as

$$\bar{y}_S - \bar{y}_U \;=\; \frac{\sum_{i=1}^{N} R_i y_i}{\sum_{i=1}^{N} R_i} - \frac{1}{N}\sum_{i=1}^{N} y_i \qquad (1)$$

$$\;=\; \frac{\frac{1}{N}\sum_{i=1}^{N} (R_i - \bar{R}_U)(y_i - \bar{y}_U)}{\bar{R}_U} \qquad (2)$$

$$\;=\; \rho_{R,y}\, \sqrt{\frac{1 - \bar{R}_U}{\bar{R}_U}}\; S_y, \qquad (3)$$

where $\bar{R}_U = \frac{1}{N}\sum_{i=1}^{N} R_i$ is the participation rate, $S_R^2 = \frac{1}{N}\sum_{i=1}^{N}(R_i - \bar{R}_U)^2$ and $S_y^2 = \frac{1}{N}\sum_{i=1}^{N}(y_i - \bar{y}_U)^2$ denote the population variances of the participation indicators and of $y$, and $\rho_{R,y} = \frac{1}{N S_R S_y}\sum_{i=1}^{N}(R_i - \bar{R}_U)(y_i - \bar{y}_U)$ is the population correlation between participation and $y$.
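A quick way to see that Equations (1)–(3) form an exact identity for any realized set of participation indicators is to evaluate both sides on a synthetic population; the population values and the participation mechanism in this sketch are invented purely for illustration.

```python
# Numerical check of the identity in Equations (1)-(3) on a synthetic population.
import numpy as np

rng = np.random.default_rng(1)
N = 100_000
y = rng.normal(50, 10, size=N)
# Participation probability increases with y, inducing selection bias.
prob = 0.4 / (1 + np.exp(-(y - 50) / 10))
R = rng.binomial(1, prob)

ybar_S = y[R == 1].mean()               # mean of the observed sample
ybar_U = y.mean()                       # population mean
Rbar_U = R.mean()                       # realized participation rate
S_y = y.std()                           # population SD of y (divisor N)
rho = np.corrcoef(R, y)[0, 1]           # correlation of R and y

lhs = ybar_S - ybar_U
rhs = rho * np.sqrt((1 - Rbar_U) / Rbar_U) * S_y
print(lhs, rhs)                         # the two sides agree up to floating-point error
```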
The formulation of survey error in Equation (3), involving the correlation between the outcome variable of interest and the response mechanism, has been known for a long time. Hartley and Ross (1954) used a similar algebraic identity when deriving the bias of the ratio estimator, and numerous authors have noted that nonresponse bias depends on the correlation between response indicators or propensities and $y$ (see, for example, Platek, 1980; Bethlehem, 1988; Särndal & Lundström, 2005; Brick, 2013; Haziza & Lesage, 2016).
Equation (3) is an algebraic identity for the difference between the mean of the particular sample collected (the population members with $R_i = 1$) and the mean of the full population. The identity holds no matter how the sample was obtained; what differs across approaches to inference is which quantities in the identity are treated as random.
In model-based inference, the values $y_1, \ldots, y_N$ are treated as realizations of random variables generated from a statistical model, and the properties of estimators are derived under that model.
In design-based sampling theory—perhaps we should call it participation-based sampling theory in this context since not all samples are designed—the quantities $y_1, \ldots, y_N$ are fixed constants, and the participation indicators $R_1, \ldots, R_N$ are the random variables whose joint distribution determines the statistical properties of estimators.
For a sample with some degree of self-selection, whether a convenience sample of volunteers or a probability sample with nonresponse, the values of $R_1, \ldots, R_N$ are determined at least in part by the population members themselves, and their joint distribution is unknown.
Many of the conclusions in Bailey (2023) appear to be predicated on an unstated assumption that the distribution of the random variables $R_1, \ldots, R_N$ can be described by a parametric selection model in which latent participation propensities are correlated with the outcome variable.
There is no reason to believe, however, that the distribution of the latent variables posited by such a model describes how members of the population actually decide whether to participate, and that distribution cannot be verified from the collected data.
In a probability sample with full response, participation is described by $R_i = Z_i$, where $Z_i$ is the indicator that unit $i$ is selected for the sample; the joint distribution of $Z_1, \ldots, Z_N$ is known because it is chosen by the sampler.
The development of probability sampling theory was a tremendous breakthrough for the age-old problem of how to generalize from a sample we have seen to population members we have not seen. Probability sampling allows this generalization—along with an assessment of the accuracy of estimates—because the probability distribution of the participation indicator is fully under the control of the sampler. The pioneers of probability sampling, however, were well aware that the validity of inferences under probability sampling theory depends on having full response. For example, Deming (1950, p. 35) wrote: “A sample is no longer a probability sample if it is ruined by nonresponse or any other difficulty of execution,” and argued that even a nonresponse rate of 5% could seriously affect results.
When there is nonresponse within a probability sample, the participation indicator can be written as $R_i = Z_i D_i$, where $D_i = 1$ if unit $i$ would respond when selected and $D_i = 0$ otherwise. Writing $\phi_i = P(D_i = 1)$ for the response propensity of unit $i$ and $\bar{\phi}_U$ for the average propensity in the population, the bias of the respondent mean is approximately

$$E[\bar{y}_R] - \bar{y}_U \;\approx\; \frac{\frac{1}{N}\sum_{i=1}^{N} (\phi_i - \bar{\phi}_U)(y_i - \bar{y}_U)}{\bar{\phi}_U}. \qquad (4)$$
Equation (4) has approximately the same form as the expected value of Equation (2), with the response propensities $\phi_i$ playing the role of the participation indicators $R_i$. The bias of the respondent mean is therefore small when the response propensities are nearly uncorrelated with $y$, and it grows with the variability of the propensities and their correlation with the outcome.
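The following simulation sketch illustrates the approximation in Equation (4) for an SRS with nonresponse; the propensity model and population are invented for illustration, and the quality of the approximation depends on the scenario.

```python
# Simulation sketch of Equation (4): the bias of the respondent mean from a
# probability sample with nonresponse is roughly the population covariance of
# the response propensities and y, divided by the mean propensity.
import numpy as np

rng = np.random.default_rng(7)
N = 200_000
y = rng.normal(0, 1, size=N)
phi = 1 / (1 + np.exp(-(0.5 * y - 1)))        # response propensities, higher for large y

approx_bias = np.mean((phi - phi.mean()) * (y - y.mean())) / phi.mean()

n, reps = 2_000, 500
biases = []
for _ in range(reps):
    sampled = rng.choice(N, size=n, replace=False)     # SRS from the population
    responded = rng.random(n) < phi[sampled]           # nonresponse within the sample
    biases.append(y[sampled][responded].mean() - y.mean())

print(f"Approximation from Equation (4): {approx_bias:.3f}")
print(f"Average simulated bias:          {np.mean(biases):.3f}")
```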
Consider two survey designs. The first sends a broadcast invitation to everyone on the email list (with participation indicator $R_i = D_i$), so that anyone who wishes may respond. The second selects an SRS from the list and invites only the selected persons to participate, so that $R_i = Z_i D_i$. If the recruitment protocol, and hence the distribution of the $D_i$, is the same under both designs, the expected bias in Equation (4) is approximately the same for the two designs.
The bias will be reduced for the SRS if the pollster modifies the recruitment procedure to obtain a lower expected value of the correlation between response and $y$, or a higher response rate.
Probability sampling by itself does not cure the problem of nonresponse bias, but it allows the pollster to concentrate resources on obtaining high-quality data from a selected sample instead of spreading those efforts among the whole population. Lohr (2022) argued that starting with a probability sample has several advantages even when response rates are low:
The sampling frame for a probability sample is well defined, and many frames used in practice have high coverage. When coverage is incomplete by design (for example, when the frame excludes persons in institutions) the probability sampler can limit inference to the frame population. Many nonprobability samples lack a sampling frame, which makes it challenging to assess coverage.
The sampler can devote more resources toward persuading members of the selected sample to participate. This changes the distribution of the response indicators $D_i$ and, if the follow-up efforts reach the types of people who would otherwise be missed, reduces the correlation between response and the outcome variables.
It is more difficult for a malevolent actor to influence a probability sample. A population member can participate at most once in a probability survey, thus preventing situations in which one person or machine clicks the ‘take this survey’ button 10,000 times. Even when pollsters have methods for preventing or detecting multiple responses, a nonprobability sample might be manipulable. For example, a malevolent actor knowledgeable about survey weighting could arrange for a large group of people with a variety of claimed demographic characteristics to sign up for an online opt-in poll where participants are recruited by broadcast advertisement. None of the auxiliary information that would be available for weighting or nonresponse modeling could compensate for the bias that results when all of these demographically diverse poll participants report opinion X.
A would-be malevolent actor cannot arrange for people with opinion X to flood a probability sample because the initial selection is random.
The probability sample often has more information available that can be used for weighting, imputation, or other types of nonresponse modeling. For example, a sample drawn from a list of registered voters may have information on age, political party, and past voting behavior that can be used to study and adjust for nonresponse bias.
If auxiliary variables are known for every unit in the selected sample, whether or not the unit responds, the pollster can adjust for nonresponse by weighting. A propensity-weighted estimator of the population mean has the form

$$\hat{\bar{y}}_w \;=\; \frac{\sum_{i=1}^{N} Z_i D_i\, w_i\, y_i}{\sum_{i=1}^{N} Z_i D_i\, w_i},$$

where $w_i = 1/(\pi_i \hat{\phi}_i)$, $\pi_i$ is the probability that unit $i$ is selected for the sample, and $\hat{\phi}_i$ is an estimate of the response propensity $\phi_i$ obtained from a model relating the response indicators to the auxiliary variables. If response depends only on the auxiliary variables and the propensity model is correctly specified, the weighted estimator is approximately unbiased.
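As a concrete illustration of this kind of adjustment, the sketch below weights respondents by the inverse of an estimated response propensity in a setting where response truly depends only on a single auxiliary variable; the data-generating model and the use of scikit-learn's logistic regression are my own choices for illustration.

```python
# Sketch of propensity-weighted nonresponse adjustment for an SRS, assuming a
# single auxiliary variable x is known for respondents and nonrespondents.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
n = 5_000
x = rng.normal(size=n)                          # auxiliary variable, known for all sampled units
y = 2 + x + rng.normal(size=n)                  # outcome, observed only for respondents
respond = rng.random(n) < 1 / (1 + np.exp(-x))  # response depends on x (MAR given x)

# Estimate response propensities from the auxiliary information.
model = LogisticRegression().fit(x.reshape(-1, 1), respond)
phi_hat = model.predict_proba(x.reshape(-1, 1))[:, 1]

w = 1 / phi_hat[respond]                        # nonresponse-adjusted weights (equal base weights for SRS)
unweighted = y[respond].mean()
weighted = np.sum(w * y[respond]) / np.sum(w)
print(f"Mean of y over all sampled units: {y.mean():.3f}")
print(f"Respondent mean (unweighted):     {unweighted:.3f}")
print(f"Propensity-weighted estimate:     {weighted:.3f}")   # close to the full-sample mean
```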
Weighting is a powerful tool for reducing bias, and high-quality auxiliary information can improve estimates even from surveys with a high degree of selection bias. For example, Bailey (2023) cites the 1936 Literary Digest survey as an example of a sampling “fiasco” and indeed the unweighted estimates were poor. Fifty-four percent of the Digest’s sample of nearly 2.4 million respondents said they were supporting Landon for president, but Roosevelt ended up winning the election with more than 60% of the popular vote. If, however, the Digest editors had used available auxiliary information to weight the data, the weighting would have partially corrected for the overrepresentation of Republican voters in the sample and led the editors to predict that Roosevelt would win the election (Lohr & Brick, 2017).
As Bailey (2023) mentions, however, most inferences based on weighted estimates rely on an assumption that the response mechanism is missing at random (MAR) and that the weighting model removes the nonresponse bias. If the MAR assumption is not true, then confidence intervals based on formulas from probability sampling no longer have the stated coverage probability.
The problem is that one cannot assess the MAR assumption from the sample itself—one needs external information such as independent estimates of the means of outcome variables. For example, Mercer et al. (2018), comparing weighted estimates from online opt-in surveys to benchmarks from high-quality federal surveys, found that even the most effective weighting adjustments were able to remove only about 30% of the bias from the unweighted estimates, and thus the response mechanisms for those surveys cannot be considered to be MAR. They concluded that the quality of the auxiliary information mattered more than the particular statistical method for using such information, and advocated obtaining a richer set of auxiliary variables that go beyond core demographics.
Because weighted estimates can still be biased, there is a large body of research exploring not-missing-at-random (NMAR) models in which the response mechanism depends on the values of unobserved data as well as on observed data. The statistical properties for estimators calculated from these models depend on the validity of assumptions made about the nonparticipants—theorems often have the form ‘If assumptions A, B, and C hold, then our proposed estimate is approximately unbiased with approximate variance given in Equation D.’
NMAR models are useful for exploring potential nonresponse mechanisms and are an important part of nonresponse bias analyses. Users can assess whether the explicitly stated assumptions of a particular NMAR model apply to their own circumstances. But, as Molenberghs et al. (2008) pointed out, “the correctness of the alternative model can only be verified in as far as it fits the observed data” (p. 372, emphasis in original). Molenberghs et al. (2008) showed that every NMAR model has a MAR counterpart that fits the observed data equally well. MAR and NMAR models with equivalent fits may give different estimates of $\bar{y}_U$, and the observed data cannot tell us which estimate is closer to the truth.
Bailey (2023) suggests collecting additional auxiliary information for modeling nonresponse through use of ‘randomized response instruments’ in which persons in the selected sample are randomly assigned to different survey protocols that are designed to have different response rates. Many pollsters use randomized experiments to test questions and explore methods for improving response rates, but Bailey (2023) claims that assigning some of the selected sample members to a protocol that is known to achieve a low response rate can “allow us to assess whether the response mechanism is ignorable or not” and “create population estimates that [are] purged of the malign effects of $\rho$.”
This sounds like it is too good to be true, and I believe it is. Every method of nonresponse adjustment or bias estimation requires assumptions about the nonrespondents. In this case, the very strong assumptions required for randomized response instrument methods are implicit in the model identifiability conditions.
Consider the fictional data in Table 1 from an experiment with two sampling protocols. An SRS of size 10,000 is selected from the population. Sample members who are randomly assigned to the group with $T_i = 1$ are recruited using protocol 1, and those assigned to the group with $T_i = 0$ are recruited using protocol 0. The outcome variable $y$ is binary, and Table 1 shows, for each protocol, the numbers of nonrespondents and of respondents with $y = 0$ and $y = 1$, along with the mean of $y$ among the respondents.
Table 1. Fictional data from an experiment with two sampling protocols.

| Protocol | Sample size | Nonrespondents | Respondents with $y = 0$ | Respondents with $y = 1$ | Respondent mean |
|---|---|---|---|---|---|
| 0 | 5,000 | 4,000 | 350 | 650 | 0.65 |
| 1 | 5,000 | 3,000 | 1,000 | 1,000 | 0.50 |
| All | 10,000 | 7,000 | 1,350 | 1,650 | 0.55 |
If there were no nonresponse bias, we would expect the means of the respondents from the two protocols in Table 1 to be approximately equal because the sample members were randomly assigned. Thus, Table 1 clearly exhibits a problem with nonresponse bias in one or both protocols.
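As a rough check on this claim, the difference between the two respondent means in Table 1 can be compared with the variability expected under random assignment alone; the calculation below is a simple two-proportion standard error, not part of Bailey's or Sun et al.'s method.

```python
# Quick check that the gap between the protocols' respondent means (0.65 vs. 0.50)
# is far larger than random assignment alone could explain.
p0, n0 = 650 / 1000, 1000       # protocol 0 respondents (Table 1)
p1, n1 = 1000 / 2000, 2000      # protocol 1 respondents (Table 1)
se = (p0 * (1 - p0) / n0 + p1 * (1 - p1) / n1) ** 0.5
print(f"Difference in respondent means: {p0 - p1:.2f}, standard error: {se:.3f}")
# The difference of 0.15 is roughly eight standard errors, so the respondent
# means differ by more than chance; at least one protocol yields biased respondents.
```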
Figure 1 depicts three of the many possible relationships between response probabilities and $y$ under the two protocols.
The saturated logistic regression model relating the response indicator to $y$ and the protocol indicator $T$ is

$$\operatorname{logit} P(R_i = 1 \mid y_i, T_i) \;=\; \beta_0 + \beta_1 y_i + \beta_2 T_i + \beta_3 y_i T_i,$$

but Sun et al. (2018) showed in their Example 1 that this saturated model is not identifiable and therefore cannot be fit. Thus, an assumption must be made to reduce the number of parameters. Sun et al. (2018) fixed $\beta_3 = 0$, so that the reduced model

$$\operatorname{logit} P(R_i = 1 \mid y_i, T_i) \;=\; \beta_0 + \beta_1 y_i + \beta_2 T_i$$

is identifiable.
When the reduced model is fit to the data in Table 1, it attributes the entire difference between the two protocols' respondent means to selection on $y$, and the resulting estimate of the proportion of the population with $y = 1$ is far below the respondent mean from either protocol.
The assumption that $\beta_3 = 0$—that switching protocols changes the odds of responding by the same factor for persons with $y = 1$ as for persons with $y = 0$—is what makes the model identifiable, and it cannot be checked from the observed data.
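To make this concrete, the sketch below fits the reduced (no-interaction) model to the counts in Table 1 by maximum likelihood; the parameterization and code are my own illustration rather than the method as implemented by Bailey (2023) or Sun et al. (2018).

```python
# Fit the identifiable (no-interaction) response model to the Table 1 counts by
# maximum likelihood.
import numpy as np
from scipy.optimize import minimize

# counts: for each protocol, (respondents with y=0, respondents with y=1, nonrespondents)
counts = {0: (350, 650, 4000),    # protocol 0 (Table 1)
          1: (1000, 1000, 3000)}  # protocol 1 (Table 1)

def expit(t):
    return 1 / (1 + np.exp(-t))

def negloglik(theta):
    logit_p, b0, b1, b2 = theta
    p = expit(logit_p)                         # P(y = 1) in the population
    ll = 0.0
    for t, (r0, r1, nr) in counts.items():
        pi0 = expit(b0 + b2 * t)               # P(respond | y = 0, protocol t)
        pi1 = expit(b0 + b1 + b2 * t)          # P(respond | y = 1, protocol t)
        ll += r0 * np.log((1 - p) * pi0) + r1 * np.log(p * pi1)
        ll += nr * np.log(1 - p * pi1 - (1 - p) * pi0)
    return -ll

fit = minimize(negloglik, x0=np.zeros(4), method="Nelder-Mead",
               options={"xatol": 1e-8, "fatol": 1e-8, "maxiter": 20000})
print(f"Estimated P(y = 1): {expit(fit.x[0]):.3f}")
```

Because the observed counts supply exactly as many degrees of freedom as the model has parameters, the fit reproduces Table 1 exactly, and the estimated proportion with $y = 1$ comes out near 0.25, far below the respondent means of 0.65 and 0.50—illustrating how heavily the answer leans on the unverifiable no-interaction assumption.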
The protocols for many probability surveys have been refined through years of research and experimentation, and often nonresponse follow-up efforts are designed to raise the response rates for subpopulations that are underrepresented after initial contact efforts. If a well-researched protocol 1 has a moderately high response rate, and protocol 0 modifies protocol 1 so as to reduce the response rate (for example, by skipping nonresponse follow-up), I think assuming a structure similar to Figure 1(a), where protocol 1 results in MAR data, is more reasonable than assuming a structure such as (c) that forces the interaction to be zero. And if the structure in (a) holds, then one should take the entire sample using protocol 1 because estimates from that protocol are unbiased. Why would a sampler want to allocate half of the sample to an inferior protocol with a lower response rate and then adjust the weights for the superior protocol according to the results from the inferior protocol?
In this example, I assumed no other auxiliary information was available. If other information is available, the weighting adjustments from the randomized response instrument method are likely to be less extreme because some of the difference between the protocols will be explained by the other auxiliary variables. But anyone using this method must still assume that, after conditioning on other covariates, the response instrument does not interact with $y$ in the model for the response propensities.
In general, different sampling protocols are likely to have different relationships between response propensities and $y$, and the no-interaction assumption required by the randomized response instrument method cannot be verified from the data collected under either protocol.
The current paradigm used by pollsters relies on strong assumptions that missing data are MAR and nonresponse weighting removes the bias from estimates. Assumptions for the new paradigm are not stated in Bailey (2023), but they appear to be equally strong if not stronger. External information about the nonrespondents is needed to assess the validity of the assumptions in either paradigm for a particular survey.
Equations (1)–(3) are algebraically equivalent, and thus the information that can be learned from each expression for a particular survey is identical. Nevertheless, looking at the expression for survey error in multiple ways can provide additional insights for survey planning, and many statisticians have studied nonresponse bias through examining the correlation of response propensities and outcome variables. Bethlehem (1988), for example, used the expected value of Equation (2) to provide guidance for constructing poststrata designed to minimize nonresponse bias, and his proposed methods are now standard practice.
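For readers who want a concrete picture of the poststratification idea, the sketch below weights respondent class means by population class shares; the classes, counts, and means are invented for illustration, and the adjustment removes nonresponse bias only if response is unrelated to $y$ within classes.

```python
# Minimal post-stratification sketch in the spirit of Bethlehem (1988):
# respondents are weighted so that each class's weight total matches its
# population count.
pop_counts = {"18-34": 30_000, "35-64": 50_000, "65+": 20_000}     # population class sizes
resp_counts = {"18-34": 120, "35-64": 400, "65+": 280}             # respondents by class
resp_means = {"18-34": 0.62, "35-64": 0.55, "65+": 0.41}           # respondent means by class

N = sum(pop_counts.values())
# Post-stratified estimate: population-share-weighted average of class respondent means.
post_stratified = sum(pop_counts[c] / N * resp_means[c] for c in pop_counts)
unweighted = sum(resp_counts[c] * resp_means[c] for c in resp_counts) / sum(resp_counts.values())
print(f"Unweighted respondent mean: {unweighted:.3f}")
print(f"Post-stratified estimate:   {post_stratified:.3f}")
```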
Modern-day polls are not representative in the sense defined by Neyman (1934). Confidence intervals calculated under the assumption of unbiasedness have too-low coverage probability when there is nonresponse bias. These problems are well known, and there is no magic cure for low response rates. To date, full-response probability sampling is the only method guaranteed to produce unbiased estimates and accurate margins of error.
There are, however, a number of steps that pollsters can take to improve methodology and acknowledge the limitations of estimates. Transparency is paramount, and every poll should be accompanied by a full methodology report that describes how the sample was recruited, how concepts were measured, and how estimates were calculated. This report should include breakdowns of the response rate and a discussion of how nonresponse might affect estimates for the full population and for subgroups.
Montaquila and Olson (2012) provided numerous practical tools for conducting nonresponse bias analyses, including fitting statistical models to look at the relationship between response propensities and key survey and auxiliary variables.
Pollsters can take a cue from the language of mathematics and state (in lay language, of course) the conditions under which the estimates are approximately unbiased and the margin of error describes the uncertainty. Some poll reports (see, for example, Lopes et al., 2023) refer to ‘margin of sampling error’ and emphasize that other sources of error may also affect the estimates, and I think this is a much better practice than simply stating a margin of error without explaining that it measures only one type of uncertainty.
Finally, pollsters can be guided by solid mathematical and empirical research on methods for estimation and reducing nonresponse bias. There is a huge amount of excellent work, including recent research on the nature of statistical information and bias (Meng, 2018), statistical properties of estimates from nonprobability samples (Rao, 2021; Wu, 2022), empirical investigations comparing estimates from different sampling protocols and with different types of weighting variables (Dutwin & Buskirk, 2017; Mercer et al., 2018), investigation of new variables that can be used for weighting (Peytchev et al., 2018), and investigations into respondents who provide deliberately misleading answers to polls (Kennedy et al., 2020). Couper (2017) described recent methodological and technological advances that can improve survey quality and Jamieson et al. (2023) provided additional suggestions for protecting the integrity of survey research. While surveys face a number of major challenges, I am optimistic that the survey research community is up to the task.
I would like to thank Professor Meng for inviting me to provide this discussion, and I am grateful to J.N.K. Rao and Mike Brick for many helpful discussions on inferential issues related to big data, nonprobability samples, and data integration.
Sharon L. Lohr has no financial or non-financial disclosures to share for this article.
Bailey, M. A. (2023). A new paradigm for polling. Harvard Data Science Review, 5(3). https://doi.org/10.1162/99608f92.9898eede
Bethlehem, J. G. (1988). Reduction of nonresponse bias through regression estimation. Journal of Official Statistics, 4(3), 251–260.
Brick, J. M. (2013). Unit nonresponse and weighting adjustments: A critical review. Journal of Official Statistics, 29(3), 329–353. https://doi.org/10.2478/jos-2013-0026
Couper, M. P. (2017). New developments in survey data collection. Annual Review of Sociology, 43, 121–145. https://doi.org/10.1146/annurev-soc-060116-053613
Deming, W. E. (1950). Some theory of sampling. John Wiley & Sons.
Dutwin, D., & Buskirk, T. D. (2017). Apples to oranges or Gala versus Golden Delicious? Comparing data quality of nonprobability internet samples to low response rate probability samples. Public Opinion Quarterly, 81(S1), 213–239. https://doi.org/10.1093/poq/nfw061
Hartley, H., & Ross, A. (1954). Unbiased ratio estimators. Nature, 174(4423), 270–271. https://doi.org/10.1038/174270a0
Haziza, D., & Lesage, É. (2016). A discussion of weighting procedures for unit nonresponse. Journal of Official Statistics, 32(1), 129–145. https://doi.org/10.1515/jos-2016-0006
Jamieson, K. H., Lupia, A., Amaya, A., Brady, H. E., Bautista, R., Clinton, J. D., Dever, J. A., Dutwin, D., Goroff, D. L., Hillygus, D. S., Kennedy, C., Langer, G., Lapinski, J. S., Link, M., Philpot, T., Prewitt, K., Rivers, D., Vavreck, L., Wilson, D. C., & McNutt, M. K. (2023). Protecting the integrity of survey research. PNAS Nexus, 2(3), Article pgad049. https://doi.org/10.1093/pnasnexus/pgad049
Kennedy, C., Hatley, N., Lau, A., Mercer, A., Keeter, S., Ferno, J., & Asare-Marfo, D. (2020). Assessing the risks to online polls from bogus respondents. Pew Research. https://www.pewresearch.org/methods/wp-content/uploads/sites/10/2020/02/PM_02.18.20_dataquality_FULL.REPORT.pdf
Lohr, S. L. (2022). Sampling: Design and analysis (3rd ed.). CRC Press.
Lohr, S. L., & Brick, J. M. (2017). Roosevelt predicted to win: Revisiting the 1936 Literary Digest poll. Statistics, Politics and Policy, 8(1), 65–84. https://doi.org/10.1515/spp-2016-0006
Lopes, L., Kearney, A., Washington, I., Valdes, I., Yilma, H., & Hamel, L. (2023, August). KFF health misinformation tracking poll pilot. KFF. https://www.kff.org/coronavirus-covid-19/poll-finding/kff-health-misinformation-tracking-poll-pilot/
Meng, X.-L. (2018). Statistical paradises and paradoxes in big data (I): Law of large populations, big data paradox, and the 2016 US presidential election. The Annals of Applied Statistics, 12(2), 685–726. https://doi.org/10.1214/18-AOAS1161SF
Mercer, A., Lau, A., & Kennedy, C. (2018). For weighting online opt-in samples, what matters most? Pew Research. https://www.pewresearch.org/methods/2018/01/26/for-weighting-online-opt-in-samples-what-matters-most/
Molenberghs, G., Beunckens, C., Sotto, C., & Kenward, M. G. (2008). Every missingness not at random model has a missingness at random counterpart with equal fit. Journal of the Royal Statistical Society Series B: Statistical Methodology, 70(2), 371–388. https://doi.org/10.1111/j.1467-9868.2007.00640.x
Montaquila, J., & Olson, K. M. (2012). Practical tools for nonresponse bias studies [Webinar]. Survey Research Methods Section of the American Statistical Association; American Association of Public Opinion Research. https://community.amstat.org/surveyresearchmethodssection/programs/new-item2
Neyman, J. (1934). On the two different aspects of the representative method: The method of stratified sampling and the method of purposive selection. Journal of the Royal Statistical Society, 97(4), 558–625. https://doi.org/10.2307/2342192
Peytchev, A., Presser, S., & Zhang, M. (2018). Improving traditional nonresponse bias adjustments: Combining statistical properties with social theory. Journal of Survey Statistics and Methodology, 6(4), 491–515. https://doi.org/10.1093/jssam/smx035
Platek, R. (1980). Causes of incomplete data, adjustments and effects. Survey Methodology, 6(2), 93–132. https://www150.statcan.gc.ca/n1/en/pub/12-001-x/1980002/article/54945-eng.pdf?st=k9d-jaTD
Rao, J. N. K. (2021). On making valid inferences by integrating data from surveys and other sources. Sankhyā Series B, 83(1), 242–272. https://doi.org/10.1007/s13571-020-00227-w
Särndal, C.-E., & Lundström, S. (2005). Estimation in surveys with nonresponse. John Wiley & Sons.
Sun, B., Liu, L., Miao, W., Wirth, K., Robins, J., & Tchetgen Tchetgen, E. J. (2018). Semiparametric estimation with data missing not at random using an instrumental variable. Statistica Sinica, 28(4), 1965–1983. https://doi.org/10.5705/ss.202016.0324
Wu, C. (2022). Statistical inference with non-probability survey samples (with discussion). Survey Methodology, 48(2), 283–373. https://www150.statcan.gc.ca/n1/pub/12-001-x/2022002/article/00002-eng.pdf
©2023 Sharon L. Lohr. This article is licensed under a Creative Commons Attribution (CC BY 4.0) International license, except where otherwise indicated with respect to particular material included in the article.