Data quality in sample surveys, and opinion polls in particular, is an increasing concern, given sampling-frame deficiencies, rising nonresponse rates, and perceived failures in election polls. This concern is evidenced by the recent report from the American Association for Public Opinion Research (AAPOR) on data quality metrics for online samples (AAPOR, 2022).
Bailey’s (2023) statement that “random sampling is, for all practical purposes, dead” might be defensible for opinion polling, but not for the field of survey sampling as a whole. In areas such as auditing, probability sampling is still practical; in other settings, government statistical agencies like the U.S. Census Bureau and the National Center for Health Statistics, and survey research organizations like Westat, Research Triangle Institute, and the Institute for Social Research at Michigan, strive to conduct high-quality probability surveys. The ideal of random sampling remains a very important concept even when it is not attainable in practice.
Bailey argues that Meng’s (2018) discussion of the role of probability sampling in the era of ‘big data,’ and in particular Meng’s “Law of Large Populations,” heralds a paradigm shift in the assessment of polling data collected by nonrandom sampling methods. Meng has provided useful illustrations of the trade-off between bias and variance in survey data, emphasizing the role of probability sampling in the era of big data. See, for example, Bradley et al. (2021) and Meng’s (2016) discussion of Keiding and Louis (2016), a wide-ranging debate of the pros and cons of probability sampling in the context of epidemiologic studies. However, I have reservations about his proposed Law of Large Populations (Meng, 2018), as represented by what Bailey calls “Meng’s equation”:
I argue that
Meng is a friend who is aware of my reservations, and I appreciate his generous invitation to write about them in this commentary.
The left side of Equation 1,
where E(b) is the expected bias and var is the variance of the sample estimate. In a simple superpopulation model where units are assumed exchangeable and unit outcomes modeled as independent, a particular form of the RMSE is
where (1 – f) is the finite population correction.
A characteristic of this formulation is that bias from nonrandom selection, unlike precision, remains relatively constant as a function of sample size n. One might argue that bias actually increases with n, because collection of a larger sample is harder to control and more subject to measurement error; but measurement error issues are not a part of Meng’s (2018) discussion. As n increases, the relative contribution of precision (as measured by the variance) to the RMSE decreases, and bias increasingly dominates. Thus, for ‘big data,’ it is bias, not precision, that is the key issue. Meng has illustrated the impact of even small amounts of bias on the RMSE of sample estimates.
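The shift from variance to bias can be illustrated numerically. The sketch below assumes the simple form RMSE = sqrt(b^2 + (1 - f) * sigma^2 / n), with a bias b that does not change with n; the function names and the values b = 0.01 and sigma = 0.5 are hypothetical and chosen only for illustration.

```python
import math

def rmse(bias, sigma, n, N):
    """RMSE of a sample mean under the assumed form sqrt(b^2 + (1 - f) sigma^2 / n)."""
    f = n / N                           # sampling fraction
    variance = (1 - f) * sigma**2 / n   # variance with finite population correction
    return math.sqrt(bias**2 + variance)

def bias_share(bias, sigma, n, N):
    """Fraction of the mean squared error contributed by squared bias."""
    return bias**2 / rmse(bias, sigma, n, N) ** 2

N = 100_000_000                         # hypothetical population size
for n in (100, 10_000, 1_000_000):
    # Bias is held fixed at 0.01 while the sample size grows.
    print(n, rmse(0.01, 0.5, n, N), bias_share(0.01, 0.5, n, N))
```

With these inputs the squared bias is a small part of the mean squared error at n = 100 and almost all of it at n = 1,000,000, even though the bias itself never changes.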
Provided f is small, it is the sample size n, not the population size N, that controls both the RMSE and the relative contribution of bias and variance. For example, a random sample of n = 1,000 yields the same precision whether the population size is 100,000 or 100 million; for nonrandom forms of sampling, if
Equations 1 and 4 are both valid expressions, so the underlying mathematics is not an issue. The data defect correlation is a measure of selection bias, but I do not think it is very easy to interpret; the correlation usually measures association between continuous variables, and is not a natural measure of association when one of the variables, R, is binary. The fact that it is a dimensionless quantity makes it potentially transportable across studies (but see the next section for a counterargument); it is less useful for assessing bias in a particular substantive setting. For example, I do not have a particularly strong intuitive notion of the difference between a correlation of 0.01 and 0.05. On the other hand, the more basic difference in means
The ability to address selection bias is limited without auxiliary information. Addressing selection bias more generally requires a superpopulation model for the joint distribution of R and Y, given auxiliary covariates X known for the whole population. Assuming for simplicity independence over population units i, two main approaches for formulating these models can be distinguished. Selection models factor the joint distribution of
where the first factor characterizes the distribution of
where the first distribution characterizes the distribution of
The data defect correlation
Suppose that for unit i,
where BERN denotes the Bernoulli distribution,
The following pattern-mixture model can be viewed as a natural generalization of the expression in Equation 2:
An alternative to (8) is the following pattern-mixture model:
The difference in means for selected and unselected cases, namely
Little and Rubin (2019, chap. 15) argue that the pattern-mixture model (Example 2) is easier to interpret than the selection model (Example 1). In particular, in (9), the selection effect is characterized by
Other useful features of the pattern-mixture model are that it is often much easier to fit than the probit selection model, given assumptions to render the parameters estimable, and that it can include covariates that are only available in aggregate data form. Also, imputations of the missing values are based on the predictive distribution of Y given X and R = 0, which is modeled directly in the pattern-mixture factorization. For sensitivity analysis for nonignorable selection, pattern-mixture models are easier to implement and (in my view) easier to interpret for nonstatisticians.
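A sensitivity analysis of this kind can be sketched in the simplest setting, without covariates. The code below assumes that the mean of the unselected cases differs from the mean of the selected cases by delta selected-sample standard deviations, and combines the two patterns with weights f and 1 - f. The function name, the parameter delta, and the data values are hypothetical; delta is not estimable from the selected sample and is varied by the analyst.

```python
import statistics

def pm_mean(y_selected, f, delta):
    """Population-mean estimate under a simple pattern-mixture assumption.

    Assumes mean(unselected) = mean(selected) + delta * sd(selected);
    delta = 0 corresponds to ignorable selection.
    """
    mean_sel = statistics.fmean(y_selected)
    sd_sel = statistics.stdev(y_selected)
    mean_unsel = mean_sel + delta * sd_sel       # assumed, not estimated
    return f * mean_sel + (1 - f) * mean_unsel   # mixture over the two patterns

y = [52.0, 47.5, 60.1, 55.3, 49.8, 58.2]         # made-up selected-sample values
for delta in (-0.5, -0.25, 0.0, 0.25, 0.5):
    # Report how the estimate moves as the assumed departure from ignorability changes.
    print(delta, pm_mean(y, f=0.01, delta=delta))
```

Because delta is expressed directly as a difference in means between selected and unselected cases, the range of values to explore can be discussed with subject-matter experts without reference to a selection equation.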
I now return to the simpler case without auxiliary covariates. The data defect correlation has another feature that limits its value as a universal constant: its value varies depending on the selection fraction f.
If N is increased holding
the standardized selection bias multiplied by
As a numerical example, suppose
If N = 10^6, then
The data defect correlation is very different in these cases, but the selection bias is approximately the same.
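The dependence on f can be checked on a constructed finite population. The sketch below is an illustration under stated assumptions, not the numerical example above: the population values, the sample size n = 1,000, and the shift of 0.1 between selected and unselected cases are all hypothetical. It computes the data defect correlation as the finite-population correlation between the selection indicator R and Y, and verifies the identity bias = rho * sqrt((1 - f) / f) * sigma_Y, the usual statement of Meng's (2018) decomposition.

```python
import numpy as np

def defect_summary(n, N, shift):
    """Return (rho, bias, identity) for a constructed population.

    The first n units are selected; their Y values are shifted upward by
    `shift` relative to the N - n unselected units.
    """
    r = np.zeros(N)
    r[:n] = 1.0                                       # selection indicator R
    y = np.where(np.arange(N) % 2 == 0, 1.0, -1.0)    # deterministic +/-1 pattern
    y[:n] += shift                                    # selected cases differ in mean
    f = n / N
    rho = np.corrcoef(r, y)[0, 1]                     # data defect correlation
    bias = y[:n].mean() - y.mean()                    # selected mean minus population mean
    identity = rho * np.sqrt((1 - f) / f) * y.std()   # y.std() is the population SD (ddof=0)
    return rho, bias, identity

for N in (10**4, 10**6):
    # Same n and same shift; only the population size (and hence f) changes.
    print(N, defect_summary(1000, N, 0.1))
```

Holding the difference in means fixed, the correlation shrinks roughly in proportion to sqrt(f(1 - f)) as N grows, while the bias changes only through the factor 1 - f.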
Joint models for R and Y all suffer from lack of identifiability of parameters. In the context of nonresponse, a close relative of the problem of selection, Little and Rubin (2019, chap. 15) describe five strategies for missing not-at-random models:
1. Collect data on a subsample of nonrespondents, and use information from this sample to weight the selected units or impute the values of unselected units. Simulation studies in Glynn et al. (1986) in the context of nonresponse suggest that even a small nonrespondent subsample can markedly reduce sensitivity of inference to nonignorable nonresponse.
2. Use Bayesian modeling, including a prior distribution for unidentified parameters. An early example is Rubin (1977).
3. Impose restrictions to identify parameters. Two applications of this approach in the context of the Heckman (1976) model, one successful and one not successful, are described in Little and Rubin (2019, Examples 15.11 and 15.12). Bailey’s proposal for randomized response designs is an example, though this approach seems more tuned to missing data than to selection effects.
4. Selectively discard data based on assumed missing not-at-random assumptions. See in particular subsample ignorable likelihood methods (Little & Zhang, 2011).
5. Conduct a sensitivity analysis, varying values of the parameters measuring deviations from ignorable selection. Example 3 below illustrates one such approach. The approach was originally developed in the context of nonresponse (Andridge & Little, 2011), but has recently been adapted to develop indices of selection bias for nonrandom samples for means (Boonstra et al., 2021; Little et al., 2020) and regression coefficients (West et al., 2021). Extensions to a binary outcome are provided in Andridge and Little (2020) and Andridge et al. (2019).
Let Y be a continuous survey variable recorded on the selected sample, and
where f is an arbitrary function. With some additional assumptions, missingness can also depend on auxiliary variables uncorrelated with X. The maximum likelihood estimate of the population mean of Y is
Bayesian inference for this model is also relatively straightforward. For recent applications, see Andridge and Thompson (2015), West and Andridge (2023), and Andridge (2023).
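A sensitivity analysis of this type can be sketched in a few lines. The code assumes the standardized index takes the form g(phi, rho) * (proxy mean in the selected sample - proxy mean in the population) / (proxy SD in the selected sample), with g(phi, rho) = (phi + (1 - phi) * rho) / (phi * rho + (1 - phi)), as in the index of Little et al. (2020). That form is an assumption taken from the cited literature rather than from the text above, and the function name, argument names, and numerical inputs are hypothetical.

```python
def smub(phi, rho_xy, xbar_sel, xbar_pop, sd_x_sel):
    """Standardized bias index under the assumed proxy pattern-mixture form.

    phi in [0, 1] indexes the departure from ignorable selection (0 = selection
    depends only on the proxy X); rho_xy in (0, 1] is the correlation between
    Y and the proxy in the selected sample.
    """
    if not 0.0 <= phi <= 1.0:
        raise ValueError("phi must lie in [0, 1]")
    if not 0.0 < rho_xy <= 1.0:
        raise ValueError("rho_xy must lie in (0, 1]")
    g = (phi + (1 - phi) * rho_xy) / (phi * rho_xy + (1 - phi))
    return g * (xbar_sel - xbar_pop) / sd_x_sel

# Hypothetical inputs: proxy mean 0.55 in the sample, 0.50 in the population.
for rho in (0.3, 0.8):
    # For each proxy strength, sweep phi across its range.
    print(rho, [smub(phi, rho, 0.55, 0.50, 0.25) for phi in (0.0, 0.5, 1.0)])
```

Under this form the index at phi = 0.5 equals the standardized proxy difference itself, and the spread between phi = 0 and phi = 1 narrows as the proxy correlation approaches 1.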
Sensitivity to nonignorable selection bias in the above analysis is reduced when
I appreciate useful suggestions on this article by the editor, and my colleagues Michael Elliott and Yajuan Si.
Roderick J. Little has no financial or non-financial disclosures to share for this article.
American Association for Public Opinion Research. (2022). Data quality metrics for online samples: Considerations for study design and analysis. https://aapor.org/publications-resources/reports/
Andridge, R. R. (2023). Using proxy pattern-mixture models to explain bias in estimates of COVID-19 vaccine uptake from two large surveys. ArXiv. https://doi.org/10.48550/arXiv.2307.16653
Andridge, R. R., & Little, R. J. (2011). Proxy-pattern mixture analysis for survey nonresponse. Journal of Official Statistics, 27(2), 153–180.
Andridge, R. R., & Little, R. J. (2020). Proxy pattern-mixture analysis for a binary survey variable subject to nonresponse. Journal of Official Statistics, 36(3), 703–728. https://doi.org/10.2478/jos-2020-0035
Andridge, R. R., & Thompson, K. J. (2015). Assessing nonresponse bias in a business survey: Proxy pattern-mixture analysis for skewed data. Annals of Applied Statistics, 9(4), 2237–2265. https://doi.org/10.1214/15-AOAS878
Andridge, R. R., West, B. T., Little, R. J., Boonstra, P. S., & Alvarado-Leiton, F. (2019). Indices of non-ignorable selection bias for proportions estimated from non-probability samples. Journal of the Royal Statistical Society Series C: Applied Statistics, 68(5), 1465–1483. https://doi.org/10.1111/rssc.12371
Bailey, M. A. (2023). A new paradigm for polling. Harvard Data Science Review, 5(3). https://doi.org/10.1162/99608f92.9898eede
Boonstra, P. S., Little, R. J., West, B. T., Andridge, R. R., & Alvarado-Leiton, F. (2021). A simulation study of diagnostics for bias in non-probability samples. Journal of Official Statistics, 37(3), 751–769. https://doi.org/10.2478/jos-2021-0033
Bradley, V. C., Kuriwaki, S., Isakov, M., Sejdinovic, D., Meng, X.-L., & Flaxman, S. (2021). Unrepresentative big surveys significantly overestimated US vaccine uptake. Nature, 600, 695–700. https://doi.org/10.1038/s41586-021-04198-4
Glynn, R. J., Laird, N. M., & Rubin, D. B. (1986). Selection modeling versus mixture modeling with nonignorable nonresponse. In H. Wainer (Ed.), Drawing inferences from self-selected samples (pp. 115–142). Springer-Verlag. https://doi.org/10.1007/978-1-4612-4976-4_10
Glynn, R. J., Laird, N. M., & Rubin, D. B. (1993). Multiple imputation in mixture models for nonignorable nonresponse with follow-ups. Journal of the American Statistical Association, 88(423), 984–993. https://doi.org/10.2307/2290790
Greenlees, W. S., Reece, J. S., & Zieschang, K. D. (1982). Imputation of missing values when the probability of response depends on the variable being imputed. Journal of the American Statistical Association, 77, 251–261. https://doi.org/10.2307/2287228
Heckman, J. (1976). The common structure of statistical models of truncation, sample selection and limited dependent variables, and a simple estimator for such models. Annals of Economic and Social Measurement, 5, 475–492. https://www.nber.org/system/files/chapters/c10491/c10491.pdf
Keiding, N., & Louis, T. A. (2016). Perils and potentials of self-selected entry to epidemiological studies and surveys (with discussion). Journal of the Royal Statistical Society Series A: Statistics in Society, 179(2), 319–376. https://doi.org/10.1111/rssa.12136
Little, R. J. (1993). Pattern-mixture models for multivariate incomplete data. Journal of the American Statistical Association, 88(421), 125–134. https://doi.org/10.1080/01621459.1993.10594302
Little, R. J. (1994). A class of pattern-mixture models for normal missing data. Biometrika, 81(3), 471–483. https://doi.org/10.1093/biomet/81.3.471
Little, R. J., & Rubin, D. B. (2019). Statistical analysis with missing data (3rd ed.). Wiley.
Little, R. J., West, B. T., Boonstra, P. S., & Hu, J. (2020). Measures of the degree of departure from ignorable sample selection. Journal of Survey Statistics and Methodology, 8(5), 932–964. https://doi.org/10.1093/jssam/smz023
Little, R. J., & Zhang, N. (2011). Subsample ignorable likelihood for regression analysis with missing data. Journal of the Royal Statistical Society Series C: Applied Statistics, 60(4), 591–605. https://doi.org/10.1111/j.1467-9876.2011.00763.x
Meng, X.-L. (2016). Discussion of paper by Keiding & Louis. Journal of the Royal Statistical Society Series A: Statistics in Society, 179(2), 351–352. https://doi.org/10.1111/rssa.12136
Meng, X.-L. (2018). Statistical paradises and paradoxes in big data (I): Law of large populations, big data paradox, and the 2016 US presidential election. Annals of Applied Statistics, 12(2), 685–726. https://doi.org/10.1214/18-AOAS1161SF
Rubin, D. B. (1977). Formalizing subjective notions about the effect of nonrespondents in sample surveys. Journal of the American Statistical Association, 72(359), 538–543. https://doi.org/10.1080/01621459.1977.10480610
West, B. T., & Andridge, R. R. (2023). An evaluation of 2020 pre-election polling estimates using new measures of non-ignorable selection bias. Public Opinion Quarterly, 87(S1), 575–601. https://doi.org/10.1093/poq/nfad018
West, B. T., Little, R. J., Andridge, R. R., Boonstra, P., Ware, E. B., Pandit, A., & Alvarado-Leiton, F. (2021). Assessing selection bias in regression coefficients estimated from nonprobability samples with applications to genetics and demographic surveys. Annals of Applied Statistics, 15(3), 1556–1581. https://doi.org/10.1214/21-AOAS1453
©2023 Roderick J. Little. This article is licensed under a Creative Commons Attribution (CC BY 4.0) International license, except where otherwise indicated with respect to particular material included in the article.