With ever-decreasing participation and response rates, a central question for polling, and for survey statistics more broadly, has been whether random sampling can still ensure the external validity of the observed sample. Bailey (2023) calls for a paradigm shift, putting Meng’s error decomposition and its data defect correlation in central roles. The article is well aligned with prior work on internal versus external validity and self-selection that arose with web-based enrollment in surveys and studies (Keiding & Louis, 2016). The prior work raises the question of representation when attempting to generalize results beyond the context in which the data were collected, whereas the current work calls into question the ability to generalize even to the current context given non-ignorable nonresponse.
Paradigm shifts rely on strong lines of demarcation and an uncompromising stance toward the existing literature to advocate for a fundamental change in approach. This article and Bailey's forthcoming book (Bailey, in press) follow a recent trend in calling attention to the issues of self-selection and, in particular, data quality as a fundamental issue that needs to be addressed in modern survey statistics and polling research. Data quality is often underappreciated in the statistical literature, which emphasizes novel data analytic techniques. Bailey's (2023) article reminds me of Rubin (2008) and how, in general, design trumps analysis.
My hesitancy in vociferously joining the chorus is that a call to arms for modern tools is often interpreted as a need to overthrow the orthodoxy. Traditional statistical methods can be a helpful complement to the ever-expanding modern toolkit. I believe Bailey (2023) would support the mantra “probability sampling as aspiration, not prescription” (Meng, 2022). I agree with this mindset. This not only leads to the data defect correlation as a natural data-quality metric, but also instills in the data analyst an embrace of sensitivity analysis as a central tenet of modern data analysis when using these data sources. The goal of the rest of this comment is to assess the usefulness of Meng’s decomposition in three more general settings, and to discuss some of Bailey’s proposed solutions, how they connect to older solutions to similar problems, and how it all connects to the problem of data quality—an increasingly important issue in modern statistics.
Bailey (2023) focuses on the error decomposition from Meng (2018) to motivate a future of polling in which the data defect correlation plays the central role. The approach is motivated by the comparison of the sample mean $\bar{Y}_n$ with the population mean $\bar{Y}_N$, whose error can be written exactly as
$$\bar{Y}_n - \bar{Y}_N = \underbrace{\hat{\rho}_{R,Y}}_{\text{data quality}} \times \underbrace{\sqrt{\frac{1-f}{f}}}_{\text{data quantity}} \times \underbrace{\sigma_Y}_{\text{problem difficulty}},$$
where $R$ is the recording indicator, $\hat{\rho}_{R,Y}$ is the data defect correlation between $R$ and the outcome $Y$, $f = n/N$ is the sampling fraction, and $\sigma_Y$ is the population standard deviation of $Y$.
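To make the decomposition concrete, here is a minimal numerical check of the identity above; the simulation design and variable names are my own illustration, not part of Meng (2018) or Bailey (2023).

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000                          # population size
Y = rng.normal(size=N)               # outcome in the finite population

# Non-ignorable self-selection: recording probability increases with Y.
p_record = 1 / (1 + np.exp(-(0.5 + 0.8 * Y)))
R = rng.random(N) < p_record         # recording indicator

f = R.mean()                                     # realized sampling fraction n/N
rho = np.corrcoef(R.astype(float), Y)[0, 1]      # data defect correlation
sigma = Y.std()                                  # problem difficulty

lhs = Y[R].mean() - Y.mean()                     # actual estimation error
rhs = rho * np.sqrt((1 - f) / f) * sigma         # the decomposition
print(lhs, rhs)                                  # agree up to floating point
```

Because the identity is exact, the two printed values agree for any selection mechanism, not just the logistic one assumed here.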
We start by investigating the interplay between imperfect testing and selection bias, based on Dempsey (in press). This example is motivated by the COVID-19 pandemic and the known inaccuracies of RT-PCR tests (Arevalo-Rodriguez et al., 2020; Cohen et al., 2020). Researchers often assume measurement error leads to parameter attenuation. When paired with selection bias, however, the two sources become entangled, and the resulting errors can be magnified, muted, or even switch signs. Assuming a binary outcome with known sensitivity $\alpha$ and specificity $\beta$, the adjusted estimator
$$\hat{p} = \frac{\bar{Y}_n + \beta - 1}{\alpha + \beta - 1},$$
where $\bar{Y}_n$ is the observed positivity rate among tested individuals, satisfies an error decomposition of the form
$$\hat{p} - p \;=\; \underbrace{\hat{\rho}_{R,Y}}_{\text{data quality}} \times \underbrace{\sqrt{\frac{1-f}{f}}}_{\text{data quantity}} \times \underbrace{\sigma_Y}_{\text{problem difficulty}} \times \underbrace{C_{\alpha,\beta}}_{\text{testing adjustment}}, \tag{1}$$
with the final factor written schematically here; see Dempsey (in press) for its exact form.
Equation 1 extends the original error decomposition of Meng (2018) to account for imperfect testing. The first three terms continue to represent data quality, data quantity, and problem difficulty, respectively. The final term is an imperfect testing adjustment, a complex function of the sampling rate differential, the odds ratio, and the ratio of the measurement errors’ interaction with prevalence to the sampling rates’ interaction with prevalence.
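The entanglement is easy to exhibit in simulation. The following sketch applies the standard Rogan–Gladen-type correction under outcome-dependent testing; all rates are illustrative assumptions of mine, not values from Dempsey (in press).

```python
import numpy as np

rng = np.random.default_rng(1)
N = 1_000_000
prevalence = 0.05
Y = (rng.random(N) < prevalence).astype(float)   # true infection status

alpha, beta = 0.80, 0.99                         # sensitivity, specificity
# Observed test result with misclassification:
T = np.where(Y == 1, rng.random(N) < alpha, rng.random(N) > beta).astype(float)

# Infected (symptomatic) people seek testing more often: selection bias.
p_test = np.where(Y == 1, 0.30, 0.02)
R = rng.random(N) < p_test

raw = T[R].mean()                                # naive apparent positivity
adjusted = (raw + beta - 1) / (alpha + beta - 1) # measurement-error correction
print(f"true {prevalence:.3f}  raw {raw:.3f}  adjusted {adjusted:.3f}")
```

In this toy setting the correction removes the measurement error exactly but inherits the selection bias: it recovers the prevalence among the tested, not in the population, illustrating the entanglement described above.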
Analysts may collect survey data longitudinally. As the error decomposition is multiplicative, one may claim that the ratio of estimates on consecutive days,
$$\hat{r}_t = \frac{\bar{Y}_{n,t+1}}{\bar{Y}_{n,t}},$$
is approximately unbiased for the true ratio $r_t = \bar{Y}_{N,t+1}/\bar{Y}_{N,t}$, where the multiplicative bias factors approximately cancel whenever the data defect correlation, sampling fraction, and problem difficulty are stable across consecutive days.
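A quick simulation, under the assumption of a selection mechanism that is stable across days, shows how badly biased levels can still produce a nearly unbiased growth ratio (all numbers illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
N = 1_000_000
true_prev = [0.010, 0.013]          # prevalence on days t and t+1 (30% growth)

est = []
for p in true_prev:
    Y = (rng.random(N) < p).astype(float)
    # Infected individuals are 10x as likely to be tested on both days.
    sel = np.where(Y == 1, 0.20, 0.02)
    R = rng.random(N) < sel
    est.append(Y[R].mean())

print("biased levels:", est)                 # both far above the truth
print("estimated growth:", est[1] / est[0])  # close to the true 1.3
```

The cancellation is only approximate here because prevalence itself enters the selection-biased level; it degrades further if testing behavior drifts over time.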
Figure 2a presents the true ratio and the potential biased estimators under a susceptible-exposed-infected-recovered (SEIR) model (Newman, 2002; Parshani et al., 2010; Pastor-Satorras & Vespignani, 2001) for the epidemic dynamics, with state evolution given by
$$\frac{dS}{dt} = -\frac{\beta S I}{N}, \qquad \frac{dE}{dt} = \frac{\beta S I}{N} - \kappa E, \qquad \frac{dI}{dt} = \kappa E - \gamma I, \qquad \frac{dR}{dt} = \gamma I,$$
where $\beta$ is the transmission rate, $\kappa$ is the rate at which exposed individuals become infectious, and $\gamma$ is the recovery rate.
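For readers who wish to reproduce dynamics of this form, a minimal integration of the SEIR system above is given below; the parameter values are illustrative and are not those underlying Figure 2a.

```python
import numpy as np
from scipy.integrate import solve_ivp

beta, kappa, gamma, N = 0.5, 1 / 5.2, 1 / 10, 1e6   # illustrative rates

def seir(t, x):
    """Right-hand side of the SEIR ordinary differential equations."""
    S, E, I, R = x
    return [-beta * S * I / N,
            beta * S * I / N - kappa * E,
            kappa * E - gamma * I,
            gamma * I]

x0 = [N - 100, 50, 50, 0]                            # initial compartments
sol = solve_ivp(seir, (0, 180), x0, t_eval=np.arange(181))
print("peak infectious count:", int(sol.y[2].max()))
```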
Journalists, public health experts, and government officials were all interested in cross-population comparisons to understand the impact of countries’ COVID-19 mitigation policies. Here, for simplicity, we focus on comparing the estimated effective reproductive rate and assume the two time series are aligned so that a common time index refers to the same stage of each epidemic.
A critical question is whether Meng’s decomposition can be extended beyond simple averages. Here, we consider general parameter estimation via estimating equations. Let $\hat{\theta}$ denote the solution to the estimating equation computed on the observed sample,
$$\sum_{i=1}^{N} R_i \, \psi(Y_i; \hat{\theta}) = 0,$$
for some estimating function $\psi$, and let $\theta^{\star}$ denote the corresponding population solution satisfying $\sum_{i=1}^{N} \psi(Y_i; \theta^{\star}) = 0$.
Taking a Taylor series expansion around $\theta^{\star}$ yields
$$0 = \sum_{i=1}^{N} R_i \, \psi(Y_i; \hat{\theta}) \approx \sum_{i=1}^{N} R_i \, \psi(Y_i; \theta^{\star}) + \left[ \sum_{i=1}^{N} R_i \, \dot{\psi}(Y_i; \theta^{\star}) \right] (\hat{\theta} - \theta^{\star}).$$
Then we can rewrite the approximate error as
$$\hat{\theta} - \theta^{\star} \approx -\,\bar{D}_n^{-1} \times \hat{\rho}_{R,\psi^{\star}} \times \sqrt{\frac{1-f}{f}} \times \sigma_{\psi^{\star}},$$
where $\psi^{\star}_i = \psi(Y_i; \theta^{\star})$, $\hat{\rho}_{R,\psi^{\star}}$ is the data defect correlation between the recording indicator and the estimating function, $\sigma_{\psi^{\star}}$ is the population standard deviation of $\psi^{\star}$, and $\bar{D}_n = n^{-1} \sum_{i} R_i \, \dot{\psi}(Y_i; \theta^{\star})$ is the average derivative over the observed sample. The same three ingredients (data quality, data quantity, and problem difficulty) reappear, with problem difficulty now rescaled by the information $\bar{D}_n$ in the estimating function.
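As a sanity check on this first-order approximation, the following sketch estimates the log-odds of a binary outcome under outcome-dependent selection and compares the realized error to the Taylor term; names and rates are illustrative.

```python
import numpy as np
from scipy.special import logit

rng = np.random.default_rng(3)
N = 500_000
Y = (rng.random(N) < 0.3).astype(float)
# Outcome-dependent selection: cases slightly over-recorded.
R = rng.random(N) < np.where(Y == 1, 0.05, 0.04)

theta_star = logit(Y.mean())        # population root of sum psi(Y; theta) = 0
theta_hat = logit(Y[R].mean())      # root of the selected-sample equation

# First-order (Taylor) error: for psi(y; theta) = y - expit(theta), the
# mean derivative at theta* is -p(1 - p), where p is the population mean.
p = Y.mean()
psi_bar = (Y[R] - p).mean()         # Meng's identity applies to this average
approx = psi_bar / (p * (1 - p))
print(theta_hat - theta_star, approx)
```

The two printed errors agree closely when selection is mild and diverge as the selection bias grows, which is exactly where the linearization loses accuracy.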
Bailey’s (2023) article emphasizes the importance of the Law of Large Populations: among studies sharing the same (fixed) average data defect correlation $\mathbb{E}[\hat{\rho}_{R,Y}] \neq 0$, the (stochastic) error of the sample mean, relative to its benchmark under simple random sampling, grows with the population size $N$ at the rate of $\sqrt{N}$ (Meng, 2018), where the benchmark is the standard error of the mean from a simple random sample of the same size.
The simplest way to formalize this is an asymptotic regime in which the data defect correlation is held fixed as the population size $N$ grows: the error of the sample mean then remains bounded away from zero while the simple random sampling benchmark shrinks, so the relative error grows at rate $\sqrt{N}$.
Under the alternative asymptotic regime where $\hat{\rho}_{R,Y}$ shrinks at rate $N^{-1/2}$, as it does in expectation under probability sampling, the error of the sample mean tracks the benchmark and vanishes as data accumulate.
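A small numerical sketch contrasts the two regimes (all values illustrative): with the sampling fraction held fixed, a fixed ddc leaves the absolute error flat while the simple random sampling benchmark shrinks, whereas a ddc shrinking like $N^{-1/2}$ tracks the benchmark.

```python
import numpy as np

def meng_error(rho, f, sigma=1.0):
    """Absolute error of the sample mean implied by the decomposition."""
    return abs(rho) * np.sqrt((1 - f) / f) * sigma

f = 0.01                                   # sampling fraction held fixed
for N in [10**4, 10**6, 10**8]:
    n = int(f * N)
    fixed = meng_error(rho=0.05, f=f)          # regime 1: ddc fixed in N
    shrinking = meng_error(rho=N**-0.5, f=f)   # regime 2: ddc ~ N^(-1/2)
    srs = 1 / np.sqrt(n)                       # SRS benchmark (sigma = 1)
    print(f"N={N:>9}  fixed={fixed:.4f}  shrinking={shrinking:.5f}  SRS={srs:.5f}")
```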
I think these two asymptotic regimes are at the heart of the recent debate over whether we can view the data defect correlation as a ‘universal constant.’ I prefer keeping such terms in physics where the notion is more suitable. The key question is whether the ddc is a reasonable alternative to the standard bias/variance view of statistical estimates. In the context of nonprobabilistic samples, I believe its utility lies in placing the correlation of the sampling mechanism and outcome at the center. In my opinion, unbiasedness is often overemphasized at the cost of high variance. Meng’s decomposition leads to a helpful reversal of roles for thinking about statistical analysis of survey data and potential solutions.
A concept emphasized by Meng (2018) but not as prominent in Bailey (2023) is the effective sample size ($n_{\mathrm{eff}}$): the size of a simple random sample whose mean would have the same mean squared error as the observed nonprobability sample mean. Equating the two (and ignoring the finite population correction) gives $n_{\mathrm{eff}} \approx f / \{(1-f)\,\mathbb{E}[\hat{\rho}_{R,Y}^{2}]\}$; in Meng’s (2018) analysis of 2016 U.S. election polls, a nominal sample of 2.3 million respondents carried an effective sample size of roughly 400.
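As a concrete sketch, the following computes this quantity for Meng’s election illustration; the function and its simplifications (no finite population correction) are mine.

```python
def n_eff(n, N, rho):
    """SRS size with the same MSE as the biased sample (fpc ignored)."""
    f = n / N
    return f / ((1 - f) * rho**2)

# Meng's (2018) 2016 election illustration: n = 2.3M respondents out of
# roughly 231M eligible voters, with a ddc of about -0.005.
print(round(n_eff(n=2_300_000, N=231_000_000, rho=-0.005)))   # ~400
```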
A common refrain is that these calculations demonstrate an issue in using nonprobabilistic samples to make any inferential statements. Here, we demonstrate that while the effective sample size is small for direct comparisons, there is potential for meaningfully large effective sample sizes when we consider the relative difference in two means. Here we consider differences in vaccination rates over successive time periods as our target. Let $\bar{Y}_{N,t}$ denote the population vaccination rate at time $t$ and take $\Delta_t = \bar{Y}_{N,t} - \bar{Y}_{N,t-1}$ as the estimand, estimated by the plug-in difference of sample means $\hat{\Delta}_t = \bar{Y}_{n,t} - \bar{Y}_{n,t-1}$. The error of $\hat{\Delta}_t$ is a difference of two Meng-style error terms, where the data defect correlations largely cancel whenever the selection mechanism is stable across periods, so the effective sample size for the change can be orders of magnitude larger than for either level (Yang et al., 2023).
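The following simulation sketch makes the point; the response rates are illustrative assumptions of mine, chosen so that vaccinated individuals over-respond identically in both periods.

```python
import numpy as np

rng = np.random.default_rng(4)
N, reps = 100_000, 200
p_prev, p_curr = 0.50, 0.55            # vaccination rates at t-1 and t
err_level, err_diff = [], []
for _ in range(reps):
    means = []
    for p in (p_prev, p_curr):
        Y = (rng.random(N) < p).astype(float)
        # Vaccinated individuals over-respond identically in both periods.
        R = rng.random(N) < np.where(Y == 1, 0.012, 0.008)
        means.append(Y[R].mean())
    err_level.append(means[1] - p_curr)
    err_diff.append((means[1] - means[0]) - (p_curr - p_prev))

def rmse(e):
    return float(np.sqrt(np.mean(np.square(e))))

print("RMSE of level:", rmse(err_level), "RMSE of change:", rmse(err_diff))
```

The level estimate is dominated by a large, stable bias, while the change estimate is nearly unbiased and its RMSE approaches the simple-random-sampling benchmark for the realized sample sizes.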
I end by emphasizing a central element of Bailey (2023)—the need for new statistical tools to deal with the data defect correlation. The data analyst has no means to estimate the data defect correlation from the observed sample alone; estimating it requires external information, such as a ground-truth benchmark for the population quantity.
In many circumstances, however, ground truth cannot be observed. Consider, for instance, COVID-19 active infection rates in the United States. By the middle of 2020, testing was readily available to most of the Indiana population. Working with the state of Indiana, Dempsey et al. (2020) were able to obtain demographic breakouts of both testing and COVID-19 cases. On the other hand, population-level ‘ground truth’ data were not available, as testing required self-selection into reporting. Between April 25 and 29, 2020, Indiana conducted statewide random molecular testing of persons ages 12 and older to estimate statewide prevalence.
Of course, probabilistic samples still come with caveats. Nonresponse did occur in the COVID-19 prevalence study (Yiannoutsos et al., 2021), whose analysis proposed a Bayesian model-based method to handle the missing data. Bailey is right to emphasize the common inadequacy of approaches that rely on the missing-at-random assumption holding given only demographic information. I agree this is inadequate, but I do not agree that “random sampling is, for all practical purposes, dead.” From my perspective, this goes back to a common issue in statistical applications: the statistician is consulted only after the study is completed. The analyst is left to rely on the common refrain without additional justification: ‘We acknowledge that this analysis makes the assumption that missing data are missing at random conditional on the measured covariates.’
Here, I highlight two orthogonal components to improve upon random sampling. First, I believe statisticians need to place a larger emphasis on data collection, that is, on what covariate information we collect on each individual. Too often I see demographic data and a few outcomes as the only measured variables. Second, I agree with Bailey that we need to consider alternative designs. Extending random sampling to include instrumental variable methods, such as randomized response instruments, is noteworthy. However, designing incentives is a highly nontrivial, context-specific task; see Gelman et al. (2003) and Singer et al. (1999) for examples of the complications in designing incentives to improve response rates. As for other potential designs, proximal methods from the causal inference literature may also be suitable (Zivich et al., 2023). These methods require the analyst to classify a subset of measured covariates into three types: (1) variables that may be common causes of selection and the outcome; (2) selection-inducing proxies; and (3) outcome-inducing confounding proxies. Nonresponse causes issues with the direct application of these ideas; however, thinking about proxies can improve data collection, which can in turn improve estimation. I see this as a great lens for thinking about future survey design.
Given well-thought-out survey design, I believe traditional analyses such as selection and pattern-mixture modeling (Little, 2008) and sensitivity analysis (Little et al., 2019; Robins, 1997) can go a long way in correcting for nonresponse bias. I would like to highlight two specific directions. First, there are a variety of subselection methods that seem promising: see Little and Zhang (2011), where subsampling led to weaker missing-data mechanism assumptions, and Meng (2022) for an alternative use aimed at reducing the data defect correlation. Second, intensive follow-up with nonresponders has been an effective tool for improving inference. Bailey is right that random sampling is more akin to random contact, but oversampling nonresponders may yield fruit (Glynn et al., 1993). Subsampling of nonresponders was first introduced by Hansen and Hurwitz (2004, a reprint of their classic 1946 paper) and is a standard part of survey sampling (Cochran, 1977; Groves, 1989; Thompson, 1992). These designs provide tools to control costs while targeting the right subset of individuals who require follow-up. Combining subsampling with randomized response tools may be an even better way to sample nonresponders. I think this is a direction with potential—combining traditional survey statistics (subsampling) with modern tools (randomized response) to improve survey design.
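To fix ideas, here is a minimal sketch of the Hansen–Hurwitz two-phase estimator under non-ignorable nonresponse; the response model and follow-up rate are illustrative assumptions, and I assume follow-up succeeds for everyone subsampled.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 10_000
Y = rng.binomial(1, 0.3, size=n).astype(float)        # outcomes in initial sample
respond = rng.random(n) < np.where(Y == 1, 0.7, 0.4)  # non-ignorable nonresponse

naive = Y[respond].mean()                  # biased: responders differ on Y

# Phase 2: intensively follow up a random 20% of the nonresponders.
nonresp = np.flatnonzero(~respond)
followup = rng.choice(nonresp, size=len(nonresp) // 5, replace=False)

w = respond.mean()                         # responding fraction in phase 1
hh = w * Y[respond].mean() + (1 - w) * Y[followup].mean()
print(f"true {Y.mean():.3f}  naive {naive:.3f}  Hansen-Hurwitz {hh:.3f}")
```

The estimator is unbiased because the follow-up subsample is a random draw from the nonresponders, so their mean is estimated without any missing-at-random assumption.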
We need an arsenal of experimental designs and data analytic methods to tackle the increasing problems of low response rates and self-selection. Self-selection (Keiding & Louis, 2016; Meng, 2018) leads to questionable external validity for nonprobabilistic samples. A critical issue is the lack of probabilistic samples in many contexts. Even when we can obtain large probabilistic samples, designs need to be improved to ensure external validity. Bailey's (2023) article is a call to arms against naïve reliance on random sampling, which often yields only random contact within a population. He asks us to think critically about current practices in survey statistics and whether they are sufficient to help us build generalizable knowledge.
This commentary discussed three directions in which the proposed paradigm shift can help us think critically and improve our understanding in important public health domains. While I think randomized response is an important direction, I argue we should not forget the existing toolkit, and I point to other directions of equal importance. My key claim is to always remember that design trumps all: careful statistical thinking needs to enter each step of the data analytic pipeline. I also want to emphasize the general limitations of probabilistic sampling during rapidly evolving crises. Thinking about how to carefully anchor analyses of nonprobabilistic data with well-designed probabilistic surveys is an important and fruitful direction for future work.
As the statistical toolkit grows, we should not toss out our simple compass for navigation. We should embrace the need for inferential triangulation (Hill, 1965), marrying the rich literature of context-free survey methods with context-specific reasoning. Trying multiple methods and building a body of evidence through multiple surveys can lead to inference toward the best explanation (Krieger & Davey Smith, 2016). Indeed, this view emphasizes the necessary modesty we must have in our conclusions from any one analysis of a single dataset. Bailey's (2023) article reminds me of an important point from John Milton’s work: “Where there is much desire to learn, there of necessity will be much arguing, much writing, many opinions; for opinion in good men is but knowledge in the making.”
Walter Dempsey has no financial or non-financial disclosures to share for this article.
Arevalo-Rodriguez, I., Buitrago-Garcia, D., Simancas-Racines, D., Zambrano-Achig, P., Del Campo, R., Ciapponi, A., Sued, O., Martinez-García, L., Rutjes, A. W., Low, N., Bossuyt, P. M., Perez-Molina, J. A., & Zamora, J. (2020). False-negative results of initial RT-PCR assays for COVID-19: A systematic review. PLOS ONE, 15(12), 1–19. https://doi.org/10.1371/journal.pone.0242958
Bailey, M. A. (2023). A new paradigm for polling. Harvard Data Science Review, 5(3). https://doi.org/10.1162/99608f92.9898eede
Bailey, M. A. (in press). Polling at a crossroads: Rethinking modern survey research. Cambridge University Press. https://www.cambridge.org/core/books/polling-at-a-crossroads/796BAA4A248EA3B11F2B8CAA1CD9E079
Beesley, L. J., Fritsche, L. G., & Mukherjee, B. (2020). An analytic framework for exploring sampling and observation process biases in genome and phenome-wide association studies using electronic health records. Statistics in Medicine, 39(14), 1965–1979. https://doi.org/10.1002/sim.8524
Beesley, L. J., & Mukherjee, B. (2022). Statistical inference for association studies using electronic health records: Handling both selection bias and outcome misclassification. Biometrics, 78(1), 214–226. https://doi.org/10.1111/biom.13400
Cochran, W. G. (1977). Sampling techniques (3rd ed.). John Wiley & Sons.
Cohen, A. N., Kessel, B., & Milgroom, M. G. (2020). Diagnosing SARS-CoV-2 infection: The danger of over-reliance on positive test results. medRxiv. https://doi.org/10.1101/2020.04.26.20080911
Cori, A., Ferguson, N., Fraser, C., & Cauchemez, S. (2013). A new framework and software to estimate time-varying reproduction numbers during epidemics. American Journal of Epidemiology, 178(9), 1505–1512. https://doi.org/10.1093/aje/kwt133
Dempsey, W. (in press). Addressing selection bias and measurement error in COVID-19 case count data using auxiliary information. The Annals of Applied Statistics.
Dempsey, W., Liao, P., Kumar, S., & Murphy, S. A. (2020). The stratified micro-randomized trial design: Sample size considerations for testing nested causal effects of time-varying treatments. The Annals of Applied Statistics, 14(2), 661–684. https://doi.org/10.1214/19-AOAS1293
Fraser, C. (2007). Estimating individual and household reproduction numbers in an emerging epidemic. PLOS ONE, 2(1), Article e758. https://doi.org/10.1371/journal.pone.0000758
Gelman, A., Stevens, M., & Chan, V. (2003). Regression modeling and meta-analysis for decision making: A cost-benefit analysis of incentives in telephone surveys. Journal of Business & Economic Statistics, 21(2), 213–225. https://doi.org/10.1198/073500103288618909
Glynn, R. J., Laird, N. M., & Rubin, D. B. (1993). Multiple imputation in mixture models for nonignorable nonresponse with follow-ups. Journal of the American Statistical Association, 88(423), 984–993. https://doi.org/10.1080/01621459.1993.10476366
Groves, R. M. (1989). Survey errors and survey costs. John Wiley & Sons.
Hansen, M. H., & Hurwitz, W. N. (2004). The problem of nonresponse in sample surveys. The American Statistician, 58(4), 292–294. https://doi.org/10.1198/000313004X6328
Hill, A. (1965). The environment and disease: Association or causation? Proceedings of the Royal Society of Medicine, 58(5), 295–300.
Irons, N. J., & Raftery, A. E. (2021). Estimating SARS-CoV-2 infections from deaths, confirmed cases, tests, and random surveys. PNAS, 118(31), Article e2103272118. https://doi.org/10.1073/pnas.2103272118
Keiding, N., & Louis, T. A. (2016). Perils and potentials of self-selected entry to epidemiological studies and surveys. Journal of the Royal Statistical Society: Series A (Statistics in Society), 179(2), 319–376. https://doi.org/10.1111/rssa.12136
Krieger, N., & Davey Smith, G. (2016). The tale wagged by the DAG: Broadening the scope of causal inference and explanation for epidemiology. International Journal of Epidemiology, 45(6), 1787–1808. https://doi.org/10.1093/ije/dyw114
Leung, G. (2020, April 6). Lockdown can’t last forever. Here’s how to lift it. The New York Times. https://www.nytimes.com/2020/04/06/opinion/coronavirus-end-social-distancing.html
Little, R. (2008). Selection and pattern-mixture models. In G. Fitzmaurice, M. Davidian, G. Verbeke, & G. Molenberghs (Eds.), Longitudinal data analysis (pp. 409–429). Chapman and Hall/CRC. https://doi.org/10.1201/9781420011579-30
Little, R., West, B., Boonstra, P., & Hu, J. (2019). Measures of the degree of departure from ignorable sample selection. Journal of Survey Statistics and Methodology, 8(5), 932–964. https://doi.org/10.1093/jssam/smz023
Little, R., & Zhang, N. (2011). Subsample ignorable likelihood for regression analysis with missing data. Journal of the Royal Statistical Society: Series C (Applied Statistics), 60(4), 591–605. https://doi.org/10.1111/j.1467-9876.2011.00763.x
Meng, X.-L. (2018). Statistical paradises and paradoxes in big data (I): Law of large populations, big data paradox, and the 2016 US presidential election. The Annals of Applied Statistics, 12(2), 685–726. https://doi.org/10.1214/18-AOAS1161SF
Meng, X.-L. (2022). Comment on “Statistical inference with non-probability survey samples”: Miniaturizing data defect correlation: A versatile strategy for handling non-probability samples. Survey Methodology, 48(2), 339–360. http://www.statcan.gc.ca/pub/12-001-x/2022002/article/00006-eng.htm
Newman, M. (2002). Spread of epidemic disease on networks. Physical Review E, 66(1), Article 016128. https://doi.org/10.1103/PhysRevE.66.016128
Parshani, R., Carmi, S., & Havlin, S. (2010). Epidemic threshold for the SIS model on random networks. Physical Review Letters, 104(25), Article 258701. https://doi.org/10.1103/PhysRevLett.104.258701
Pastor-Satorras, R., & Vespignani, A. (2001). Epidemic spreading in scale-free networks. Physical Review Letters, 86(14), Article 3200. https://doi.org/10.1103/PhysRevLett.86.3200
Robins, J. (1997). Non-response models for the analysis of non-monotone non-ignorable missing data. Statistics in Medicine, 16(1), 21–37. https://doi.org/10.1002/(SICI)1097-0258(19970115)16:1<21::AID-SIM470>3.0.CO;2-F
Rubin, D. B. (2008). For objective causal inference, design trumps analysis. The Annals of Applied Statistics, 2(3), 808–840. https://doi.org/10.1214/08-AOAS187
Singer, E., VanHoewyk, J., Gebler, N., Raghunathan, T., & McGonagle, K. (1999). The effects of incentives on response rates in interviewer-mediated surveys. Journal of Official Statistics, 15(2), 217–230.
Thompson, S. K. (1992). Sampling. John Wiley & Sons.
van Smeden, M., Lash, T. L., & Groenwold, R. H. H. (2019). Reflection on modern methods: Five myths about measurement error in epidemiological research. International Journal of Epidemiology, 49(1), 338–347. https://doi.org/10.1093/ije/dyz251
Wang, L., Zhou, Y., He, J., Zhu, B., Wang, F., Tang, L., Eisenberg, M., & Song, P. X. K. (2020). An epidemiological forecast model and software assessing interventions on COVID-19 epidemic in China. medRxiv. https://doi.org/10.1101/2020.02.29.20029421
Yang, Y., Dempsey, W., Han, P., Deshmukh, Y., Richardson, S., Tom, B., & Mukherjee, B. (2023). Exploring the big data paradox for various estimands using vaccination data from the global COVID-19 Trends and Impact Survey (CTIS). arXiv. https://doi.org/10.48550/arXiv.2306.14940
Yiannoutsos, C. T., Halverson, P. K., & Menachemi, N. (2021). Bayesian estimation of SARS-CoV-2 prevalence in Indiana by random testing. PNAS, 118(5), Article e2013906118. https://doi.org/10.1073/pnas.2013906118
Zivich, P. N., Cole, S. R., Edwards, J. K., Mulholland, G. E., Shook-Sa, B. E., & Tchetgen Tchetgen, E. J. (2023). Introducing proximal causal inference for epidemiologists. American Journal of Epidemiology, 192(7), 1224–1227. https://doi.org/10.1093/aje/kwad077
©2023 Walter Dempsey. This article is licensed under a Creative Commons Attribution (CC BY 4.0) International license, except where otherwise indicated with respect to particular material included in the article.