
Paradigm Lost? Paradigm Regained? Comment on “A New Paradigm for Polling” by Michael A. Bailey

Published on Oct 05, 2023


With ever-decreasing participation and response rates, a question for polling and survey statistics more broadly has been the central role of random sampling to ensure external validity of the observed sample. Bailey (2023) calls for a paradigm shift, putting Meng’s error decomposition and its data defect correlation in central roles. The article is well aligned with prior work on internal versus external validity and self-selection that came about due to web-based enrollment in surveys and studies (Keiding & Louis, 2016). The prior work raises the question of representation when attempting to generalize results beyond the context in which the data was collected, whereas the current work calls into question the ability to even generalize to the current context given non-ignorable nonresponse.

Paradigm shifts rely on strong lines of demarcations and an uncompromising stance on the existing literature to advocate for a fundamental change in approach. This article and Bailey's upcoming book follow a recent trend in calling attention to the issues of self-selection and, in particular, data quality as a fundamental issue that needs to be addressed in modern survey statistics and polling research. Data quality is often underappreciated in the statistical literature, which emphasizes novel data analytic techniques. Bailey's (2023) article reminds me of Rubin (2008) and how in general design trumps analysis.

My hesitancy in vociferously joining the chorus is that often a call to arms for modern tools is interpreted as a need to overthrow the orthodoxy. Traditional statistical methods can be a helpful complement to the ever-expanding modern toolkit. I believe Bailey (2023) would support the mantra “probability sampling as aspiration, not prescription” (Meng, 2022). I agree with this mindset. This not only leads to data defect correlation as a natural data-quality metric, but also instills in the data analyst a real embrace of sensitivity analysis as a central tenet of modern data analysis when using these data sources. The goal of the rest of this comment is to assess the usefulness of Meng’s decomposition in three more general settings, and discuss some of Bailey’s proposed solutions, how they connect to older solutions to similar problems, and how it all connects to the problem of data quality—an increasingly important issue in modern statistics.

Meng’s Decomposition

Bailey (2023) focuses on the error decomposition from Meng (2018) to motivate the future of polling in which the data defect correlation plays the central role. The approach is motivated by the comparison of sample mean $\bar Y_n$ and population mean $\bar Y_N$. An interesting question is whether the decomposition extends to more general settings. Here we present three extensions that demonstrate how Meng’s original insight translates to new settings, providing a modern lens to classical problems in survey statistics.

Impact of measurement-error

We start by investigating the interplay between imperfect testing and selection bias based on Dempsey (in press). This example is motivated by the COVID-19 pandemic and the known inaccuracies of RT-PCR tests (Arevalo-Rodriguez et al., 2020; Cohen et al., 2020). Researchers often assume measurement error leads to parameter attenuation. When paired with selection bias, however, the two sources become entangled, and resulting errors can be magnified, muted, or even switch signs. Assuming a binary outcome with known sensitivity and specificity (equivalently, a known false-positive rate $FP$ and false-negative rate $FN$), the adjusted estimator $(\bar Y_n - FP)/(1-FP-FN)$, when compared to the population mean $\bar Y_N$, leads to a similar error decomposition:

$$
\begin{aligned}
&\rho_{R, Y} \times \sqrt{\frac{1-f}{f}} \times \sigma_Y \times \\
&\left[ 1 - \Delta \times \frac{\bar Y_N}{1 - \bar Y_N} \times \frac{FP(1-\bar Y_N) + FN \cdot \bar Y_N}{f_0 (1- \bar Y_N) + f_1 \bar Y_N} \times \frac{1}{1-FP-FN} \right]
\end{aligned} \tag{1}
$$

where $f = n/N$, $f_1$ and $f_0$ are sampling fractions for positive and negative outcomes respectively, and $\Delta = f_0 - f_1$ is the sampling differential. Figure 1 shows that the final term, as a function of the relative frequency ($f_1/f_0$) and log odds ratio, can be positive or negative and can take a range of magnitudes (Beesley & Mukherjee, 2022; Beesley et al., 2020; van Smeden et al., 2019). Assuming no measurement error, the final term is equal to one, so the relation between estimation and selection bias is simple. For example, if COVID-19-positive individuals were more likely to receive a test, then prevalence estimates are biased upward. Under random testing (i.e., $f_0 = f_1$), the final term is equal to $(1-FP-FN)^{-1}$ so measurement error simply magnifies this error. When tests are imperfect and selection bias exists, this simple relationship no longer holds.

Figure 1. Imperfect testing adjustment contour plot as a function of relative frequency $f_1/f_0$ ($x$-axis) and odds ratio ($y$-axis) for $FP = 0.024$ and $FN = 0.13$.

Equation 1 extends the original error decomposition by Meng (2018) to account for imperfect testing. The first three terms continue to represent data quality, data quantity, and problem difficulty, respectively. The final term is an imperfect testing adjustment, which is a complex function of the sampling rate differential, the odds ratio, and the ratio of measurement error interaction with prevalence and sampling rates’ interaction with prevalence.
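As a minimal sketch of the adjusted estimator $(\bar Y_n - FP)/(1-FP-FN)$ used above, the following Python snippet corrects an observed positive rate for known test error rates. The function names and the prevalence value are illustrative assumptions; the error rates are the ones quoted in the Figure 1 caption. The sketch handles misclassification only and says nothing about selection bias.

```python
def adjusted_prevalence(observed_rate, fp, fn):
    """Correct an observed positive rate for known test error rates.

    Implements (Ybar_n - FP) / (1 - FP - FN), where fp is the
    false-positive rate and fn is the false-negative rate.
    """
    denom = 1.0 - fp - fn
    if denom <= 0:
        raise ValueError("requires FP + FN < 1")
    return (observed_rate - fp) / denom


def expected_observed_rate(true_rate, fp, fn):
    """Expected positive rate when the tested group has prevalence true_rate."""
    return true_rate * (1.0 - fn) + (1.0 - true_rate) * fp


# Hypothetical prevalence among those tested; FP and FN as in Figure 1.
fp, fn = 0.024, 0.13
true_rate = 0.05
obs = expected_observed_rate(true_rate, fp, fn)
recovered = adjusted_prevalence(obs, fp, fn)
```

In this example the raw positive rate overstates the prevalence among those tested, and the adjustment recovers it. Any remaining gap to the population mean is the selection component described by Equation 1.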

Regrettable Rates

Analysts may collect survey data longitudinally. As the error decomposition is multiplicative, one may claim the ratio of estimates on consecutive days $\bar Y_{n,t}/\bar Y_{n,t-1}$ may be a better estimate of the true change in population means $\bar Y_{N,t}/\bar Y_{N,t-1}$ than the sample mean $\bar Y_{n,t}$ is of the population mean $\bar Y_{N,t}$. We next demonstrate how selection bias impacts such estimates, omitting measurement error for simplicity. Using a second-order Taylor series approximation, the error can be expressed approximately as

$$
\begin{aligned}
\frac{\bar Y_{N,t}}{\bar Y_{N,t-1}} &\times \bigg[ \rho_{I_t,Y_{t}} \sqrt{\frac{1-f_t}{f_t}} CV (Y_t) -\rho_{I_{t-1},Y_{t-1}} \sqrt{\frac{1-f_{t-1}}{f_{t-1}}} CV (Y_{t-1}) \bigg] \\
&\times \left[ 1 - \rho_{I_{t-1},Y_{t-1}} \sqrt{\frac{1-f_{t-1}}{f_{t-1}}} CV (Y_{t-1}) \right]
\end{aligned}
$$

where $\rho_{I_t, Y_t}$ is the data quality, $f_t$ is the sampling fraction, and $CV(Y_t) = \sigma_{Y_t}/ \bar Y_t$ is the coefficient of variation on day $t$. The error magnitude depends on the true rate $\bar Y_{N,t} / \bar Y_{N,t-1}$ so a large decrease will have a small error relative to a large increase. The second term represents potential cancellation, which can occur when data quality, sampling fraction, measurement error, and prevalence are constant across time.
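The approximation above can be checked numerically. Writing $a_t = \rho_{I_t,Y_t}\sqrt{(1-f_t)/f_t}\,CV(Y_t)$ for the relative error of the sample mean on day $t$ (taking the coefficient of variation relative to the population mean, which is an assumption about the notation), the exact error of the ratio is $\frac{\bar Y_{N,t}}{\bar Y_{N,t-1}} \left[ \frac{1+a_t}{1+a_{t-1}} - 1 \right]$. The sketch below compares this with the two-bracket approximation; the numeric inputs are hypothetical.

```python
def exact_ratio_error(true_ratio, a_t, a_prev):
    """Exact error of the ratio of sample means.

    a_t is the relative error of the day-t sample mean, so that
    Ybar_{n,t} = Ybar_{N,t} * (1 + a_t).
    """
    return true_ratio * ((1.0 + a_t) / (1.0 + a_prev) - 1.0)


def approx_ratio_error(true_ratio, a_t, a_prev):
    """Approximation from the text: ratio * (a_t - a_prev) * (1 - a_prev)."""
    return true_ratio * (a_t - a_prev) * (1.0 - a_prev)


# Hypothetical values: a 10% true increase with slowly drifting relative errors.
true_ratio = 1.10
exact = exact_ratio_error(true_ratio, 0.05, 0.04)
approx = approx_ratio_error(true_ratio, 0.05, 0.04)

# When the relative errors are identical on both days, the error cancels.
cancelled = exact_ratio_error(true_ratio, 0.05, 0.05)
```

The two expressions agree closely when the relative errors are small, and the cancellation case shows why a stable selection mechanism can make ratios more trustworthy than levels.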

Figure 2a presents the true ratio and the potential biased estimators under a susceptible-exposed-infected-recovered (SEIR) model (Newman, 2002; Parshani et al., 2010; Pastor-Satorras & Vespignani, 2001) for the epidemic dynamics, with state evolution given by

$$
\begin{split}
\frac{\partial s_t}{\partial t} &= - \beta s_t i_t; \quad \frac{\partial e_t}{\partial t} = \beta s_t i_t - \sigma e_t; \\
\frac{\partial i_t}{\partial t} &= \sigma e_t - \gamma i_t; \quad \frac{\partial r_t}{\partial t} = \gamma i_t,
\end{split} \tag{2}
$$

where $s_t$, $e_t$, $i_t$, and $r_t$ are the fractions of susceptible, exposed, infected, and removed (recovered or deceased) individuals in the population at time $t$, respectively. The SEIR model has been used extensively as a model for SARS-CoV-2 dynamics (Wang et al., 2020). We see the rate is overestimated prior to the peak and underestimated afterwards. Such biases may impact policymaking. Overestimation pre-peak may give policymakers more leverage in proposing aggressive actions to reduce prevalence. Underestimation post-peak puts pressure on policymakers to prematurely relax social distancing measures. Estimates at the peak time appear to have minimal bias. Similar results hold for the instantaneous effective reproduction number (Cori et al., 2013; Fraser, 2007), which many epidemiologists argue is the best quantity to track when managing the pandemic (Leung, 2020).

Figure 2. Potential bias in ratio and effective reproductive rate estimators under a susceptible-exposed-infected-recovered (SEIR) model with $\beta = 1.2$, $\gamma = 0.15$, and $\sigma = 0.3$. Here, $f = 0.02$, $FP=0.024$, $FN=0.13$, and a range of relative sampling fractions $M=f_1/f_0$ are considered.
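For readers who want to reproduce the qualitative shape of the true trajectory, here is a minimal forward-Euler sketch of the SEIR dynamics with the parameter values from the Figure 2 caption. It uses the standard convention in which the removed fraction grows at rate $\gamma i_t$, so the four compartments always sum to one. The initial infected fraction, time horizon, and step size are assumptions, and this is not the code used to produce the figure.

```python
def simulate_seir(beta=1.2, sigma=0.3, gamma=0.15, i0=1e-4, days=120, dt=0.1):
    """Forward-Euler integration of the SEIR equations.

    Returns a list of (s, e, i, r) tuples, one per day. The initial
    infected fraction i0, the horizon, and the step dt are assumptions.
    """
    s, e, i, r = 1.0 - i0, 0.0, i0, 0.0
    steps_per_day = int(round(1.0 / dt))
    path = [(s, e, i, r)]
    for _ in range(days):
        for _ in range(steps_per_day):
            new_exposed = beta * s * i * dt    # flow from s to e
            new_infected = sigma * e * dt      # flow from e to i
            new_removed = gamma * i * dt       # flow from i to r
            s -= new_exposed
            e += new_exposed - new_infected
            i += new_infected - new_removed
            r += new_removed
        path.append((s, e, i, r))
    return path


path = simulate_seir()
infected = [state[2] for state in path]
peak_day = max(range(len(infected)), key=infected.__getitem__)

# Day-over-day ratio of the infected fraction, the target of the ratio estimator.
ratios = [infected[t] / infected[t - 1] for t in range(1, len(infected))]
```

The true ratio is above one up to the peak and below one afterwards, which is the regime change around which the over- and underestimation described above occurs.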

Rate comparisons

Journalists, public health experts, and government officials were all interested in cross-population comparisons to understand the impact of countries’ COVID-19 mitigation policies. Here, for simplicity, we focus on comparing the estimated effective reproductive rate and assume the two time-series are aligned so that $t=0$ is the time of first case in each population. Consider two countries (A and B) in which the peak occurs 2 weeks earlier in country A than in country B. Figure 3 presents such a comparison where each country’s disease trajectory follows an SEIR model (A=black and B=grey). Figure 3b shows how biases interact in complex ways. At first, the difference is correctly estimated; then the gap is overestimated as country A sees a rapid rise in cases; then the magnitude of overestimation increases as country A sees declining case-count while country B sees rapidly increasing case-count; then country A’s rate is correctly estimated, while country B’s rate is underestimated as it sees declining case-count; finally, the gap disappears. While this may not always be the case, the analysis demonstrates how estimates can tell a more complex story than the truth (i.e., country A’s peak is 2 weeks prior to country B’s peak).

Figure 3. Left: fraction infected in two SEIR models with $\beta=1.2$ and $0.9$ respectively, $\sigma=0.3$, and $\gamma=0.15$ with same initial conditions. Right: comparison of $\hat{R}_t$ across time with $FN=0.30$, $FP=0.024$, and $M=4$.

Toward a general decomposition

A critical question is whether Meng’s decomposition can be extended beyond simple averages. Here, we consider general parameter estimation via estimating equations. Let $Z_j = (Y_j, X_j)$ denote the outcome and covariates for individual $j$. We assume a set of population parameters $\theta^\star \in \mathbb{R}^p$ that satisfy

$$
{\bf 0}_p = \frac{1}{N} \sum_{j=1}^N \psi(Z_j; \theta^\star) := E_J [ \psi(Z_J; \theta^\star) ]
$$

for some $\psi(z;\theta) \in \mathbb{R}^p$. This captures a larger class of models than considered in prior work. Specifically, letting $\psi(z;\theta)$ equal (1) $-(y-\theta)$ yields the population mean, (2) $-|y-\theta|$ yields the population median, and (3) $-(y-x^\top \theta) x$ yields the regression estimator. Here, we consider the estimator $\hat \theta_n$ that satisfies the empirical version:

$$
{\bf 0}_p = \frac{1}{n} \sum_{j=1}^n \psi(Z_j; \hat \theta_n) = \frac{1}{n} \sum_{j=1}^N I_j \psi(Z_j; \hat \theta_n) = \frac{E_J [ I_J \psi(Z_J; \hat \theta_n)]}{E_J[I_J]}
$$

Taking a Taylor series expansion around $\theta^\star$ implies the error approximately satisfies:

$$
\left( \hat \theta_n - \theta^\star\right) \approx -E_J[\nabla \psi(Z_J; \theta^\star) I_J ]^{-1} E_J[\psi(Z_J; \theta^\star) I_J ]
$$

Then we can rewrite the approximate error as

$$
\underbrace{-E_J[\nabla \psi(Z_J; \theta^\star) I_J ]^{-1} \sqrt{V(I_J)}}_{\text{data quantity?}} \; \underbrace{\Sigma^{1/2}}_{\text{problem difficulty}} \; \underbrace{\Sigma^{-1/2} \frac{E_J[\psi(Z_J; \theta^\star) I_J ]}{\sqrt{V(I_J)}}}_{\text{data defect correlation}} \tag{3}
$$

where $\Sigma = E_J[\nabla \psi(Z_J; \theta^\star)]^{-1} E_J[\psi(Z_J; \theta^\star)^{\otimes 2} ] E_J[\nabla \psi(Z_J; \theta^\star)]^{-1}$ denotes the population-level variance. To see that this is a generalization of the original decomposition, consider setting (1) where $\psi(z;\theta) = - (y-\theta)$. Then $\nabla \psi (z;\theta) = 1$, $\Sigma = V(Y_J)$, and $E_J[\nabla \psi(Z_J;\theta^\star) I_J] = E_J[I_J]$. Plugging into Equation 3 yields the original Meng decomposition. This raises the question as to whether each component retains its original interpretation in the general context. The final term can be interpreted as a data defect correlation since $E_J[\psi(Z_J;\theta^\star)]=0$ by definition. The second term can still be interpreted as problem difficulty as it represents the population-level variance. The first term depends on $\theta^\star$ and therefore may not always retain the simple data-quantity interpretation. Herein lies the complexity of the extension to the multidimensional parameter setting. To make this more concrete, consider setting (3), that is, the regression setting. In this case, $\psi(z;\theta) = - x (y - x^\top \theta)$, which implies $\nabla \psi(z;\theta) = x x^\top$, meaning the data quantity term takes the form $-E_J[X_J X_J^\top I_J ]^{-1} \sqrt{V(I_J)}$. A fully nonparametric model does retain the original interpretation; however, it is less clear how to interpret this term in other scenarios. We think this is a natural direction for future consideration. Below we discuss implications of this general formulation for methods development in addressing the data defect correlation.
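The special case of the mean can be verified directly. The sketch below builds a hypothetical finite population with outcome-dependent self-selection and checks that $\bar Y_n - \bar Y_N = \rho_{I,Y} \sqrt{(1-f)/f}\, \sigma_Y$ holds as an algebraic identity when all moments are taken over the finite population. The population size, outcome distribution, and selection probabilities are assumptions chosen only for illustration.

```python
import math
import random

random.seed(0)
N = 10_000

# Hypothetical population: units with positive outcomes respond twice as often.
y = [random.gauss(0.0, 1.0) for _ in range(N)]
ind = [1 if random.random() < (0.10 if yj > 0 else 0.05) else 0 for yj in y]


def mean(values):
    return sum(values) / len(values)


def pop_sd(values):
    """Finite-population standard deviation (divides by N, not N - 1)."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))


f = mean(ind)                                    # sampling fraction n / N
ybar_N = mean(y)                                 # population mean
ybar_n = sum(i * yj for i, yj in zip(ind, y)) / sum(ind)   # sample mean

cov_iy = mean([i * yj for i, yj in zip(ind, y)]) - f * ybar_N
rho = cov_iy / (pop_sd(ind) * pop_sd(y))         # data defect correlation

lhs = ybar_n - ybar_N
rhs = rho * math.sqrt((1 - f) / f) * pop_sd(y)   # quality x quantity x difficulty
```

Here `lhs` and `rhs` agree up to floating-point error, since both reduce to the covariance of the inclusion indicator and the outcome divided by $f$.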

Law of Large Populations and DDC as a universal constant

Bailey’s (2023) article emphasizes the importance of the Law of Large Populations: “Among studies sharing the same (fixed) average data defect correlation $E (\rho_{I,Y}) \neq 0$, the (stochastic) error of $\bar Y_n$, relative to its benchmark under SRS, grows with the population size $N$ at the rate of $\sqrt{N}$” (Meng, 2018). We revisit this statement in the context of a binary outcome to provide a bit of additional context. In this setting, the data defect correlation (ddc) can be expressed simply as

$$
\rho_{I,Y} = \Delta \sqrt{\frac{\bar Y_N (1-\bar Y_N)}{f(1-f)}} \tag{4}
$$

where $\Delta = f_1 - f_0$ is the sampling differential (note that the sign is reversed relative to Equation 1). This shows that the ddc is a deterministic function of sampling differential ($\Delta$), sampling rate ($f$), and prevalence ($\bar Y_N$). As the LLP is a statement often viewed under some asymptotic regime (“$\ldots$ grows with the population size $N$ $\ldots$”), we use Equation 4 to understand what is implied by “fixed average data defect.”

The simplest way to achieve this is to assume $f$, $f_1$, and $f_0$ are constant. This implies an asymptotic regime with growing sample size $n = n_1 + n_0$ such that $f_j = n_j/N_j$ is held fixed for $j=0,1$. Often, however, I see researchers interpret LLP under an asymptotic regime where sample size $n$ is fixed and population size grows, that is, a fixed recruitment of $n$ self-selected individuals from a growing population. Using the identity $f = f_1 \bar Y_N + (1-\bar Y_N) f_0 = f_0 \left( 1 + (M-1) \bar Y_N\right)$ where $M = f_1 /f_0$, the ddc can be re-written as

$$
\rho_{I,Y} = \frac{M-1}{1+(M-1) \bar Y_N} \sqrt{\frac{f}{1-f}} \sqrt{\bar Y_N (1-\bar Y_N)}.
$$

Under the asymptotic regime where $n$ and $M$ are both fixed (i.e., the relative sample sizes are equal), then $\rho_{I,Y} = O ( N^{-1/2} )$, which implies the relative error is no longer a function of population size. With $n$ fixed, for the ddc to remain constant requires the relative frequency $M = f_1/f_0 = n_1/n_0 \times N_0/N_1$ to be a function of population size; however, this would suggest an asymptotic regime where the populations substantially differ in their self-selection propensities as population size grows.
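The two asymptotic regimes can be made concrete with a short numerical sketch. It evaluates Equation 4 and its rewritten form, taking $M = f_1/f_0$ (the convention under which the identity $f = f_0(1 + (M-1)\bar Y_N)$ holds), and then shows that with $n$ and $M$ fixed the product $\rho_{I,Y}\sqrt{N}$ is roughly constant. The values of $n$, $M$, and the prevalence are hypothetical.

```python
import math


def ddc_binary(f1, f0, prevalence):
    """Equation 4: rho = (f1 - f0) * sqrt(p (1 - p) / (f (1 - f)))."""
    f = f1 * prevalence + f0 * (1.0 - prevalence)
    return (f1 - f0) * math.sqrt(prevalence * (1.0 - prevalence) / (f * (1.0 - f)))


def ddc_from_ratio(M, f, prevalence):
    """Same quantity in terms of M = f1 / f0 and the overall fraction f."""
    scale = (M - 1.0) / (1.0 + (M - 1.0) * prevalence)
    return scale * math.sqrt(f / (1.0 - f)) * math.sqrt(prevalence * (1.0 - prevalence))


# Hypothetical regime: fixed sample size n and fixed M, growing population N.
n, M, prevalence = 1_000, 2.0, 0.2
for N in (10**5, 10**6, 10**7):
    f = n / N
    rho = ddc_from_ratio(M, f, prevalence)
    # rho shrinks like N^{-1/2}, so rho * sqrt(N) stays roughly constant.
    print(N, rho, rho * math.sqrt(N))
```

Under this regime the $\sqrt{N}$ growth in the Law of Large Populations is offset by the shrinking ddc, which is the point made in the paragraph above.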

I think these two asymptotic regimes are at the heart of the recent debate over whether we can view the data defect correlation as a ‘universal constant.’ I prefer keeping such terms in physics where the notion is more suitable. The key question is whether the ddc is a reasonable alternative to the standard bias/variance view of statistical estimates. In the context of nonprobabilistic samples, I believe its utility lies in placing the correlation of the sampling mechanism and outcome at the center. In my opinion, unbiasedness is often overemphasized at the cost of high variance. Meng’s decomposition leads to a helpful reversal of roles for thinking about statistical analysis of survey data and potential solutions.

Effective sample sizes

A concept emphasized by Meng (2018) but not as prominent in Bailey (2023) is effective sample size ($n_{\text{eff}}$), which is defined as the size of a simple random sample that will produce the same mean-squared error as observed in the survey of interest. Using Meng’s decomposition, the effective sample size for $\bar Y_n - \bar Y_N$ is $\frac{f}{1-f} \cdot \frac{1}{E [ \rho_{I,Y}^2 ]}$. In prior work (Yang et al., 2023), we compared COVID-19 vaccine uptake in India as measured in a small probability survey (CVoter, $n = 2{,}700$) with a benchmark dataset (CoWIN). The effective sample size was never above $25$. In other prior work (Dempsey, 2023), we computed effective sample size of COVID-19 testing in Indiana to be $168$. These calculations emphasize the limited information for making inferential statements regarding population averages.
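As a rough illustration, the effective sample size can be computed from the sampling fraction and the mean squared ddc, reading the expression above as $f / \{(1-f)\,E[\rho_{I,Y}^2]\}$ and ignoring the finite-population correction. The inputs below are hypothetical and are not taken from the CVoter or Indiana analyses.

```python
def effective_sample_size(f, mean_sq_ddc):
    """Approximate n_eff = f / ((1 - f) * E[rho^2]).

    f is the sampling fraction n / N and mean_sq_ddc is E[rho_{I,Y}^2].
    The finite-population correction is ignored in this sketch.
    """
    if not 0.0 < f < 1.0:
        raise ValueError("f must lie strictly between 0 and 1")
    if mean_sq_ddc <= 0.0:
        raise ValueError("mean_sq_ddc must be positive")
    return f / ((1.0 - f) * mean_sq_ddc)


# Hypothetical inputs: 1% of the population responds, with a ddc of 0.005.
n_eff = effective_sample_size(f=0.01, mean_sq_ddc=0.005 ** 2)
```

Even a seemingly negligible correlation of 0.005 reduces a sample covering 1% of a large population to the equivalent of a few hundred simple random draws.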

A common refrain is that these calculations demonstrate an issue in using nonprobabilistic samples to make any inferential statements. Here, we demonstrate that while the effective sample size is small for direct comparisons, there is a potential for meaningfully large effective sample sizes when we consider relative difference in two means. Here we consider differences in vaccination rates over successive time periods as our target. Let $\bar Y_{n,t}$ denote the survey average and let $\bar Y_{N,t}$ denote the population average at time $t$. Then the effective sample size is given by

$$
n_{\text{eff}} = \frac{CV(\bar Y_{N,t})^2 + CV(\bar Y_{N,t-1})^2}{\left( \rho_{I_t, Y_t} \sqrt{\frac{1-f_t}{f_t}} CV(\bar Y_{N,t}) - \rho_{I_{t-1}, Y_{t-1}} \sqrt{\frac{1-f_{t-1}}{f_{t-1}}} CV(\bar Y_{N,t-1})\right)^2} \tag{5}
$$

where $CV(\bar Y)$ denotes the coefficient of variation $\sigma_Y / E_J[Y_J]$. We apply Equation 5 to relative successive differences of vaccine uptake in India based on the COVID-19 Trends and Impact Survey (Yang et al., 2023). The study ($n \approx 25{,}000$) had effective sample size on the order of $n_{\text{eff}} = 10^3$ to $10^6$. This implies an orders-of-magnitude increase in effective sample size. Of course, the caveats of interpretation presented above still apply as effective sample size is smallest during periods of rapid change. Thus, while we must find ways to alleviate the issues for absolute comparisons, it is important to note the usefulness of nonprobabilistic samples for understanding temporal dynamics. This is especially true as probabilistic samples are time-consuming, often not available, and, if they are collected, are not available at the same level of temporal granularity.
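A small sketch of Equation 5 shows how the effective sample size for a relative change can far exceed that for a level when the ddc is nearly stable over time. All numeric inputs are hypothetical and are not the CTIS values; the single-period formula $f/\{(1-f)\rho^2\}$ is used only for comparison.

```python
import math


def relative_error_term(rho, f, cv):
    """rho * sqrt((1 - f) / f) * CV for a single time period."""
    return rho * math.sqrt((1.0 - f) / f) * cv


def n_eff_successive(rho_t, f_t, cv_t, rho_prev, f_prev, cv_prev):
    """Equation 5: effective sample size for the relative change between periods."""
    gap = (relative_error_term(rho_t, f_t, cv_t)
           - relative_error_term(rho_prev, f_prev, cv_prev))
    if gap == 0:
        return math.inf  # perfect cancellation of the two error terms
    return (cv_t ** 2 + cv_prev ** 2) / gap ** 2


# Hypothetical inputs: same sampling fraction and CV, slightly drifting ddc.
f, cv = 0.001, 1.0
rho_prev, rho_t = 0.0050, 0.0051

n_eff_change = n_eff_successive(rho_t, f, cv, rho_prev, f, cv)
n_eff_level = f / ((1.0 - f) * rho_t ** 2)   # single-period effective sample size
```

With these inputs the level estimate is worth only a few dozen random draws, while the estimate of change is worth several orders of magnitude more. The gain disappears when the ddc, sampling fraction, or coefficient of variation shifts sharply between periods.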

Cast or Weigh Anchor and Set Sail?

I end by emphasizing a central element of Bailey (2023): the need for new statistical tools to deal with the data defect correlation. The data analyst has no means to estimate $\rho_{I,Y}$ without population-level ‘ground truth.’ In polling, election results are used to estimate $\rho_{I,Y}$ in both Bailey (2023) and Meng (2018). In the context of COVID-19, vaccine uptake is observed via national reporting (CoWIN in India and CDC in the United States) (Yang et al., 2023), and can therefore serve as ground truth in these settings.

In many circumstances, however, ground truth cannot be observed. Consider, for instance, COVID-19 active infection rates in the United States. By the middle of 2020, testing was readily available to most of the Indiana population. Working with the state of Indiana, Dempsey et al. (2020) were able to obtain demographic breakouts of both testing and COVID-19 cases. On the other hand, population-level ‘ground truth’ data was not available as testing required self-selection into reporting. Between April 25 and 29, 2020, Indiana conducted statewide random molecular testing of persons aged $\geq 12$ years to assess prevalence of active infection with SARS-CoV-2 (Yiannoutsos et al., 2021). A stratified random sampling design was used, with Indiana’s 10 public health preparedness districts as sampling strata. Moreover, COVID-19 death data was made available by the state. Dempsey et al. (2020) used these two sources both to estimate the data defect correlation and to anchor the analyses. Similarly, probabilistic sampling was performed intermittently in several states, which led to their natural use as anchors in the analysis of nonprobability samples. See Irons and Raftery (2021) for an example of anchoring in a Bayesian analysis. An issue that becomes apparent when considering more general decompositions is that population estimates $\theta^\star$ will usually not be available. I envision future work will consider anchoring via marginal constraints on the estimating equations. My attitude in these settings is well summarized by the Greek philosopher Epictetus:

“Neither should a ship rely on one small anchor, nor should life rest on a single hope.”

Of course, probabilistic samples still come with caveats. Nonresponse did occur in the COVID-19 prevalence study (Yiannoutsos et al., 2021). Their analysis proposed a Bayesian model-based method to handle missing data. Of course, Bailey is right to emphasize the common inadequacy of approaches that rely on the missing-at-random assumption to hold given only demographic information. I agree this is inadequate, but I do not agree that “random sampling is, for all practical purposes, dead.” From my perspective, this goes back to a common issue in statistical applications. The statistician is consulted only after the study is completed. The analyst is left to rely on the common refrain without additional justification: ‘We acknowledge that this analysis makes the assumption that missing data are missing at random conditional on the measured covariates.’

Here, I highlight two orthogonal components to improve upon random sampling. First, I believe statisticians need to place a larger emphasis on data collection, that is, what covariate information are we collecting on each individual? Too often I see demographic data and a few outcomes as the only measured variables. Second, I agree with Bailey that we need to consider alternate designs. Extending random sampling to include instrumental variable methods such as randomized response instruments is noteworthy. However, designing incentives is a highly nontrivial task and is context specific. See Gelman et al. (2003) and Singer et al. (1999) for examples of the complicated nature of designing incentives to improve response rates. As for other potential designs, I would note that proximal methods from the causal inference literature may also be suitable (Zivich et al., 2023). These methods require the analyst to classify a subset of measured covariates into three bucket types: (1) variables that may be common causes of selection bias and outcome; (2) selection-inducing proxies; and (3) outcome-inducing confounding proxies. Nonresponse causes issues with direct application of these ideas; however, thinking about proxies can improve data collection, which can in turn improve estimation. I see this as a great lens for thinking about future survey design.

Given well-thought-out survey design, I believe the traditional analyses such as selection and pattern-mixture modeling (Little, 2008) and sensitivity analysis (Little et al., 2019; Robins, 1997) can go a long way in correcting for nonresponse bias. I would like to highlight two specific directions. First, there are a variety of subselection methods that seem promising. See Little and Zhang (2011) where subsampling led to weaker missing data mechanism assumptions; see Meng (2022) for an alternative use aimed at reducing the data defect correlation. Second, intensive follow-up with nonresponders has been an effective tool for improving inference. Bailey (2023) is right that random sampling is more akin to random contact, but oversampling nonresponders may yield fruit (Glynn et al., 1993). Subsampling was first introduced by Hansen and Hurwitz (2004) and is a standard part of survey sampling (Cochran, 1977; Groves, 1989; Thompson, 1992). These designs provide tools to control costs while targeting the right subset of individuals who require follow-up. Combining with randomized response tools may be a better way to sample nonresponders. I think this is a direction with potential: combining traditional survey statistics (subsampling) with modern tools (randomized response) to improve survey design.


We need an arsenal of experimental designs and data analytic methods to tackle the increasing problems of low response rates and self-selection. Self-selection (Keiding & Louis, 2016; Meng, 2018) leads to questionable external validity from nonprobabilistic samples. A critical issue is the lack of probabilistic samples in many contexts. Even when we can get large probabilistic samples, designs need to be improved to ensure external validity. Bailey's (2023) article is a call to arms against naïve reliance on random sampling, which often only yields random contact within a population. He asks us to think critically about the current practices in survey statistics and whether they are sufficient to help us build generalizable knowledge.

This commentary discussed three directions in which the proposed paradigm shift can help us think critically and improve our understanding in important public health domains. While I think randomized response is an important direction, I argue we should not forget about the existing toolkit and point to other directions of equal importance. My key claim is to always remember that design trumps all. Careful statistical thinking needs to be in each step of the data analytic pipeline. I also want to emphasize the general limitations of probabilistic sampling during rapidly evolving crises. Thinking about how to carefully anchor analysis of nonprobabilistic data with well-designed probabilistic surveys is an important and fruitful direction for future work.

As the statistical toolkit grows, we should not toss out our simple compass for navigation. We should embrace the need for inferential triangulation (Hill, 1965), marrying the rich literature of context-free survey methods with context-specific reasoning. Trying multiple methods and building a body of evidence through multiple surveys can lead to inference toward the best explanation (Krieger & Davey Smith, 2016). Indeed, this view really emphasizes the necessary modesty we must have in our conclusions from one analysis of a single dataset. Bailey's (2023) article reminds me of an important point from John Milton’s work:

“The first and wisest of them all professed
To know this only, that he nothing knew.”
– Paradise Regained, John Milton (1671)

Disclosure Statement

Walter Dempsey has no financial or non-financial disclosures to share for this article.


Arevalo-Rodriguez, I., Buitrago-Garcia, D., Simancas-Racines, D., Zambrano-Achig, P., Del Campo, R., Ciapponi, A., Sued, O., Martinez-García, L., Rutjes, A. W., Low, N., Bossuyt, P. M., Perez-Molina, J. A., & Zamora, J. (2020). False-negative results of initial RT-PCR assays for COVID-19: A systematic review. PLOS ONE, 15(12), 1–19.

Bailey, M. A. (2023). A new paradigm for polling. Harvard Data Science Review, 5(3).

Bailey, M. A. (in press). Polling at a crossroads: Rethinking modern survey research. Cambridge University Press.

Beesley, L. J., Fritsche, L. G., & Mukherjee, B. (2020). An analytic framework for exploring sampling and observation process biases in genome and phenome-wide association studies using electronic health records. Statistics in Medicine, 39(14), 1965–1979.

Beesley, L. J., & Mukherjee, B. (2022). Statistical inference for association studies using electronic health records: Handling both selection bias and outcome misclassification. Biometrics, 78(1), 214–226.

Cochran, W. G. (1977). Sampling techniques (3rd ed.). John Wiley.

Cohen, A. N., Kessel, B., & Milgroom, M. G. (2020). Diagnosing SARS-CoV-2 infection: The danger of over-reliance on positive test results. medRxiv.

Cori, A., Ferguson, N., Fraser, C., & Cauchemez, S. (2013). A new framework and software to estimate time-varying reproduction numbers during epidemics. American Journal of Epidemiology, 178(9), 1505–1512.

Dempsey, W. (in press). Addressing selection bias and measurement error in covid-19 case count data using auxiliary information. The Annals of Applied Statistics.

Dempsey, W., Liao, P., Kumar, S., & Murphy, S. A. (2020). The stratified micro-randomized trial design: Sample size considerations for testing nested causal effects of time-varying treatments. The Annals of Applied Statistics, 14(2), 661–684.

Fraser, C. (2007). Estimating individual and household reproduction numbers in an emerging epidemic. PLOS ONE, 2(1), Article e758.

Gelman, A., Stevens, M., & Chan, V. (2003). Regression modeling and meta-analysis for decision making: A cost-benefit analysis of incentives in telephone surveys. Journal of Business & Economic Statistics, 21(2), 213–225.

Glynn, R. J., Laird, N. M., & Rubin, D. B. (1993). Multiple imputation in mixture models for nonignorable nonresponse with follow-ups. Journal of the American Statistical Association, 88(423), 984–993.

Groves, R. M. (1989). Survey errors and survey costs. John Wiley & Sons.

Hansen, M. H., & Hurwitz, W. N. (2004). The problem of nonresponse in sample surveys. The American Statistician, 58(4), 292–294.

Hill, A. (1965). The environment and disease: Association or causation? Proceedings of the Royal Society of Medicine, 58(5), 295–300.

Irons, N. J., & Raftery, A. E. (2021). Estimating SARS-CoV-2 infections from deaths, confirmed cases, tests, and random surveys. PNAS, 118(31), Article e2103272118.

Keiding, N., & Louis, T. A. (2016). Perils and potentials of self-selected entry to epidemiological studies and surveys. Journal of the Royal Statistical Society: Series A (Statistics in Society), 179(2), 319–376.

Krieger, N., & Davey Smith, G. (2016). The tale wagged by the DAG: Broadening the scope of causal inference and explanation for epidemiology. International Journal of Epidemiology, 45(6), 1787–1808.

Leung, G. (2020, April 6). Lockdown can’t last forever. Here’s how to lift it. The New York Times.

Little, R. (2008). Selection and pattern-mixture models. In G. Fitzmaurice, M. Davidian, G. Verbeke, & G. Molenberghs (Eds.), Longitudinal data analysis (409–429). Chapman and Hall/CRC.

Little, R., West, B., Boonstra, P., & Hu, J. (2019). Measures of the degree of departure from ignorable sample selection. Journal of Survey Statistics and Methodology, 8(5), 932–964.

Little, R., & Zhang, N. (2011). Subsample ignorable likelihood for regression analysis with missing data. Journal of the Royal Statistical Society: Series C (Applied Statistics), 60(4), 591–605.

Meng, X.-L. (2018). Statistical paradises and paradoxes in big data (i): Law of large populations, big data paradox, and the 2016 US presidential election. The Annals of Applied Statistics, 12(2), 685–726.

Meng, X.-L. (2022). Comment on “statistical inference with non-probability survey samples” - Miniaturizing data defect correlation: A versatile strategy for handling non-probability samples. Survey Methodology, 48(2), 339–360.

Newman, M. (2002). Spread of epidemic disease on networks. Physical Review E, 66(1), Article 016128.

Parshani, R., Carmi, S., & Havlin, S. (2010). Epidemic threshold for the SIS model on random networks. Physical Review Letters, 104(25), Article 258701.

Pastor-Satorras, R., & Vespignani, A. (2001). Epidemic spreading in scale-free networks. Physical Review Letters, 86(14), Article 3200.

Robins, J. (1997). Non-response models for the analysis of non-monotone non-ignorable missing data. Statistics in Medicine, 16(1), 21–37.

Rubin, D. B. (2008). For objective causal inference, design trumps analysis. The Annals of Applied Statistics, 2(3), 808–840.

Singer, E., VanHoewyk, J., Gebler, N., Raghunathan, T., & McGonagle, K. (1999). The effects of incentives on response rates in interviewer-mediated surveys. Journal of Official Statistics, 15(2), 217–230.

Thompson, S. K. (1992). Sampling. John Wiley & Sons.

van Smeden, M., Lash, T. L., & Groenwold, R. H. H. (2019). Reflection on modern methods: Five myths about measurement error in epidemiological research. International Journal of Epidemiology, 49(1), 338–347.

Wang, L., Zhou, Y., He, J., Zhu, B., Wang, F., Tang, L., Eisenberg, M., & Song, P. X. K. (2020). An epidemiological forecast model and software assessing interventions on COVID-19 epidemic in China. medRxiv.

Yang, Y., Dempsey, W., Han, P., Deshmukh, Y., Richardson, S., Tom, B., & Mukherjee, B. (2023). Exploring the big data paradox for various estimands using vaccination data from the global COVID-19 Trends and Impact Survey (CTIS). arXiv.

Yiannoutsos, C. T., Halverson, P. K., & Menachemi, N. (2021). Bayesian estimation of SARS-CoV-2 prevalence in Indiana by random testing. PNAS, 118(5), Article e2013906118.

Zivich, P. N., Cole, S. R., Edwards, J. K., Mulholland, G. E., Shook-Sa, B. E., & Tchetgen Tchetgen, E. J. (2023). Introducing proximal causal inference for epidemiologists. American Journal of Epidemiology, 192(7), 1224–1227.

©2023 Walter Dempsey. This article is licensed under a Creative Commons Attribution (CC BY 4.0) International license, except where otherwise indicated with respect to particular material included in the article.
