
# Introduction

# ‘Classical’ Selection Bias 101

# The Data Defect Correlation Is Not an Easily Interpreted Measure of Selection Bias

### Example 1. A Probit Selection Model for Univariate Data

### Example 2. A Normal Pattern-Mixture Model for Univariate Data

# The Data Defect Correlation Is Not a Universal Constant

# Approaches to Assessing Selection Bias

### Example 3. Proxy Pattern-Mixture Analysis

# Acknowledgment

# Disclosure Statement

# References


The “Law of Large Populations” Does Not Herald a Paradigm Shift in Survey Sampling

Published on Sep 27, 2023


Data quality in sample surveys, and opinion polls in particular, is an increasing concern, given sampling-frame deficiencies, rising nonresponse rates, and perceived failures in election polls. Concern is evidenced by the recent report from the American Association for Public Opinion Research (AAPOR) on data quality metrics for online surveys (AAPOR, 2022).

Bailey’s (2023) statement that “random sampling is, for all practical purposes, dead” might be defensible for opinion polling, but not for the field of survey sampling as a whole. In areas such as auditing, probability sampling is still practical; in other settings, government statistical agencies like the U.S. Census Bureau and the National Center for Health Statistics, and survey research organizations like Westat, Research Triangle Institute, and the Institute for Social Research at Michigan, strive to conduct high-quality probability surveys. The ideal of random sampling remains a very important concept even when it is not attainable in practice.

Bailey argues that Meng’s (2018) discussion of the role of probability sampling in the era of ‘big data,’ and in particular Meng’s “Law of Large Populations,” heralds a paradigm shift in the assessment of polling data collected by nonrandom sampling methods. Meng has provided useful illustrations of the trade-off between bias and variance in survey data, emphasizing the role of probability sampling in the era of big data. See, for example, Bradley et al. (2021) and Meng’s (2016) discussion of Keiding and Louis (2016), a wide-ranging debate of the pros and cons of probability sampling in the context of epidemiologic studies. However, I have reservations about his proposed Law of Large Populations (Meng, 2018), as represented by what Bailey calls “Meng’s equation”:

${\bar{Y}}_{n} - {\bar{Y}}_{N} = \rho_{R,Y} \times \sqrt{\frac{N - n}{n}} \times \sigma_{Y} \tag{1}$

where

I argue that

Meng is a friend who is aware of my reservations, and I appreciate his generous invitation to write about them in this commentary.

The left side of Equation 1,

$b = (1 - f)({\bar{Y}}_{n} - {\bar{Y}}_{N - n}) \tag{2}$

where *f* seems a more intuitive measure of “data quantity” than *f* increases.
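Equations 1 and 2 are algebraic identities for any finite population and any selection indicator, which can be checked numerically. The sketch below is illustrative only: the population values, the logistic selection rule, and the function name `data_defect_identity` are assumptions made for the example, and standard deviations are computed with divisor *N* so that the identity holds exactly.

```python
import math
import random


def data_defect_identity(y, r):
    """Return the left side of Equation 1 and the right sides of Equations 1 and 2.

    y: outcome values for all N population units.
    r: 0/1 selection indicators for the same units.
    """
    N = len(y)
    n = sum(r)
    f = n / N  # selection fraction
    ybar_N = sum(y) / N
    ybar_n = sum(yi for yi, ri in zip(y, r) if ri) / n
    ybar_rest = sum(yi for yi, ri in zip(y, r) if not ri) / (N - n)

    # Finite-population standard deviations (divisor N).
    sigma_y = math.sqrt(sum((yi - ybar_N) ** 2 for yi in y) / N)
    sigma_r = math.sqrt(f * (1 - f))
    cov_ry = sum((ri - f) * (yi - ybar_N) for yi, ri in zip(y, r)) / N
    rho_ry = cov_ry / (sigma_r * sigma_y)  # data defect correlation

    lhs = ybar_n - ybar_N
    rhs_eq1 = rho_ry * math.sqrt((N - n) / n) * sigma_y
    rhs_eq2 = (1 - f) * (ybar_n - ybar_rest)
    return lhs, rhs_eq1, rhs_eq2


random.seed(1)
# Hypothetical population: 2,000 units with roughly normal outcomes.
y = [random.gauss(50, 10) for _ in range(2000)]
# Hypothetical nonrandom selection: units with larger y are more likely selected.
r = [1 if random.random() < 1 / (1 + math.exp(-(yi - 60) / 5)) else 0 for yi in y]

lhs, rhs_eq1, rhs_eq2 = data_defect_identity(y, r)
print(lhs, rhs_eq1, rhs_eq2)  # the three values agree up to rounding error
```

The agreement of the three quantities does not depend on the selection mechanism; changing the selection rule changes their common value, not the identity.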

The parameter *precision*. In my very first course in statistics as an MS student at Imperial College, London, Sir David Cox distinguished between ‘*precision*’ (the variability of the estimate) and ‘*accuracy*’ (the quality of the estimate), which involves both precision and negligible bias. A standard statistical measure of accuracy is the root mean squared error:

$\text{RMSE} = \sqrt{\left( E(b) \right)^{2} + \text{var}} \tag{3}$

where E(*b*) is the expected bias and var is the variance of the sample estimate. In a simple superpopulation model where units are assumed exchangeable and unit outcomes modeled as independent, a particular form of the RMSE is

$\text{RMSE} = \sqrt{\left( E(b) \right)^{2} + (1 - f)\sigma_{Y}^{2}/n} \tag{4}$

where (1 – *f*) is the finite population correction.

A characteristic of this formulation is that bias from nonrandom selection, unlike precision, remains relatively constant as a function of sample size *n*. One might argue that bias actually *increases* with *n*, because collection of a larger sample is harder to control and more subject to measurement error; but measurement error issues are not a part of Meng’s (2018) discussion. As *n* increases, the relative contribution of precision (as measured by the variance) to the RMSE decreases, and bias increasingly dominates. Thus, for ‘big data,’ it is bias, not precision, that is the key issue. Meng has illustrated the impact of even small amounts of bias on the RMSE of sample estimates.
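The way bias comes to dominate the RMSE in Equation 4 can be seen with a small calculation. This is a minimal sketch under assumed values: the expected bias of 0.02, the standard deviation of 0.5, and the population size of 100 million are hypothetical, and the helper names `rmse` and `bias_share` are introduced only for illustration.

```python
import math


def rmse(expected_bias, sigma_y, n, N):
    """Equation 4: RMSE with a constant expected bias and finite population correction."""
    f = n / N
    variance = (1 - f) * sigma_y ** 2 / n
    return math.sqrt(expected_bias ** 2 + variance)


def bias_share(expected_bias, sigma_y, n, N):
    """Fraction of the mean squared error that is due to squared bias."""
    f = n / N
    variance = (1 - f) * sigma_y ** 2 / n
    return expected_bias ** 2 / (expected_bias ** 2 + variance)


N = 100_000_000        # hypothetical population size
EXPECTED_BIAS = 0.02   # hypothetical bias, held constant as n grows
SIGMA_Y = 0.5          # hypothetical standard deviation of Y

for n in (100, 1_000, 10_000, 100_000, 1_000_000):
    print(n,
          round(rmse(EXPECTED_BIAS, SIGMA_Y, n, N), 4),
          round(bias_share(EXPECTED_BIAS, SIGMA_Y, n, N), 3))
```

With the bias held fixed, the variance term shrinks roughly as 1/*n*, so the RMSE levels off near the absolute bias and the share of the mean squared error due to bias approaches 1.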

Provided *f* is small, it is the sample size *n*, not the population size *N*, that controls both the RMSE and the relative contribution of bias and variance. For example, a random sample of *n* = 1,000 yields the same precision whether the population size is 100,000 or 100 million; for nonrandom forms of sampling, if

Equations 1 and 4 are both valid expressions, so the underlying mathematics is not an issue. The data defect correlation is a measure of selection bias, but I do not think it is very easy to interpret; the correlation usually measures association between continuous variables, and is not a natural measure of association when one of the variables, *R*, is binary. The fact that it is a dimensionless quantity makes it potentially transportable across studies (but see the next section for a counterargument); it is less useful for assessing bias in a particular substantive setting. For example, I do not have a particularly strong intuitive notion of the difference between a correlation of 0.01 and 0.05. On the other hand, the more basic difference in means

The ability to address selection bias is limited without auxiliary information. Addressing selection bias more generally requires a superpopulation model for the joint distribution of *R* and *Y*, given auxiliary covariates *X* known for the whole population. Assuming for simplicity independence over population units *i*, two main approaches for formulating these models can be distinguished. *Selection* models factor the joint distribution of

$f_{Y,R}\left( r_{i},y_{i}\left| x_{i},\theta,\psi \right.\ \right) = f_{Y}\left( y_{i}\left| x_{i},\theta \right.\ \right)f_{R|Y}\left( r_{i}\left| x_{i},y_{i},\psi \right.\ \right), \tag{5}$

where the first factor characterizes the distribution of *pattern-mixture* models factor the joint distribution as

$f_{Y,R}\left( r_{i},y_{i}\left| x_{i},\xi,\omega \right.\ \right) = f_{Y|R}\left( y_{i}\left| x_{i},r_{i},\xi \right.\ \right)f_{R}\left( r_{i}\left| x_{i},\omega \right.\ \right), \tag{6}$

where the first distribution characterizes the distribution of

The data defect correlation *U* crosses a threshold. This models the joint distribution in Equation 6 as:

$\left\lbrack \begin{pmatrix}
y_{i} \\
u_{i} \\
\end{pmatrix}\left| x_{i},\theta,\psi,\sigma_{Y}^{2},\rho_{U,Y} \right.\ \right\rbrack \sim_{\text{ind}}
G_{2}\left( \begin{pmatrix}
\theta_{0} + \theta^{T}x_{i} \\
\psi_{0} + \psi^{T}x_{i} \\
\end{pmatrix}, \begin{pmatrix}
\sigma_{Y}^{2} & \rho_{U,Y}\sigma_{Y} \\
\rho_{U,Y}\sigma_{Y} & 1 \\
\end{pmatrix} \right), \\
r_{i} = 1\text{ when }u_{i} > 0, \tag{7}$

where *U* is scaled to have variance 1. The correlation *Y* and *U*. The model (7) implies the following probit selection model:

Suppose that for unit *i*,

$(y_{i}|x_{i},\theta,\sigma_{Y}^{2})\sim_{\text{ind}}G(\theta_{0} + \theta^{T}x_{i},\sigma_{Y}^{2}) \\
(r_{i}|x_{i},y_{i},\psi,\lambda)\sim_{\text{ind}}\text{BERN} \left( \Phi(\psi_{0} + \psi^{T}x_{i} + \lambda y_{i}) \right), \tag{8}$

where BERN denotes the Bernoulli distribution,

The parameter *X*; the probit transformation addresses the fact that *R* is binary. Selection is ignorable if *R* and/or *Y* on *X*, by setting one or more regression coefficients to zero. Results are then highly sensitive to whether these assumptions are correct.
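The latent-variable model in Equation 7 can be simulated directly to see how the correlation between *Y* and *U* translates into a difference between selected and unselected means. The sketch below is a simplified illustration with no covariates; the parameter values, the population size of 50,000, and the helper names `simulate_selection` and `mean` are assumptions made for this example.

```python
import math
import random


def simulate_selection(n_units, theta0, psi0, sigma_y, rho_uy, seed=0):
    """Draw (y, u) from the bivariate normal in Equation 7 without covariates.

    A unit is selected when its latent variable u exceeds zero.
    Returns the outcome values of selected and unselected units.
    """
    rng = random.Random(seed)
    selected, unselected = [], []
    for _ in range(n_units):
        z1 = rng.gauss(0, 1)
        z2 = rng.gauss(0, 1)
        u = psi0 + z1  # latent selection variable with variance 1
        # y has variance sigma_y**2 and correlation rho_uy with u.
        y = theta0 + sigma_y * (rho_uy * z1 + math.sqrt(1 - rho_uy ** 2) * z2)
        if u > 0:
            selected.append(y)
        else:
            unselected.append(y)
    return selected, unselected


def mean(values):
    return sum(values) / len(values)


for rho in (0.0, 0.3):  # hypothetical values of the correlation between Y and U
    sel, unsel = simulate_selection(50_000, theta0=0.0, psi0=-1.0,
                                    sigma_y=1.0, rho_uy=rho)
    print(rho, len(sel) / 50_000, mean(sel) - mean(unsel))
```

When the correlation is set to zero, *Y* and *U* are independent and the difference in means is close to zero apart from sampling noise; a positive correlation produces a selected mean that exceeds the unselected mean.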

The following pattern-mixture model can be viewed as a natural generalization of the expression in Equation 2:

An alternative to (8) is the following pattern-mixture model:

$(y_{i}|r_{i} = r,x_{i},\xi,\sigma^{(0)},\sigma^{(1)})\sim_{\text{ind}}G(\xi_{0}^{(r)} + \xi x_{i},\sigma^{(r)2}) \\
(r_{i}|x_{i},\omega)\sim_{\text{ind}}\text{BERN}\left( \Phi(\omega_{0} + \omega x_{i}) \right). \tag{9}$

The difference in means for selected and unselected cases, namely *X*.

Little and Rubin (2019, chap. 15) argue that the pattern-mixture model (Example 2) is easier to interpret than the selection model (Example 1). In particular, in (9), the selection effect is characterized by *Y* by one unit on the probit of the probability of selection, adjusting for the covariates. In Greenlees et al. (1982), the probit is replaced by a logit model, but the interpretation of

Other useful features of the pattern-mixture model are that it is often much easier to fit than the probit selection model, given assumptions to render the parameters estimable, and that it can include covariates that are only available in aggregate data form. Also, imputations of the missing values are based on the predictive distribution of *Y* given *X* and *R* = 0, which is modeled directly in the pattern-mixture factorization. For sensitivity analysis for nonignorable selection, pattern-mixture models are easier to implement and (in my view) easier to interpret for nonstatisticians.

I now return to the simpler case without auxiliary covariates. The data defect correlation has another feature that limits its value as a universal constant: its value varies depending on the selection fraction *f*.

If *N* is increased holding *N*. But Equation 1 implies that

$\rho_{R,Y} = \frac{({\bar{Y}}_{n} - {\bar{Y}}_{N - n})}{\sigma_{Y}} \times \sqrt{f(1-f)}$

the standardized selection bias multiplied by *n*, *N* is increased. Thus, the interpretation of *N*—it is different for a population size of 100,000 and for a population size of 100 million. For this reason, it does not make sense (as Meng and Bailey imply) to treat *N*. On the other hand, the difference in selected and unselected means is independent of *n* and *N* aside from precision considerations.

As a numerical example, suppose *Y* is binary, and

If

If *N* = 10^{6}, then

The data defect correlation is very different in these cases, but the selection bias is approximately the same.
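The dependence of the data defect correlation on the selection fraction follows from the expression above that relates it to the standardized difference in means and *f*. The sketch below uses hypothetical values that are not the ones in the article's numerical example: a sample size of 1,000, a standardized difference of 0.1, and three population sizes. The function name `data_defect_correlation` is introduced only for illustration.

```python
import math


def data_defect_correlation(std_mean_diff, f):
    """Data defect correlation implied by a standardized difference in means.

    std_mean_diff: (selected mean - unselected mean) / sigma_Y.
    f: selection fraction n / N.
    """
    return std_mean_diff * math.sqrt(f * (1 - f))


n = 1_000            # hypothetical sample size, held fixed
STD_MEAN_DIFF = 0.1  # hypothetical standardized selection bias, held fixed

for N in (10 ** 5, 10 ** 6, 10 ** 8):
    f = n / N
    print(N, f, data_defect_correlation(STD_MEAN_DIFF, f))
```

For small *f* the correlation scales approximately with the square root of *f*, so the same standardized difference in means corresponds to a much smaller correlation in the larger population.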

Joint models for *R* and *Y* all suffer from lack of identifiability of parameters. In the context of nonresponse, a close relative of the problem of selection, Little and Rubin (2019, chap. 15) describe five strategies for missing not-at-random models:

1. Collect data on a subsample of nonrespondents, and use information from this sample to weight the selected units or impute the values of unselected units. Simulation studies in Glynn et al. (1986) in the context of nonresponse suggest that even a small nonrespondent subsample can markedly reduce sensitivity of inference to nonignorable nonresponse.
2. Use Bayesian modeling, including a prior distribution for unidentified parameters. An early example is Rubin (1977).
3. Impose restrictions to identify parameters. Two applications of this approach in the context of the Heckman (1976) model, one successful and one not successful, are described in Little and Rubin (2019, Examples 15.11 and 15.12). Bailey’s proposal for randomized response designs is an example, though this approach seems more tuned to missing data than to selection effects.
4. Selectively discard data based on assumed missing not-at-random assumptions. See in particular subsample ignorable likelihood methods (Little & Zhang, 2011).
5. Conduct a sensitivity analysis, varying values of the parameters measuring deviations from ignorable selection. Example 3 below illustrates one such approach. The approach was originally developed in the context of nonresponse (Andridge & Little, 2011), but has recently been adapted to develop indices of selection bias for nonrandom samples for means (Boonstra et al., 2021; Little et al., 2020) and regression coefficients (West et al., 2021). Extensions to a binary outcome are provided in Andridge and Little (2020) and Andridge et al. (2019).

Let *Y* be a continuous survey variable recorded on the selected sample, and *Z* by a single proxy variable *X* that has the highest correlation with *Y*. This proxy variable can be estimated by regressing *Y* on *Z* using the selected sample, and taking *X* to be the predicted values of *Y*, available for both selected and unselected units. This regression should include important predictors of *Y*, as well as interactions and nonlinear terms where appropriate. Let *Y* and *X* among the selected cases, which we assume is positive. If the *X* is called a strong proxy for *Y*, and if *X* is called a weak proxy for *Y*. The proposed method is based on a bivariate pattern-mixture model (Little, 1994) for the distribution of (*X*,*Y*) for selected and unselected units:

$\left( (X,Y)|R = r \right)\sim G_{2}\left( (\mu_{x}^{(r)},\mu_{y}^{(r)}),\Sigma^{(r)} \right)$

where *f* is an arbitrary function. With some additional assumptions, missingness can also depend on auxiliary variables uncorrelated with *X*. The maximum likelihood estimate of the population mean of *Y* is

$\widehat{\mu}(\phi) = {\bar{Y}}_{n} + g({\widehat{\rho}}^{(1)})(s_{y}/s_{x})({\bar{X}}_{N} - {\bar{X}}_{n}), \quad g({\widehat{\rho}}^{(1)}) = \frac{\phi + (1 - \phi){\widehat{\rho}}^{(1)}}{1 - \phi + \phi{\widehat{\rho}}^{(1)}},$

where *Y* in the selected sample, *X* in the selected sample and population, *Y* and *X* in the selected sample, and *X* and *Y*. The proposed sensitivity analysis calculates estimates for three values of

Bayesian inference for this model is also relatively straightforward. For recent applications, see Andridge and Thompson (2015), West and Andridge (2023), and Andridge (2023).
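The estimate of the population mean given above is simple to compute from summary statistics. The following is a minimal sketch rather than a full implementation of the method: the summary statistics are hypothetical, the function names `g_factor` and `ppm_estimate` are introduced for illustration, and the grid of sensitivity values 0, 0.5, and 1 is an assumption of this example based on values commonly used in the proxy pattern-mixture literature.

```python
def g_factor(phi, rho_hat):
    """Multiplier g applied to the proxy adjustment for sensitivity parameter phi.

    rho_hat is the correlation of Y and X in the selected sample, assumed positive.
    """
    return (phi + (1 - phi) * rho_hat) / ((1 - phi) + phi * rho_hat)


def ppm_estimate(phi, ybar_n, xbar_n, xbar_N, s_y, s_x, rho_hat):
    """Estimate of the population mean of Y for a given value of phi."""
    return ybar_n + g_factor(phi, rho_hat) * (s_y / s_x) * (xbar_N - xbar_n)


# Hypothetical summary statistics for illustration only.
stats = dict(ybar_n=52.0,   # mean of Y in the selected sample
             xbar_n=51.0,   # mean of the proxy X in the selected sample
             xbar_N=50.0,   # mean of the proxy X in the population
             s_y=10.0,      # SD of Y in the selected sample
             s_x=8.0,       # SD of X in the selected sample
             rho_hat=0.8)   # correlation of Y and X in the selected sample

for phi in (0.0, 0.5, 1.0):  # assumed grid of sensitivity values
    print(phi, ppm_estimate(phi, **stats))
```

At phi = 0 the multiplier equals the sample correlation, which gives the usual regression adjustment; as phi increases toward 1 the multiplier grows toward the reciprocal of the correlation, so the adjustment is larger and more sensitive to the strength of the proxy.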

Sensitivity to nonignorable selection bias in the above analysis is reduced when *Y*; a weakness in many current surveys is that auxiliary variables are confined to demographic variables that do not have this property. An important role for existing high-quality probability surveys is to provide good auxiliary information for other nonprobability samples. One practical step is to include in the nonprobability survey any variables that (a) are good predictors of the survey content and (b) are available in large probability samples like the American Community Survey. These variables can then be incorporated as auxiliary variables in a proxy pattern-mixture analysis.

I appreciate useful suggestions on this article by the editor, and my colleagues Michael Elliott and Yajuan Si.

Roderick J. Little has no financial or non-financial disclosures to share for this article.

American Association for Public Opinion Research. (2022). *Data quality metrics for online samples: Considerations for study design and analysis*. https://aapor.org/publications-resources/reports/

Andridge, R. R. (2023). *Using proxy pattern-mixture models to explain bias in estimates of COVID-19 vaccine uptake from two large surveys*. ArXiv. https://doi.org/10.48550/arXiv.2307.16653

Andridge, R. R., & Little, R. J. (2011). Proxy pattern-mixture analysis for survey nonresponse. *Journal of Official Statistics*, *27*(2), 153–180.

Andridge, R. R., & Little, R. J. (2020). Proxy pattern-mixture analysis for a binary survey variable subject to nonresponse. *Journal of Official Statistics*, *36*(3), 703–728. https://doi.org/10.2478/jos-2020-0035

Andridge, R. R., & Thompson, K. J. (2015). Assessing nonresponse bias in a business survey: Proxy pattern-mixture analysis for skewed data. *Annals of Applied Statistics*, *9*(4), 2237–2265. https://doi.org/10.1214/15-AOAS878

Andridge, R. R., West, B. T., Little, R. J., Boonstra, P. S., & Alvarado-Leiton, F. (2019). Indices of non-ignorable selection bias for proportions estimated from non-probability samples. *Journal of the Royal Statistical Society Series C:* *Applied Statistics*, *68*(5), 1465–1483. https://doi.org/10.1111/rssc.12371

Bailey, M. A. (2023). A new paradigm for polling. *Harvard Data Science Review*, *5*(3). https://doi.org/10.1162/99608f92.9898eede

Boonstra, P. S., Little, R. J., West, B. T., Andridge, R. R., & Alvarado-Leiton, F. (2021). A simulation study of diagnostics for bias in non-probability samples. *Journal of Official Statistics*, *37*(3), 751–769. https://doi.org/10.2478/jos-2021-0033

Bradley, V. C., Kuriwaki, S., Isakov, M., Sejdinovic, D., Meng, X-L., & Flaxman, S. (2021). Unrepresentative big surveys significantly overestimated US vaccine uptake. *Nature*, *600*, 695–700. https://doi.org/10.1038/s41586-021-04198-4

Glynn, R. J., Laird, N. M., & Rubin, D. B. (1986). Selection modeling versus mixture modeling with nonignorable nonresponse. In H. Wainer (Ed.), *Drawing inferences from self-selected samples* (pp. 115–142). Springer-Verlag. https://doi.org/10.1007/978-1-4612-4976-4_10

Glynn, R. J., Laird, N. M., & Rubin, D. B. (1993). Multiple imputation in mixture models for nonignorable nonresponse with follow-ups. *Journal of the American Statistical Association*, *88*(423), 984–993. https://doi.org/10.2307/2290790

Greenlees, W. S., Reece, J. S., & Zieschang, K. D. (1982). Imputation of missing values when the probability of response depends on the variable being imputed. *Journal of the American Statistical Association*, *77*, 251–261. https://doi.org/10.2307/2287228

Heckman, J. (1976). The common structure of statistical models of truncation, sample selection and limited dependent variables, and a simple estimator for such models. *Annals of Economic and Social Measurement*, *5*, 475–492. https://www.nber.org/system/files/chapters/c10491/c10491.pdf

Keiding, N., & Louis, T. A. (2016). Perils and potentials of self-selected entry to epidemiological studies and surveys (with discussion). *Journal of the Royal Statistical Society Series A: Statistics in Society*, *179*(2), 319–376. https://doi.org/10.1111/rssa.12136

Little, R. J. (1993). Pattern‑mixture models for multivariate incomplete data. *Journal of the American Statistical Association*, *88*(421), 125–134. https://doi.org/10.1080/01621459.1993.10594302

Little, R. J. (1994). A class of pattern‑mixture models for normal missing data. *Biometrika*, *81*(3), 471–483. https://doi.org/10.1093/biomet/81.3.471

Little, R. J., & Rubin, D. B. (2019). *Statistical analysis with missing data* (3rd ed.). Wiley.

Little, R. J., West, B. T., Boonstra, P. S., & Hu, J. (2020). Measures of the degree of departure from ignorable sample selection. *Journal of Survey Statistics and Methodology*, *8*(5), 932–964. https://doi.org/10.1093/jssam/smz023

Little, R. J., & Zhang, N. (2011). Subsample ignorable likelihood for regression analysis with missing data. *Journal of the Royal Statistical Society Series C: Applied Statistics*, *60*(4), 591–605. https://doi.org/10.1111/j.1467-9876.2011.00763.x

Meng, X.-L. (2016). Discussion of paper by Keiding & Louis. *Journal of the Royal Statistical Society Series A: Statistics in Society*, *179*(2), 351–352. https://doi.org/10.1111/rssa.12136

Meng, X.-L. (2018). Statistical paradises and paradoxes in big data (I): Law of large populations, big data paradox, and the 2016 US presidential election. *Annals of Applied Statistics*, *12*(2), 685–726. https://doi.org/10.1214/18-AOAS1161SF

Rubin, D. B. (1977). Formalizing subjective notions about the effect of nonrespondents in sample surveys. *Journal of the American Statistical Association*, *72*(359), 538–543. https://doi.org/10.1080/01621459.1977.10480610

West, B. T., & Andridge, R. R. (2023). An evaluation of 2020 pre-election polling estimates using new measures of non-ignorable selection bias. *Public Opinion Quarterly*, *87*(S1), 575–601. https://doi.org/10.1093/poq/nfad018

West, B. T., Little, R. J., Andridge, R. R., Boonstra, P., Ware, E. B., Pandit, A., & Alvarado-Leiton, F. (2021). Assessing selection bias in regression coefficients estimated from nonprobability samples with applications to genetics and demographic surveys. *Annals of Applied Statistics*, *15*(3), 1556–1581. https://doi.org/10.1214/21-aoas1453

©2023 Roderick J. Little. This article is licensed under a Creative Commons Attribution (CC BY 4.0) International license, except where otherwise indicated with respect to particular material included in the article.
