
A New Paradigm for Polling

Published on Jul 27, 2023


Scientific fields operate within paradigms that define problems and solutions for a community of researchers. The dominant paradigm in polling centers on random sampling, which is unfortunate because random sampling is, for all practical purposes, dead. The pollsters who try to produce random samples fail because hardly anyone responds. And more and more pollsters do not even try. The field therefore has folded weighting-type adjustments into the paradigm, but this too is unfortunate because weighting works only if we assume away important threats to sampling validity, threats that loom particularly large in the growing nonprobability polling sector. This article argues that the polling field needs to move to a more general paradigm built around the Meng (2018) equation that characterizes survey error for any sampling approach, including nonrandom samples. Moving to this new paradigm has two important benefits. First, this new paradigm elevates new insights, including the fact that survey error increases with population size when individuals’ decisions to respond are correlated with how they respond. This insight helps us understand how small sampling defects can metastasize into large survey errors. Second, the new paradigm points the field toward new methods that more directly identify and account for sampling defects in a nonrandom sampling environment. This article describes the intuition and potential power of these new tools, tools that are further elaborated in Bailey (2024).

Keywords: survey nonresponse, survey design, survey methods

Media Summary

Low response rates and low-cost internet polls have for all practical purposes killed the random sampling paradigm that built the public opinion field. This article argues that the polling field needs to move to a more general paradigm built around the Meng (2018) equation that characterizes survey error for any sampling approach, including nonrandom samples. Moving to this new paradigm elevates new insights and points the field toward new methods that address more of the challenges of the contemporary polling environment. The article summarizes work that uses randomized response instruments that provide a systematic way to determine whether the people who respond to polls differ from those who do not, even after controlling for demographics. Such work has found that polls in the Midwest understated Trump support and overstated the liberalism of Democratic voters.

1. Toward a New Polling Paradigm

A scientific paradigm provides a model for articulating problems, solutions, and future research directions for a community of practitioners (Kuhn, 1970). In polling, the main paradigm has long revolved around random sampling, a tool that provides an elegant way to make inferences about a large population based on information from a relatively small, randomly chosen subset of people.

Because it is incredibly difficult to randomly sample in the contemporary polling environment, most pollsters augment random sampling with weighting and related tools such as quota sampling and multilevel regression with poststratification. These weighting-type adjustments make the nonrandom samples resulting from nonresponse look like they came from a random sample, but with a cost: the techniques require us to assume that the decision to respond is independent of the content of response once the weighting variables have been controlled for.

I argue in this article that the weighting-augmented random sampling paradigm is ill-suited for the contemporary polling environment. First, the random sampling heart of the paradigm is hardly relevant today given low response rates and nonprobability samples. Nonetheless, polls are routinely ‘pollwashed’ in ways that make them appear to have inherited the precision and distributional properties of random sampling even though they have not. Second, weighting-type adjustments bear the weight for fixing nonrandomness in modern polling, but are built on assumptions that are quite restrictive, especially in the current environment in which respondents are often recruited via nonrandom mechanisms.

The field needs a better paradigm, one that moves beyond random sampling without relying on the strong assumptions involved in weighting. The simple decomposition of survey error provided by Meng (2018) provides the foundation for such a paradigm. Instead of reducing pollsters to explaining their work in terms of idealized and never-seen random samples, we can characterize survey error for any sampling approach, including nonrandom samples and samples that arise when survey response is related to survey content.

Shifting to a modern polling paradigm produces two important payoffs. First, the new paradigm provides intuition that is more relevant to current polling practice. A key element of the Meng equation is a so-called data defect parameter that characterizes the degree to which whether someone responds is related to how someone responds. This parameter tends to get lost in the dominant polling paradigms: random sampling essentially minimizes it, while weighting-type methods assume it away. The Meng equation makes clear that this parameter is centrally important and interacts with population size—and not, to be clear, sample size. Even a small data defect in sampling can create large survey errors when surveying large populations (Bradley et al., 2021). The Meng equation also helps us appreciate why random contact is worthwhile even when response rates are low.

The second payoff of the new paradigm is that it helps us chart a path forward for research on survey methods. A great deal of survey research—including research on nonrandom samples, as described in Wu (2023)—focuses on adjustments that assume there is no correlation between whether and how people respond after controlling for population-benchmarked variables. Given the critical role of data defects in the new paradigm, it is no longer tenable to focus so heavily on approaches that assume them away. Instead, the new paradigm points us toward tools that minimize, measure, and counteract any relationships between whether and how people respond to surveys. Bailey (2024) elaborates these benefits and provides additional context and tools.

To appreciate the challenges of the current paradigmatic ambiguity, consider two polls: one uses a probability-based sample by a respected newspaper with a one percent response rate. The other is a Twitter poll initiated by an unpredictable billionaire. Suppose they both have the same sample size and that demographic data is available so that the weighted results are ‘nationally representative.’ Most polling experts will have a strong preference for one of the polls, but random sampling provides little direct guidance, other than helping us appreciate that neither sample is random and hence both could be biased. Within the new paradigm, on the other hand, the Meng equation enables us to clearly show why the newspaper poll is higher in quality, as I discuss below.

The goal of this article is to provide an overview of a new way of thinking about polling that is better suited to the contemporary polling environment than today’s focus on weighting and other tools that assume ignorable nonresponse. Section 2 highlights what we already know: random sampling is a distant echo of polling as practiced. Section 3 presents the Meng equation, focusing on its distinctive intuition. Section 4 shows how a paradigm built around the Meng equation naturally points to new research agendas, providing two examples in which approaches motivated by the new paradigm are able to address important survey challenges.

2. Paradigm Lost

Modern polling began with a commonsense but not deeply theorized paradigm of more-is-better (Bailey, 2024; Converse, 2009). The exemplar of this approach was the Literary Digest, a magazine that sent millions of surveys to voters before presidential elections. They had a decent track record until 1936 when their polls infamously indicated that Republican Alf Landon would win in a landslide. He lost in a landslide, discrediting the early big-data approach to polling. Quota samplers such as George Gallup filled the void, showing how relatively small representative samples were more accurate. They did well until 1948 when they predicted Republican Thomas Dewey would defeat President Harry Truman. Truman won, famously hoisted the “Dewey Defeats Truman” newspaper and sent the polling community scrambling for a more robust paradigm.

Random sampling filled the gap (Neyman, 1934). Using standard statistical theory, one could characterize the statistical properties of the mean of a random sample in ways that enabled accurate and systematic reasoning about population attributes from samples in the hundreds or low thousands. Remarkably, the accuracy of random sampling depends on the sample size, not the population size. Fortuitously, widespread adoption of telephones made random sampling cheap to implement.

The theory assumes that everyone randomly contacted for a survey responds. This was never true, but response rates were high and the connection between response and political views was attenuated enough that random sampling provided a decent approximation to guide political polling.

Over the last several decades, the relevance of random sampling theory has declined, largely due to accelerating levels of nonresponse. In the late 1990s, 60% of those contacted for political polls did not respond; today, that number is often 95% or higher (Cohn, 2022, October 12; Kennedy & Hartig, 2019).

The first problem that low response creates is that it attenuates—and probably breaks—the connection between survey theory and practice. No one thinks that the 1% of people who respond when contacted are truly a random sample of the population. The field therefore accommodated large-scale nonresponse by augmenting random sampling with weighting. Weighting involves placing more weight on respondents from groups who are underrepresented in a sample relative to their population proportion and placing less weight on respondents from groups who are overrepresented relative to their population proportion. Weighting requires identifying variables that affect response and the attribute being surveyed from among those variables for which the pollster knows the totals in the population. Typically, these variables are demographic variables such as age, race, gender, income, region, and education.
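The mechanics of weighting can be sketched in a few lines: each respondent is weighted by the ratio of their group's population share to its sample share. The sketch below is a minimal illustration with a single invented weighting variable and invented group means, not any pollster's actual protocol.

```python
import numpy as np

# Hypothetical example: weight a skewed sample back to known population
# shares on one variable (education). All numbers are invented.
rng = np.random.default_rng(0)

# Population shares, assumed known from benchmarks such as a census
pop_share = {"college": 0.35, "no_college": 0.65}

# A sample that overrepresents college graduates (60% vs. 35%)
groups = rng.choice(["college", "no_college"], size=1000, p=[0.60, 0.40])
y = np.where(groups == "college",
             rng.normal(55, 10, size=1000),   # invented college mean 55
             rng.normal(45, 10, size=1000))   # invented no-college mean 45

# Weight = population share / sample share for each respondent's group
sample_share = {g: (groups == g).mean() for g in pop_share}
w = np.array([pop_share[g] / sample_share[g] for g in groups])

raw_mean = y.mean()                       # pulled toward the college mean
weighted_mean = np.average(y, weights=w)  # near 0.35*55 + 0.65*45 = 48.5
print(raw_mean, weighted_mean)
```

The weighted mean recovers the population mixture only because, within each education group, respondents were simulated as representative—exactly the ignorability assumption discussed next.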

The shift from the random sampling paradigm to the random sampling-plus-weighting paradigm is so pervasive that it is unremarkable to many pollsters, even as they acknowledge the many decisions that must be made when weighting data (Gelman, 2007). Weighting is not costless, however, as it requires pollsters to assume that nonresponse is ignorable, meaning that the decision to respond is independent of the content of response once we have controlled for the weighting variables. This assumption implies that the people who choose to reply are representative samples given the covariates used in the weighting. A violation of this assumption means that nonresponse is non-ignorable, meaning that even after weighting, poll respondents differ from nonrespondents.

The assumption that the response mechanism is ignorable is also referred to as a mechanism that produces data that is ‘missing at random.’ Little and Rubin (2014, p. 22) note that virtually every approach to dealing with missing data makes this strong assumption. This list includes multilevel regression with poststratification (so-called MrP models) (Gelman & Hill, 2007) and nearest-neighbor imputation (YouGov, 2014).

Nonignorable nonresponse is concerning in many contexts.

  • Virtually every postmortem of the 2016 and 2020 U.S. presidential elections raised the possibility that weighting failed to properly adjust for the possibility that voters favoring Trump were less likely to respond, especially in the Midwest; see, e.g., Clinton et al. (2021), Kennedy et al. (2018).

  • Surveys of voting typically overestimate turnout, likely in part due to non-ignorable nonresponse (Jackman & Spahn, 2019).

  • Bradley et al. (2021) provide evidence that the type of people who get vaccinated are more likely to respond to some polls (especially ones based on nonrandom samples) even after controlling for demographics.

  • In marketing, evidence suggests that people’s willingness to provide product feedback depends on their experience with the product (Schoenmueller et al., 2020).

Low response rates have created another problem that has been harder to ignore: rising costs. It is now very expensive to field probability-based polls because pollsters need to wade through dozens of nonrespondents before they reach a single respondent, leading some to doubt the viability of the approach (Cohn, 2022, October 12). An increasing number of pollsters therefore have moved to nonprobability samples that are created by finding people willing to answer polls via ads, outreach to mailing lists and other, often opaque and sometimes novel, methods (Clinton et al., 2021; Wang et al., 2015). Pollsters use weighting-type adjustments to produce samples that are representative with regard to demographic benchmarks.

While true random sampling produces estimates with clear measures of quality, the field has struggled to operationalize quality in the post–random sampling world. Some pollsters anachronistically use the language of random sampling to imply that their polls have the properties of a random sample, a process I call ‘pollwashing.’ One way to do this is to report margins of error even though the theoretical basis of a margin of error is undone by nonresponse (and especially massive nonresponse, to say nothing of a nonrandom sample) (Shirani-Mehr et al., 2018).

Another tool for pollwashing is for pollsters to claim their samples are ‘nationally representative’ (Jamieson et al., 2023). In random sampling, a sample is probabilistically representative of a target population in expectation. In weighting, a sample can be made to share certain distributional characteristics with the population for variables used in the weighting protocol. This provides the survey with an aura of accuracy even for polls that have at best a modest claim at being truly representative in the way that an actual random sample would be. It is easy to see how this usage can stretch the concept of representativeness to the breaking point. Consider, for example, an opt-in internet poll on a candidate’s website. The data could be weighted to be nationally representative with respect to demographics, but no serious pollster would consider the sample representative in the sense that a true random sample would be. University of Michigan polling expert Raphael Nishimura summed it up nicely: “For the laymen, [representative sample] sounds like a well-defined technical sampling term, but it’s not. This is just as vague and meaningless as saying that a sample is ‘robust,’ ‘statistically valid’ or ‘awesome’ ” (Nishimura, 2023).

Pollwashing extends even to sample size. In a random sample, the survey average converges to the population average as the sample size increases, making sample size a useful metric for precision. In nonrandom samples, however, large samples guarantee little. We’ve known this since the 1936 Literary Digest fiasco, yet surveys continue to report sample sizes for their ‘nationally representative’ samples with the implication that more is better. When samples are nonrandom, however, our intuition that more is better—one of the core insights for random sampling—fails. Bradley et al. (2021) and others have shown that if response is correlated with opinion, the sample size can be wildly unreflective of the amount of information in a sample. I address this point below, as well.
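The gap between nominal and effective sample size can be made concrete with a back-of-the-envelope calculation in the spirit of Bradley et al. (2021), using the error decomposition presented in Section 3. Every input below is an illustrative assumption, not an estimate from any real poll: the point is only that a tiny data defect correlation, applied to a huge population, makes a quarter-million responses behave like a few dozen.

```python
import math

# Back-of-envelope: how much information does a big nonrandom sample carry?
# All inputs are invented for illustration.
N = 255_000_000   # population size (roughly US adults)
n = 250_000       # nominal sample size
rho = 0.005       # assumed small data defect correlation
sigma = 0.5       # data difficulty: sd of a binary item near 50/50

# Error implied by the decomposition in Section 3 (Equation 1)
error = abs(rho) * math.sqrt((N - n) / n) * sigma

# Simple-random-sample size whose standard error sigma/sqrt(n_eff)
# would match this error (ignoring the finite population correction)
n_eff = (sigma / error) ** 2
print(round(error, 3), round(n_eff))  # effective size is only a few dozen
```

A defect correlation of half of one percent is invisible to any demographic diagnostic, yet it reduces the information content of the sample by a factor of thousands.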

Given the lack of a clear measure for assessing polls, some in the field use predictive accuracy as a measure of polling quality (Silver, 2023). The danger with this approach is that if surveys have a systematic error, then a tool that counteracts that bias—however crudely—will do well. Survey firms with Republican bias were relatively accurate in 2016 and 2020. Were their methods better? Or were they biased in a fortuitous way for those elections? Many of these same firms performed poorly in 2018 and 2022, suggesting limits to polling accuracy as a measure of quality. Perhaps with enough time and a stable polling environment, track records may prove meaningful, but rather than waiting for polling methods to be exposed in an election, a better aspiration is to have a paradigmatic set of standards against which to judge polling methods.

3. Paradigm Found

What remains once we have ruled out metrics of survey quality such as demographic representativeness or large sample sizes or predictive accuracy? In this section, I articulate a framework that answers this question. The framework builds on Meng’s (2018) surprisingly simple and completely general characterization of survey error. It helps us contextualize survey error across protocols and points to shared standards and future research directions.

The framework is built on a simple model of the sample mean of a variable of interest, $Y$, from a sample of $n$ i.i.d. observations drawn from a population of size $N$. (The logic extends to other statistical quantities such as regression coefficients.) I denote the observed sample mean among respondents as $\overline{Y}_n$, where the lowercase $n$ subscript indicates the number of people in the sample (i.e., people for whom $R=1$). The difference between the mean of $Y$ in the $R=1$ group and the entire population is

$$\underbrace{\overline{Y}_n}_{\text{Sample average}} - \underbrace{\overline{Y}_N}_{\text{Population average}}.$$

At this point, we are not doing any statistical modeling; we are simply calculating the difference between the average value of $Y$ for people with $R=1$ and the average value of $Y$ for the entire population. Following the simple steps described in Meng (2018) and the appendix, we can rewrite sampling error in a way that decomposes it into three conceptually interesting quantities. I present the case with no covariates, but the logic carries over when there are covariates. (One may wish to consider the equation as applied within weighting demographics, for example.)

$$\overline{Y}_n-\overline{Y}_N =\underbrace{\rho_{R,Y}}_{\text{data defect correlation}} \times \underbrace{\sqrt{\frac{N-n}{n}}}_{\text{data quantity}} \times \underbrace{\sigma_Y}_{\text{data difficulty}}. \qquad (1)$$

The first term on the right-hand side of the Meng equation is $\rho_{R,Y}$, the correlation in the population between $R$ and $Y$. This quantity can be taken to reflect the quality of the data with regard to sampling. Bradley et al. (2021) refer to this quantity as the “data defect correlation” (sometimes referred to as the confounding correlation). The larger this correlation in magnitude, the more strongly response is related to outcome. When $\rho_{R,Y}=0$ the response mechanism is ignorable; when $\rho_{R,Y} \neq 0$, respondents have systematically different values of $Y$ than nonrespondents.

Because the Meng equation is an accounting identity, we know that if $\rho_{R,Y} = 0$, then the mean of the sample will literally equal the mean of the population. This fact points to the central insight of random sampling: if $R$ is based on a truly random process, then $\rho_{R,Y}$ will be expected to be quite close to zero. Remember, though, that the Meng equation is an accounting identity, so even when a sample is randomly chosen, it is unlikely that the correlation of $R$ and $Y$ will literally equal zero; hence the sample mean will generally not equal the population mean. Meng shows that as long as the data defect correlation is on the order of $\frac{1}{\sqrt{N}}$ (as it is with random sampling), then the response mechanism can be treated as ignorable.
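This scaling can be checked by simulation. Under simple random sampling the realized data defect correlation is not exactly zero, but its typical magnitude is about $1/\sqrt{N}$. The sketch below uses a synthetic population with invented parameters.

```python
import numpy as np

# Under simple random sampling, the realized correlation between R and Y
# is centered on zero with typical size about 1/sqrt(N).
rng = np.random.default_rng(2)

N, n, reps = 10_000, 500, 2000
Y = rng.normal(50, 15, size=N)      # invented population of opinions

rhos = np.empty(reps)
for i in range(reps):
    R = np.zeros(N)
    R[rng.choice(N, size=n, replace=False)] = 1   # pure random sample
    rhos[i] = np.corrcoef(R, Y)[0, 1]             # realized rho_{R,Y}

print(rhos.mean(), rhos.std())      # mean near 0, spread near 1/sqrt(N)
```

Note that the spread of the realized correlations depends on $N$, not on the sample size $n$: this is exactly the $1/\sqrt{N}$ order that makes random sampling's data defect cancel the population-size term in Equation 1.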

The second term on the right-hand side of the Meng equation is $\sqrt{\frac{N-n}{n}}$. It relates to the size of the population (capital $N$) and the size of the sample (lowercase $n$). Describing survey quality in terms of both $N$ and $n$ runs strongly counter to the intuitions of random sampling, but is crucial to understanding nonrandom samples. I explore this term in detail momentarily.

The final term on the right-hand side of the Meng equation is $\sigma_Y$, the square root of the variance of $Y$. Meng (2018) refers to this quantity as data difficulty in the sense that errors will be smaller if $Y$ varies only a little in the population. In an extreme case, $Y$ is the same for everyone in a population, which would mean $\sigma_Y = 0$ and the sample mean would equal the population mean. Generally, this source of polling error is taken as a given for any given survey item.
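Because Equation 1 is an accounting identity rather than an approximation, it can be verified numerically on any finite population, even one with a deliberately non-ignorable response mechanism. The sketch below uses an invented population and an invented response rule in which higher values of $Y$ make response more likely.

```python
import numpy as np

# Verify the Meng decomposition (Equation 1) as an exact identity
# on a synthetic population. All parameters are invented.
rng = np.random.default_rng(1)

N = 100_000
Y = rng.normal(50, 15, size=N)              # opinions in the population

# Non-ignorable response: higher Y means a higher chance of responding
p = 0.1 / (1 + np.exp(-(Y - 50) / 15))
R = rng.random(N) < p

n = R.sum()
actual_error = Y[R].mean() - Y.mean()       # left-hand side of Equation 1

# Right-hand side: population correlation, data quantity, data difficulty
rho = np.corrcoef(R.astype(float), Y)[0, 1]
rhs = rho * np.sqrt((N - n) / n) * Y.std()

print(actual_error, rhs)                    # identical, up to rounding
```

No statistical assumptions are invoked here: the three factors multiply out to the realized error for this (or any) realized sample, which is what makes the decomposition a useful lens for nonrandom samples.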

The Meng equation is very general, relatively simple, and remarkably insightful. It motivates intuitions that provide a more robust starting point for thinking about modern polling than random sampling. Here I focus on three important insights that are largely absent in the weighting-augmented random sampling paradigm, but clear in the new paradigm.

3.1. The Importance of $\rho$

Survey error is the product of three terms, meaning that we need to think of survey error as a combination of factors. If any one of the terms is zero, then survey error is zero, whatever the values of the other terms. The data quantity term is zero only if the sample size ($n$) equals the population size ($N$), and the data difficulty term is zero only if $Y$ does not vary at all in the population; neither is plausible in most polling contexts.

The only term that we can realistically drive toward zero with survey methods is $\rho_{R,Y}$. Random sampling does this in expectation via randomization. Weighting lowers $\rho_{R,Y}$ by conditioning on covariates that are observed in the sample and for which we have known population-level information. The easiest way to conceptualize this is to consider a case of cell weighting in which the population is broken into cells based on demographics. A single cell may contain college-educated Hispanic women over 65 years old, for example. The Meng equation applies to the estimates within each cell. Not accounting for education may induce a nonzero data defect correlation in the overall population because, for example, people with more education may respond at higher rates. Within each cell, however, it could be that there is no systematic difference between those who respond and those who do not. In this case, conditioning on covariates enables low-error sampling estimates within cells and, because we know population proportions for each cell, an analyst using weighting can combine the estimates proportionately to produce a low-bias population estimate. Even as weighting can reduce sampling bias, it requires a strong assumption: that the correlation of response and outcome is small in magnitude within cells, something that is not true if respondents differ from nonrespondents conditional on covariates.
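The limits of that assumption are easy to demonstrate in simulation: if response depends on $Y$ within cells, weighting corrects the sample's composition but not its selection bias. The sketch below uses two invented cells and an invented response rule.

```python
import numpy as np

# Weighting fixes composition, not within-cell selection.
# Two invented cells with known population shares 0.4 / 0.6.
rng = np.random.default_rng(3)

N = 200_000
cell = rng.random(N) < 0.4            # True = "college", pop share 40%
Y = np.where(cell, rng.normal(60, 10, N), rng.normal(40, 10, N))

# Non-ignorable response: inside EVERY cell, higher Y responds more
p = 0.02 + 0.06 * (Y > Y.mean())
R = rng.random(N) < p

# Cell weighting: estimate each cell mean among respondents, then
# combine using the known population shares
est = 0.4 * Y[R & cell].mean() + 0.6 * Y[R & ~cell].mean()
print(Y.mean(), est)                  # weighted estimate is still biased
```

The weighted estimate matches the population perfectly on the cell variable yet remains several points off the true mean, because the defect correlation survives inside each cell.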

Neither of these methods is satisfying in the contemporary context, in which nonresponse renders random samples virtually impossible and in which we would rather not solve problems by assuming them away. As seen in Section 2, there are many plausible scenarios in which response and outcome are correlated, even after controlling for observable covariates with known population proportions.

3.2. Population Size Matters

The data quantity term in the middle of the right-hand side of the Meng equation is a function of sample size $n$ and population size $N$. That sample size matters fits comfortably with our random sampling–inflected intuition: as long as $\rho_{R,Y} \neq 0$ and $\sigma_Y \neq 0$, survey error will decline as the sample size $n$ increases.

Notice, though, that the sample size is doing something quite different than it does in random sampling. The expected mean from a random sample is the true value no matter what the sample size is; the power of a larger sample in random sampling is to reduce the sampling variance of the mean. In the Meng equation, by contrast, when $\rho_{R,Y}\neq 0$ a larger sample directly shrinks the error itself.

Sampling error also depends on the size of the population, $N$. One of the incredible properties of random sampling is that it decouples the size of the population from the properties of the estimator. A (truly) random sample of 1,000, for example, will be equally accurate in expectation for a given data difficulty for any target population, be it a small state in the United States or the entire country of India.

In nonrandom samples, however, population size matters. Figure 1, based on Bailey (2024), displays samples of 20 from a relatively large and a relatively small population. Each square is a person. On the $x$-axis is $R^*$, the latent propensity to respond for each person. We observe a response $R = 1$ if $R^* > k$, where the threshold $k$ varies across the two panels. The key point for the intuition is that higher latent propensities to respond are associated with higher probabilities of response. On the $y$-axis is a feeling thermometer rating of, for example, President Biden; this is $Y$, the survey response of interest. The upward tilt of the shape in Figure 1 indicates that $\rho_{R,Y} > 0$, meaning that the people with higher propensities to respond have higher ratings of Biden.

Figure 1. Sample of 20 from large (left) and small (right) populations.

The blue squares are respondents. In the large population panel on the left, there are 328 people, 20 of whom respond (about 6%). These respondents are quite unrepresentative. Every one of them rates Biden above 40 and their average rating is 68, which is much higher than the population average of 40.

In the small population panel on the right, there are 40 people. As with the panel on the left, the sample size is 20. The respondents are also unrepresentative, but the magnitude of the unrepresentativeness is much smaller because the pollster had to go deeper into the pool to get 20 responses. This means that less extreme people made their way into the sample, producing ratings of Biden as low as 25 and an average rating among respondents of 50, which is higher than the population average of 40, but not as far off as in the large population example.1 In other words, the example shows how a sample of size $n$ will produce smaller error from a smaller population when the data defect correlation is not zero.
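The pattern in Figure 1 can be reproduced in simulation: holding the sample fixed at the $n$ most responsive people, error grows with population size. The data-generating process below is invented (latent propensity correlated with the rating), chosen only to make $\rho_{R,Y} > 0$ as in the figure.

```python
import numpy as np

# Error grows with population size N at fixed sample size n when the
# propensity to respond is correlated with the answer (cf. Figure 1).
rng = np.random.default_rng(4)
n = 20

def sample_error(N):
    # Invented process: rating Y rises with latent propensity r_star
    r_star = rng.normal(size=N)
    Y = 40 + 20 * r_star + rng.normal(0, 10, size=N)
    top = np.argsort(r_star)[-n:]        # the n most responsive people
    return Y[top].mean() - Y.mean()      # sample error for this draw

err_small = np.mean([sample_error(40) for _ in range(500)])
err_large = np.mean([sample_error(400) for _ in range(500)])
print(err_small, err_large)              # larger population, larger error
```

With the sample size pinned at 20, the only thing that changes between the two averages is how deep into the propensity distribution the pollster must reach, which is exactly the mechanism the figure illustrates.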

3.3. Random Contact

Building from the Meng equation, we can articulate a third important insight for contemporary polling (Bailey, 2024). First, let us distinguish:

  • Random sample: What random sampling theory is built on, but is not delivered by probability-based polling.

  • Random contact: What probability-based polling actually does. Pollsters using probability-based polls randomly contact people who may or may not respond.

Given that random contact is very expensive and nonetheless produces nonrandom samples, it is easy to sympathize with pollsters who have given up on random contact. However, random contact does important work even with very low response rates, an intuition that is hard to see in the current random sampling-plus-weighting paradigm.

To show this, I first show graphically how randomly choosing whom to contact unlinks the connection between sampling error and population size. After that, I use Meng’s equation to reconsider how non-ignorable nonresponse affects error when contact (but not response) is randomized.

Figure 2. Random contact.

Figure 2, from Bailey (2024), starts with the ‘large’ population in panel (a) of Figure 1. We know from Figure 1 that a sample of 20 respondents will produce a highly skewed sample, with an average $Y$ of 68, far from the population average of 40. Each box still represents a person, with their value of $Y$ (e.g., a feeling thermometer for a politician) on the $y$-axis and their propensity to respond on the $x$-axis. The filled-in grey boxes are randomly selected individuals contacted by the pollster. The open boxes are people the pollster does not contact.

Random contact does not imply that those who respond are a random sample. After all, people choose to pick up the phone or respond to an email, and this process can be influenced by many nonrandom factors, including factors correlated with $Y$, the feature we are trying to estimate in the population. The panel on the right of Figure 2 shows who responds among those randomly contacted. This sample continues to be unrepresentative.

Even though the sample is skewed, random contact has done something very important. The sample of 20 respondents from the random contact survey is not as unrepresentative as the sample of 20 respondents from the large population panel of Figure 1. We no longer get the $n$ most responsive people in the whole population (which is wildly unrepresentative for a large population), but instead hear from the $n$ most responsive people in a smaller, representative subset. The respondents in the random contact case depicted in the right panel of Figure 2 have an average value of $Y$ of 56, which is larger than the population average, but not as bad as the sample average of 68 that emerged from the no-random-contact case depicted in the left panel of Figure 1. In effect, random contact converted the large population into a smaller one.

In other words, while random contact does not eliminate error associated with a positive value of $\rho_{R,Y}$, it decouples sampling error from population size. In terms of the equation, Meng (2021) and Bailey (2024) show that survey error in a random contact survey is

$$\overline{Y}_n-\overline{Y}_N =\rho_{R,Y}\times \underbrace{\sqrt{\frac{1-p_r}{p_r}}}_{\text{data quantity}} \times \sigma_Y \qquad (2)$$

where $p_r$ is the response rate (see the appendix for the derivation). The crucial difference from Equation 1 is that the data quantity term depends on the response rate, $p_r$, and not on population size $N$. Since populations can be very large, this is very useful (although identifying the correct target population for the random contact is a challenge; see, e.g., Jackman & Spahn, 2019).
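The logic of Figure 2 can be checked in simulation: under random contact, the responders are the most responsive people in a random subset rather than in the whole population, which caps the error at the level set by the response rate. All parameters below are invented for illustration.

```python
import numpy as np

# Random contact caps the damage: responders are the most responsive
# people among those contacted, not in the whole population (cf. Figure 2).
rng = np.random.default_rng(5)
N, n, m = 50_000, 100, 2_000   # m randomly contacted; 100/2000 = 5% respond

def error(pool, r_star, Y):
    top = pool[np.argsort(r_star[pool])[-n:]]   # n most responsive in pool
    return Y[top].mean() - Y.mean()

errs_all, errs_contact = [], []
for _ in range(200):
    # Invented process: rating Y rises with latent propensity r_star
    r_star = rng.normal(size=N)
    Y = 40 + 20 * r_star + rng.normal(0, 10, size=N)
    errs_all.append(error(np.arange(N), r_star, Y))     # no random contact
    contact = rng.choice(N, size=m, replace=False)
    errs_contact.append(error(contact, r_star, Y))      # random contact
print(np.mean(errs_all), np.mean(errs_contact))
```

Both designs yield biased samples, but the random contact error tracks the response rate as in Equation 2, while the opt-in error tracks the full population size as in Equation 1.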

3.4. How Current Practice Fits Into the New Paradigm

One of the nice features of the new paradigm based on the Meng equation is that it is general enough to encompass the multiple approaches to surveys that dominate the field. We have already seen that random sampling is a mechanism that drives $\rho_{R,Y}$ toward zero in expectation. It continues to be an amazing tool, but is a special case in a more general framework.

Weighting approaches succeed if $\rho_{R,Y}$ goes to zero within each demographic subgroup (or, more precisely, conditional on observed covariates with known population distributions). Weighting protocols then essentially patch together subgroup estimates with good properties via the population proportions of weighting groups.

The new paradigm also makes it harder to ignore potential weaknesses of weighting. Most polls are reported with no effort to measure $\rho_{R,Y}$, meaning that the results are predicated on a leap of faith that $\rho_{R,Y}$ will be close to zero conditional on covariates. Given that $\rho_{R,Y}$ interacts with population size, this can be quite a leap: even a small nonzero value of $\rho_{R,Y}$ can lead things to go wrong very quickly, something that is hard to see from within the random sampling paradigm and that is particularly concerning when samples are not generated via random contact.

4. The New Paradigm in Practice

Equation 1 helps us appreciate the scale of the sampling problem we face in a post–random sampling world. It does not, however, provide specific guidance on estimating $\rho_{R,Y}$ and the associated standard errors needed for inference.

While some believe that there is little to be done to measure or undo nonzero $\rho_{R,Y}$, there is in fact a vibrant and growing literature that models, measures, and counters $\rho_{R,Y}$. These models cannot avoid relying on assumptions, of course, but they do not require response to be ignorable, and they can produce uncertainty estimates that allow us to rule non-ignorability in or out in many reasonable data contexts.

In this section I describe two such approaches. Both rely on response instruments, which are variables that affect the probability of response but do not directly affect the outcome of interest. Sun et al. (2018) show that a broad class of weighting, imputation, and doubly robust models can work if a response instrument is available. Bailey (2024) shows examples of how even parametric models that do not literally require response instruments tend to perform much better when a response instrument is available.

The first example uses an observational response instrument, which is convenient but suffers from the usual concern the literature has about observational instruments: whether they truly have no direct effect on $Y$. The second example uses a randomized response instrument, which is easier to defend on theoretical grounds. In some circumstances, randomized instruments are practically difficult to implement, but we shall see that they are not that difficult to create in a survey context. In both the observational and randomized settings, the response instrument must affect response and, as is typical in instrumental variable–type approaches, statistical power rises with the magnitude of the instrument's effect on response.

The intuition behind response instruments is straightforward. If response is ignorable conditional on covariates, then the expected value of $Y$ should be independent of response propensity conditional on covariates for an individual and, hence, across a number of i.i.d. draws. However, if response is non-ignorable, the expected value of $Y$ differs, conditional on covariates, across those with high and low response propensities. Hence, if we observe data from high- and low-response contexts, we can assess whether $Y$ differs and infer whether we are looking at data produced by an ignorable or a non-ignorable response mechanism. I provide a graphical illustration of this intuition when I discuss randomized response instruments below. Sun et al. (2018) provide a formal proof of the conditions under which population quantities are statistically identified when one has a response instrument.

Example 1: Using Observational Data to Account for $\rho$

Like many polls, the 2020 American National Election Study (ANES) overestimated Biden’s support. Biden won the popular vote by 4.4 percentage points, but the 2020 ANES preelection poll reported that Biden led Trump by 11.8 percentage points. Weighting did not help, as Biden’s margin was 12.6 percentage points when responses were weighted.2

Figure 3. Presidential preferences in 2020 ANES survey, by interest in politics.

Signs that $\rho_{R,Y}$ was not zero were hiding in plain view. Following Bailey (2024), Figure 3 displays presidential preferences in the ANES data by political interest. Support for Biden and Trump was essentially equal among respondents “not at all” or “not very” interested in politics. Among those “very interested” in politics, however, there was a huge gap: 61.9% of such respondents supported Biden.3

If people who are more interested in politics are more likely to answer a poll about politics—which hardly seems unreasonable—then the ANES may have had too many people interested in politics and thereby produced a sample that was more pro-Biden than the population. While it seems natural to model likely lower support for Biden among nonrespondents, pollsters did not do so because weighting-type adjustments are not feasible for a variable like political interest, which does not have a known population-level distribution.

Grounding our thinking about polling in the Meng equation, however, makes it harder to dismiss the possibility that $\rho_{R,Y} \neq 0$ due to a nonweighting factor that affected both whether and how people responded to a poll. Rather than shrug our shoulders and say that polling is hard, we can model response and outcome in a way that allows for $\rho_{R,Y} \neq 0$. Peress (2010), for example, did this when he modeled survey measures of turnout in the 1980s. As is often the case, surveys at that time overestimated turnout: even though only 50% of adults turned out to vote, 70% of ANES respondents voted.4 ANES turnout declined to around 60% in the weighted data. Peress incorporated information about response interest, akin to the political interest variable plotted above, and was able to bring estimates to within 1% of actual turnout in 1980 and 1988 and within 2% in 1984.

Directly modeling and estimating $\rho_{R,Y}$ was the crucial element that powered the Peress model. His model, like others in this spirit, jointly modeled $R$—the decision to respond—and $Y$—the content of the response. He linked the two equations via a $\rho$ parameter capturing the degree to which unmeasured factors affect both $R$ and $Y$. The model was identified by including a variable in the $R$ equation that was not included in the $Y$ equation.
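A bare-bones numerical sketch of this identification strategy may help (this is my own construction, not Peress's actual specification): simulate correlated errors in the response and outcome equations, include an instrument $Z$ that enters only the response equation, and apply a Heckman-style two-step correction.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(3)
n = 50_000
Z = rng.normal(size=n)  # response instrument: affects R but not Y
# Correlated unobservables in the response (u) and outcome (e) equations.
u, e = rng.multivariate_normal([0, 0], [[1, 0.6], [0.6, 1]], n).T
R = (0.5 * Z - 0.5 + u) > 0   # respond if the latent index is positive
Y = 2.0 + e                   # true population mean is 2.0; Y seen only if R

# Step 1: probit of R on Z by maximum likelihood.
def nll(b):
    p = norm.cdf(b[0] + b[1] * Z).clip(1e-10, 1 - 1e-10)
    return -(R * np.log(p) + (~R) * np.log(1 - p)).sum()
b = minimize(nll, np.zeros(2)).x

# Step 2: regress observed Y on the inverse Mills ratio (Heckman correction).
idx = b[0] + b[1] * Z[R]
mills = norm.pdf(idx) / norm.cdf(idx)
X = np.column_stack([np.ones(R.sum()), mills])
beta, *_ = np.linalg.lstsq(X, Y[R], rcond=None)

naive = Y[R].mean()   # biased upward because rho > 0
corrected = beta[0]   # intercept recovers the population mean
```

The naive respondent mean overstates the truth, while the selection-corrected intercept lands near the true population mean; the slope on the Mills ratio estimates the $\rho$-driven distortion.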

Using such models requires new thinking and, of course, does not eliminate the need for assumptions. However, instead of assuming that nonrespondents have the same political interest as respondents—as required in standard weighting-type adjustments—these models allow us to incorporate information in the data suggesting that nonrespondents differ from respondents in important ways. In recent years, such models have made use of advances in copulas (Gomes et al., 2019), moment estimators (Burger & McLaren, 2017; Sun et al., 2018), and other approaches that make them more robust to distributional assumptions and other concerns.

Example 2: Randomized Response Instruments

The challenge with observational models is that it is often hard to definitively defend the assumption that one or more variables affect $R$ but not $Y$. We are not helpless in the face of these concerns, however, as we can use the power of randomization to create variables that affect $R$ but not $Y$ by design. Specifically, we can create randomized response instruments that reflect treatments affecting whether or not someone responds. There are many ways to do this. A pollster can, for example, randomly divide potential respondents into two pools and provide one group incentives to respond. Cohn (2022, November 8) did this by offering money to the treatment group, thereby lifting the response rate by 30 percentage points relative to the control group, which was contacted with conventional incentive-less protocols. I describe another approach below.

There are several attractive features of randomized response instruments. First, they build on our long-standing inclination to use sampling design to solve sampling problems; in random sampling, after all, survey design has long been accepted as a better way to reduce sampling error than increasing sample size in a nonrandom way. Second, the approach is simple to implement, as the pollster need only identify a protocol that affects survey response—a familiar task for pollsters, who have long explored how to increase response rates.

Figure 4, based on Bailey (2024), illustrates the logic of randomized response instruments. The purpose of the figure is to highlight how and why such instruments can allow us to assess whether the response mechanism is ignorable or not. The configuration of population values is reasonable, but should not be taken by itself as a general characterization of the world. Reality may deviate from the features in the figure and readers are encouraged to review Bailey (2024), which includes tools such as control functions, copulas, and specification searches that can enable modern non-ignorable nonresponse–oriented tools to work under a broad range of circumstances. These tools do not work under all circumstances, though. Sun et al. (2018) provides a formal treatment of identification and Bailey (2024) provides a practical discussion of how to deal with threats to validity in these models.

Figure 4. Traces of non-ignorable nonresponse in observable data.

As with Figures 1 and 2, the panels in Figure 4 plot the response interest and values of $Y$ for a hypothetical population. The blue dots indicate people who respond to the survey. In the panels on the left, the shapes tilt as in Figures 1 and 2, suggesting that $\rho>0$ because people interested in responding tend to have higher values of $Y$. In the panels on the right, the joint distributions create shapes that are flat, suggesting that $\rho = 0$ because there is no relationship between interest in responding and $Y$. The top panels show instances in which the response rate is around 7%; the bottom panels show instances in which the response rate is around 38%.

When $\rho > 0$, as in the panels on the left, $\overline{Y}$ varies with the response rate. In the panel on the top left, $\overline{Y}=68$, the same value we saw in the left panel of Figure 1. In the panel on the bottom left, $\overline{Y} = 57$. When $\rho = 0$, as in the panels on the right, however, $\overline{Y}=40$ in both the low- and high-response surveys. In other words, when response is ignorable, the estimate will equal the true value in expectation, whatever the sample size (even as precision varies with sample size).

The figure shows how variation in $\overline{Y}$ associated with the response rate carries information about $\rho$. The randomized response instrument induces variation in the response rate (something that is easily verifiable), meaning that, conceptually at least, it is easy to assess whether $\rho=0$. If $\overline{Y}$ is the same in the treatment and control groups, we have evidence that $\rho=0$. If $\overline{Y}$ differs across the treatment and control groups, we have evidence that $\rho\neq0$. In other words, even though $\rho$ is often characterized in terms of unobserved variables, it is not the case that it never leaves a trace. As response rates vary, observed patterns in $Y$ will reflect $\rho$.
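This comparison can be sketched in a few lines (with an invented data-generating process, purely to mirror the logic of the figure): when response is non-ignorable, the group with the higher response rate yields a less distorted $\overline{Y}$, so treatment and control means diverge.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 200_000
Y = rng.normal(0, 1, N)           # true population mean is 0
Z = rng.random(N) < 0.5           # randomized instrument: raises response rate

# Non-ignorable response: probability rises with the instrument AND with Y.
p = 0.1 + 0.3 * Z + 0.2 * (Y > 0)
R = rng.random(N) < p

ybar_treat = Y[R & Z].mean()      # higher response rate, less selected sample
ybar_ctrl = Y[R & ~Z].mean()      # lower response rate, more selected sample
gap = ybar_ctrl - ybar_treat      # a nonzero gap signals rho != 0
```

Rerunning the sketch with `p = 0.1 + 0.3 * Z`, so that response no longer depends on $Y$, drives the gap to zero in expectation—the flat-panel case in the figure.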

While the logic is straightforward, estimation requires models such as those described in Bailey (2024). These models range from widely known Heckman models to copula models to method-of-moments estimators. Bailey (2024) argues that data quality often dominates the choice of model, meaning that the key step is typically creating a good randomized response instrument; with that in hand, the models tend to produce similar results.

For example, Bailey (2023) presents doubly robust estimates that use a randomized response instrument to create estimates that are robust to non-ignorable nonresponse. The double robustness comes from incorporating both weighting- and imputation-based approaches such that the estimate is consistent if either the weighting model or the imputation model is correct. Note that both models allow for non-ignorable nonresponse and can produce quite different estimates from conventional weighting or imputation when there is evidence of non-ignorable nonresponse. Bailey (2024) analyzes these data using parametric and other methods, finding similar results.

Bailey (2024) implements the approach with an Ipsos poll of U.S. voters in 2019. The response instrument was created by randomly assigning potential respondents to high- and low-response protocols. In the low-response protocol, respondents were asked whether they wanted to discuss politics, sports, health, or movies; only those who chose politics were retained in the respondent pool for the models discussed here. In the high-response protocol, people were asked the political questions in the standard way and, therefore, had a much higher likelihood of providing answers. The response instrument was strong—changing response rates by 60 percentage points—and thereby provided enough statistical power to estimate $\rho$ and, in turn, to create population estimates purged of its malign effects. This protocol is feasible in many survey contexts; other randomized response instruments, such as that in Cohn (2022, November 8), have produced large effects on response as well. If one cannot design a randomized response instrument with large effects on response, the methods described here will likely be underpowered.

As is typical in polls, the patterns varied by question and across party. Here I provide three examples to give a flavor of the results.

  • First, the survey asked people how likely they were to vote, giving them five response categories ranging from “absolutely certain to vote” to “will not vote.” In the raw data, 78% of respondents said they were certain to vote; with conventional weights, 75% did. A doubly robust estimate based on weighting and imputation models that used a randomized response instrument to model potential non-ignorable nonresponse found strong evidence of non-ignorable nonresponse ($p < 0.01$), suggesting a strong relationship between willingness to respond and expressing certainty about voting. Such a pattern occurs when the answers in the low-response group differ clearly from the answers in the high-response group, as they did in this case. This model produced an estimate that 55% of people were certain to vote. Since it is not entirely clear how to map answers on a five-category question to actual turnout (which was 67%), it is difficult to say for certain that accuracy increased. At a minimum, the raw and weighted results seem to overestimate turnout, given that they indicated 75% or more of people were certain to vote when 67% actually voted and people in the other categories voted as well. The selection model results, in contrast, moderated the estimate and were consistent with the idea that raw and weighted survey data tend to overestimate turnout.

  • Second, the poll asked people about support for President Trump. In the whole sample, the raw, conventionally weighted, and non-ignorable nonresponse doubly robust models produced similar results. There was, however, interesting variation by region. Among whites in the Midwest—a group for whom polls have tended to underestimate Trump support—raw support for Trump was 45%, a number that fell to 43% when the data were weighted. In the doubly robust model, in contrast, the parameter associated with non-ignorable nonresponse was unlikely to have arisen by chance ($p < 0.05$), which in turn led the selection model to estimate Trump support among whites in the Midwest that was 5 percentage points higher. Because the poll was conducted more than a year before the election, it is hard to gauge accuracy, but it is interesting to note the strong signal in the selection model that conventional polls were underestimating Trump’s support in the Midwest.

  • Third, pollsters worry about $\rho$ on sensitive questions, as people with certain opinions on such matters may be less likely to respond. On race, for example, social pressure may make people with more conservative views less likely to share their opinions with a pollster. On a question about whether it was appropriate for black athletes to kneel during the national anthem, the observed percent conservative among Democrats was about 17%, a number that fell slightly when conventionally weighted. When analyzed with the non-ignorable nonresponse doubly robust model, however, the estimated percent of Democrats giving the more conservative answer rose to 33%, almost double the percent estimated by conventional weights.

5. The Future of Polling

Random sampling is dead. Weighting cannot revive it and the field risks losing coherence as it devolves into a mélange of pollsters using bespoke tools evaluated on past performance rather than common theoretically justified standards. It is time to update our paradigmatic foundations so that they encompass not only the random sampling or assumption-driven weighting methods of the past and present, but also the myriad methods in development that produce nonrandom samples.

Such a new paradigm is indeed available, one that builds on the Meng equation. It is quite general—general enough, in fact, to be used in ecology (Boyd et al., 2023), the mathematics of multidimensional integration (Hickernell, 2018), and particle physics (Courtoy et al., 2023). The equation characterizes sampling error for any poll, yet is specific enough to provide guidance about sources of this error. This new paradigm provides not only a common language that applies to contemporary polling, but also produces unfamiliar insights. Central to this new paradigm is the correlation between whether and how people respond. When this correlation is nonzero, it interacts with population size, meaning that for large populations, even a small correlation can devastate survey accuracy.

The new paradigm also points the field in a different direction than it is currently headed. Currently, most survey research relies on weighting-type tools that assume away the correlation between whether and how people respond, conditional on observable covariates with known population distributions. Such tools are useful, of course, but cover only a limited range of possible conditions, a limitation that is becoming more striking as the polling field moves further away from its random-sampling roots.

This article has provided an overview of the kind of work that naturally emerges in the new paradigm. The general theme is that any nonrandom sample needs to minimize, measure, and/or account for $\rho$. I showed examples that do this with observational data and, even better, with randomized response instruments. Taking non-ignorable nonresponse seriously does not mean we expect to find it everywhere. Indeed, Bailey (2024) provides examples in which surveys designed and analyzed to address non-ignorable nonresponse find no evidence that whether and how people respond are correlated. For some survey questions and subgroups, however, these new tools produce estimates that differ importantly from weighted results. As summarized here and elaborated in Bailey (2024), using randomized response instruments and tools that allow for non-ignorable nonresponse leads to different and arguably better estimates of turnout, Trump support, and racial conservatism.

Much work remains to be done, as the selection models that measure and account for $\rho$ involve new survey designs and analytical tools. Some may find these models unfamiliar or complicated, but we are long past the time for wishing for a simple solution to contemporary survey nonresponse; after all, modern weighting is quite complex and works only by assuming away much of the nonresponse problem. And the fast-growing nonprobability polling industry uses complicated and often opaque protocols.

With a paradigm that better fits the contemporary polling environment, more of the field will be drawn to this important work, and researchers can build from a common foundation that applies directly to the complicated polling environment of today.


Acknowledgments

I am grateful for helpful comments from Xiao-Li Meng, Jon Ladd, and anonymous reviewers. All errors are mine.

Disclosure Statement

Michael Bailey has no financial or non-financial disclosures to share for this article.


References

Bailey, M. A. (2024). Polling at a crossroads: Rethinking modern survey research. Cambridge University Press.

Bailey, M. A. (2023, July 12). Doubly robust estimation of non-ignorable non-response models of political survey data [Paper presentation]. Fortieth annual meeting of the Society for Political Methodology at Stanford University, Stanford, CA, United States.

Boyd, R. J., Powney, G. D., & Pescott, O. L. (2023). We need to talk about nonprobability samples. Trends in Ecology & Evolution, 38(6), 521–531.

Bradley, V. C., Kuriwaki, S., Isakov, M., Sejdinovic, D., Meng, X.-L., & Flaxman, S. (2021). Unrepresentative big surveys significantly overestimated US vaccine uptake. Nature, 600(7890), 695–700.

Burger, R. P., & McLaren, Z. M. (2017). An econometric method for estimating population parameters from non-random samples: An application to clinical case finding. Health Economics, 26(9), 1110–1122.

Clinton, J., Agiesta, J., Brenan, M., Burge, C., Connelly, M., Edwards-Levy, A., Fraga, B., Guskin, E., Hillygus, D. S., Jackson, C., Jones, J., Keeter, S., Khanna, K., Lapinski, J., Saad, L., Shaw, D., Smith, A., Wilson, D., & Wlezien, C. (2021). Task force on 2020 pre-election polling: An evaluation of the 2020 general election polls. American Association for Public Opinion Research.

Cohn, N. (2022, October 12). Who in the world is still answering pollsters’ phone calls? New York Times.

Cohn, N. (2022, November 8). Are the polls still missing ‘hidden’ Republicans? Here’s what we’re doing to find out. New York Times.

Converse, J. M. (2009). Survey research in the United States: Roots and emergence 1890–1960. Transaction Publishers.

Courtoy, A., Huston, J., Nadolsky, P., Xie, K., Yan, M., & Yuan, C. (2023). Parton distributions need representative sampling. Physical Review D, 107(3), Article 034008.

Gelman, A. (2007). Struggles with survey weighting and regression modeling. Statistical Science, 22(2), 153–164.

Gelman, A., & Hill, J. (2007). Data analysis using regression and multilevel/hierarchical models. Cambridge University Press.

Gomes, M., Radice, R., Brenes, J. C., & Marra, G. (2019). Copula selection models for non-Gaussian outcomes that are missing not at random. Statistics in Medicine, 38(3), 480–496.

Hickernell, F. J. (2018). The trio identity for quasi-Monte Carlo error. In A. B. Owen & P. W. Glynn (Eds.), Monte Carlo and quasi-Monte Carlo methods. Springer.

Jackman, S., & Spahn, B. (2019). Why does the American National Election Study overestimate voter turnout? Political Analysis, 27(2), 193–207.

Jacobson, G. C. (2022). Explaining the shortfall of Trump voters in the 2020 pre- and post-election surveys [Unpublished manuscript]. Department of Political Science, University of California, San Diego.

Jamieson, K. H., Lupia, A., Amaya, A., Brady, H. E., Bautista, R., Clinton, J. D., Dever, J. A., Dutwin, D., Goroff, D. L., Hillygus, D. S., Kennedy, C., Langer, G., Lapinski, J. S., Link, M., Philpot, T., Prewitt, K., Rivers, D., Vavreck, L., Wilson, D. C., & McNutt, M. K. (2023). Protecting the integrity of survey research. PNAS Nexus, 2(3), Article pgad049.

Kennedy, C., Blumenthal, M., Clement, S., Clinton, J., Durand, C., Franklin, C., McGeeney, K., Miringoff, L., Olson, K., Rivers, D., Saad, L., Witt, G. E., & Wlezien, C. (2018). An evaluation of the 2016 election polls in the United States: AAPOR task force report. Public Opinion Quarterly, 82(1), 1–33.

Kennedy, C., & Hartig, H. (2019). Response rates in telephone surveys have resumed their decline. Pew Research Center.

Kuhn, T. S. (1970). The structure of scientific revolutions. University of Chicago Press.

Little, R. J., & Rubin, D. B. (2014). Statistical analysis with missing data. Wiley.

Meng, X.-L. (2018). Statistical paradises and paradoxes in big data (1): Law of large populations, big data paradox, and the 2016 presidential election. The Annals of Applied Statistics, 12(2), 685–726.

Meng, X.-L. (2021). Data defect index: A unified quality metric for probabilistic sample and nonprobabilistic sample [Presentation]. Harvard University.

Neyman, J. (1934). On the two different aspects of the representative method: The method of stratified sampling and the method of purposive selection. Journal of the Royal Statistical Society, 97(4), 558–625.

Nishimura, R. [@rnishimura]. (2023, March 29). To add my reasons on why we should avoid using "representative sample": For the laymen, it sounds like a [Tweet]. Twitter.

Peress, M. (2010). Correcting for survey nonresponse using variable response propensity. Journal of the American Statistical Association, 105(492), 1418–1430.

Schoenmueller, V., Netzer, O., & Stahl, F. (2020). The polarity of online reviews: Prevalence, drivers and implications. Journal of Marketing Research, 57(5), 853–877.

Shirani-Mehr, H., Rothschild, D., Goel, S., & Gelman, A. (2018). Disentangling bias and variance in election polls. Journal of the American Statistical Association, 113(522), 607–614.

Silver, N. (2023, March 13). FiveThirtyEight’s pollster ratings.

Sun, B., Liu, L., Miao, W., Wirth, K., Robins, J., & Tchetgen-Tchetgen, E. J. (2018). Semiparametric estimation with data missing not at random using an instrumental variable. Statistica Sinica, 28(4), 1965–1983.

Wang, W., Rothschild, D., Goel, S., & Gelman, A. (2015). Forecasting elections with non-representative polls. International Journal of Forecasting, 31(3), 980–991.

Wu, C. (2023). Statistical inference with non-probability survey samples [Unpublished manuscript]. Department of Statistics, University of Waterloo.

YouGov. (2014). Sampling and weighting methodology for the February 2014 Texas statewide study.


Appendix

Derivation of the Meng equation

Begin by rewriting the sample average in terms of $R$:

$$\overline{Y}_n = \frac{\sum_{i=1}^NR_iY_i}{n} = \frac{\sum_{i=1}^NR_iY_i}{\sum_{i=1}^NR_i} = \frac{\frac{\sum_{i=1}^NR_iY_i}{N}}{\frac{\sum_{i=1}^NR_i}{N}} = \frac{\overline{RY}}{\overline{R}}$$

where $\overline{RY}$ and $\overline{R}$ are the population averages of $R \times Y$ and $R$, respectively.

The difference between the mean of $Y$ in the $R=1$ group and the entire population is $\overline{Y}_n - \overline{Y}_N$:

$$\begin{aligned} \overline{Y}_n -\overline{Y}_N &= \frac{\overline{RY}}{\overline{R}} - \overline{Y}_N \\ &= \frac{\overline{RY} - \overline{Y}_N\overline{R}}{\overline{R}} \\ &= \frac{covar(R, Y)}{\overline{R}} \end{aligned}$$

where $covar(R, Y)$ is the covariance of $R$ and $Y$.

Correlation ($\rho$) is the covariance divided by the product of the standard deviations of the two variables ($\sigma_R$ and $\sigma_Y$, respectively); hence $covar(R, Y) = \rho_{R, Y} \sigma_R\sigma_Y$, where $\rho_{R, Y}$ is the population correlation of $R$ and $Y$. Substituting for $covar(R, Y)$ yields

$$\begin{aligned} \overline{Y}_n -\overline{Y}_N &= \frac{\rho_{R, Y} \sigma_R\sigma_Y}{\overline{R}}\\ &= \rho_{R, Y} \frac{\sigma_R}{\overline{R}}\sigma_Y \end{aligned}$$

Because $R$ is binary, $\sigma_R = \sqrt{p(1-p)}= \sqrt{\frac{n}{N}\left(1-\frac{n}{N}\right)}$. In addition, $\overline{R}=\frac{n}{N}$. Substituting for $\sigma_R$ and $\overline{R}$ and doing some algebra yields the Meng equation.
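The derivation can be checked numerically. The sketch below (my own, with an arbitrary response rule and made-up distribution) confirms that the Meng equation holds as an exact identity when population moments are used:

```python
import numpy as np

rng = np.random.default_rng(4)
N = 10_000
Y = rng.normal(0, 3, N)
# Any response rule works: the Meng equation is an identity, not an approximation.
R = (rng.random(N) < 1 / (1 + np.exp(-Y))).astype(float)
n = int(R.sum())

lhs = Y[R == 1].mean() - Y.mean()           # actual survey error
rho = np.corrcoef(R, Y)[0, 1]               # data defect correlation rho_{R,Y}
rhs = rho * np.sqrt((N - n) / n) * Y.std()  # Meng equation (Equation 1)
```

Note that `Y.std()` uses the population convention (`ddof=0`), matching the derivation; `lhs` and `rhs` agree to floating-point precision.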

Derivation of sampling error in random contact case

For the random contact case, we assume $\frac{\sum_{i=1}^{N}Y_i}{N} = \frac{\sum_{i \in C}Y_i}{N_c}$, where $C$ is the set of people contacted and $N_c$ is the number of people contacted. The probability someone responds given that they were contacted is $p_r = \frac{\sum_{i \in C}R_i}{N_c} = \frac{n}{N_c} = \overline{R}_c$.

$$\begin{aligned} \overline{Y}_c-\overline{Y}_N &= \frac{\sum_{i \in C}R_iY_i}{\sum_{i \in C}R_i} - \frac{\sum_{i=1}^{N}Y_i}{N}\\ &= \frac{\sum_{i \in C}R_iY_i - \frac{\sum_{i \in C}Y_i}{N_c}\sum_{i \in C}R_i}{\sum_{i \in C}R_i} \\ &= \frac{\overline{R_cY_c} - \overline{Y}_c\overline{R}_c}{\overline{R}_c} \\ &= \frac{covar(R, Y)}{p_r} \\ &= \rho_{R, Y}\frac{\sigma_R}{p_r}\sigma_Y \end{aligned}$$

Since $R_i$ is binary, its standard deviation is $\sigma_R = \sqrt{p_r(1-p_r)}$. A bit of algebra yields Equation 2.
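An analogous numerical check for the random contact case (again a sketch with arbitrary choices): within the contacted set, the error relative to $\overline{Y}_c$ satisfies Equation 2 exactly, and $\overline{Y}_c \approx \overline{Y}_N$ because contact is random.

```python
import numpy as np

rng = np.random.default_rng(5)
N, Nc = 1_000_000, 20_000
Y = rng.normal(10, 2, N)
Yc = Y[rng.choice(N, Nc, replace=False)]  # random contact: C is a random subset
# Non-ignorable response among those contacted (response prob rises with Y).
R = (rng.random(Nc) < 1 / (1 + np.exp(-(Yc - 10)))).astype(float)
pr = R.mean()

lhs = Yc[R == 1].mean() - Yc.mean()            # error within the contacted set
rho = np.corrcoef(R, Yc)[0, 1]
rhs = rho * np.sqrt((1 - pr) / pr) * Yc.std()  # Equation 2
```

The data quantity term now involves only the response rate $p_r$, not $N$: the same identity would hold for any population size.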

©2023 Michael Bailey. This article is licensed under a Creative Commons Attribution (CC BY 4.0) International license, except where otherwise indicated with respect to particular material included in the article.
