
Rejoinder: Building a Paradigm That Allows for the Possibility of Non-Ignorable Nonresponse

Published on Nov 09, 2023

I am grateful for the incisive and thought-provoking comments from an eminent set of scholars across a remarkably broad range of disciplines and perspectives. In this rejoinder I emphasize two themes in responding to the commentaries. First, the commentaries provide a useful opportunity to respond to those who are satisfied with the current paradigm. The problem is not that serious survey researchers do not know what non-ignorable nonresponse is; the problem is that the missing at random (MAR)-based weighting practices they rely on are becoming less viable as surveys become further removed from random sampling. The range of disciplines reflected in the commentaries suggests such issues may transcend any single field.

Second, the commentaries helpfully push me to explain and reflect on the assumptions needed to tackle non-ignorable nonresponse. I have done some of this elsewhere (Bailey, 2024) and surely this will be an ongoing conversation. Even as we recognize that serious efforts to combat non-ignorable nonresponse will require assumptions, however, we should not lose sight of the fact that the dominant approach makes a major—and arguably less defensible—assumption that nonresponse is missing at random. The MAR assumption is restrictive, empirically refutable and, in my view, a poor basis for future work in an environment that seems to move further away from random sampling every minute.

This response to the commentaries proceeds as follows. In the first section, I respond to the view that we do not need a new paradigm in polling. In the second section, prompted by the commentaries, I examine and explain the assumptions underlying missing not at random (MNAR)-based survey tools. In the third section, I discuss the influence of population size on survey error, focusing on points raised about the data defect correlation that plays a central role in the Meng equation. The fourth section addresses the possibility that some people may never respond to polls. The fifth section discusses measurement error. In the sixth section, I note the broad range of disciplines represented in the comments and explore ideas about how survey nonresponse may play out in potentially similar ways in political polling and public health.

Survey Research Needs a New Paradigm

Kuhn (1970, p. viii) argued that a scientific paradigm is a set of beliefs and practices that "provide(s) model problems and solutions to a community of practitioners." While there is no papal authority to determine what the accepted beliefs and practices are, I believe a reasonable characterization is as follows: the main survey paradigm defines the central problem as generalizing from a sample to a population. The canonical solution is random sampling.

The reality of nonresponse commonly led to samples that were demonstrably inconsistent with random sampling theory (these were anomalies in Kuhn’s terminology). The field responded by using weights to make samples look more like those that random sampling should produce. Recently, many pollsters have abandoned random sampling. They have kept the weights, though, or use methods built on the same assumptions as weights (methods such as targeted [quota] sampling or multilevel regression with poststratification).

So even as there are some fissures in the contemporary polling paradigm, weighting and other MAR-based adjustments dominate the standard toolkit.

The dominant paradigm is problematic in two ways. First, both strands of the dominant paradigm place little emphasis on "data defects" (to use Meng's [2018] terminology) as a central problem. Certainly, serious researchers of all stripes know that correlation of response and outcome causes bias, but this issue is downplayed as one of multiple possible sources of bias. It is, to be sure, correct that there are many sources of survey error. But seldom do we see the challenge of nonresponse defined as forcefully as it is in the Meng equation, an equation that clearly shows how data defects can overwhelm the informativeness of survey data.

Second, and more importantly, the set of practices shared by the dominant paradigm(s) pays minimal attention to the data defect. There is (appropriately) considerable attention paid to the nuts and bolts of selecting a sampling frame and eliciting response in ways that minimize distortions (Groves et al., 2009). There is also appropriate recognition of the role that effective selection of weighting variables can play in reducing non-ignorable nonresponse (Mercer et al., 2018).

The dominant paradigm lacks, however, a set of expectations and practices regarding modeling, diagnosing, and correcting for non-ignorable nonresponse that can persist even after weighting. Both probability and nonprobability pollsters rely on MAR-based models. It is extremely rare to see them present evidence regarding the data defect. In part this is because they have not clearly defined the problem of survey error in a nonrandom sampling environment. In part, this is because they resist updating their methods.

The Current Paradigm Pays Insufficient Attention to Non-Ignorable Nonresponse

Murray Edelman (2023) argues that the dominant paradigm is fine because it has long recognized the possibility of non-ignorable nonresponse, noting that nonresponse has long been an important node in the total survey error framework.

Edelman (2023) is correct that serious survey researchers are aware of the problems highlighted in the Meng equation. Indeed, the challenge has been widely—and ironically and awkwardly—labeled as non-ignorable nonresponse. This was true from the earliest days of polling: the editors of Literary Digest recognized the problem when they lamented that Republicans seemed more inclined to respond to the survey. "Do Republicans live nearer mailboxes?" they asked (Squire, 1988, p. 127). Recognizing the potential for non-ignorable nonresponse did not make the Literary Digest paradigm robust. The fact that they did nothing about the problem is what made their paradigm ripe for replacement.

The problem with the current paradigm, then, is not that non-ignorable nonresponse is unknown. The problem is that the dominant paradigm defines the problem of survey research in a way that downplays non-ignorable nonresponse and, more importantly, relies on weighting and other tools that assume non-ignorable nonresponse goes away given standard weighting variables.

With Meng's (2018) definition of the problem facing survey research, such passivity regarding a central source of polling error in the contemporary era becomes hard to accept.

Comparing Polling Across the Paradigms

A recent poll illustrates how the existing and new paradigms differ. ABC News and the Washington Post surveyed 1,000 U.S. adults in September 2023 and found that Donald Trump led President Biden by 10 percentage points among voters and by 19 points among people under 40 (Balz et al., 2023). These are serious pollsters who use the weighting tools that dominate the current paradigm.

The pollsters noted that these results were outliers because most other polls had the race essentially tied at 44% each. From a random sampling perspective, one could view their results as the kind of thing that occurs from time to time due to sampling variation. Such an explanation only goes so far, however: if Trump had 44% support, there would be a less than 1% chance of seeing his support at 52% or higher.
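A rough back-of-the-envelope check of that probability claim, assuming roughly 1,000 respondents and ignoring design effects (my own simplifying assumptions, not figures from the poll's methodology statement):

\begin{aligned}
\text{SE} \approx \sqrt{\frac{0.44 \times 0.56}{1{,}000}} \approx 0.016, \qquad z = \frac{0.52 - 0.44}{0.016} \approx 5, \qquad \Pr(Z \ge 5) < 10^{-6} \ll 0.01.
\end{aligned}

Under these assumptions, the 1% figure is, if anything, conservative.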

The pollsters had weighted the sample to appropriate benchmarks, so the remaining explanations are (1) non-ignorable nonresponse, (2) respondents were misrepresenting their views on this survey, but not on other surveys, or (3) Trump was, in fact, far ahead. These explanations reflect vastly different potential states of the world. Explanation (2) is tricky under any paradigm—and likely best suited to the corner of the total survey error world focused on the creation of valid questions. Explanations (1) and (3) are polar opposites, and it is frustrating to witness the helplessness of the dominant polling paradigm to adjudicate between them.

Surveys such as these would look different if grounded in the paradigm I am proposing. Motivated by the urgency of the Meng equation, the pollsters would be concerned that non-ignorable nonresponse could persist after weighting. Their critique of their seemingly anomalous results would home in on non-ignorability. Using conventional tools, the ABC/Washington Post pollsters focused only on whether they had too few young people in the sample. From the perspective of the new paradigm, the focus would be on whether the young people they had in the sample (and up-weighted) were representative of young people in general.

Most importantly, the way in which the poll was implemented would be different if grounded in the new paradigm. We would specify a model that allows for non-ignorable nonresponse and implement protocols that would estimate this centrally important parameter. I am agnostic as to the specific protocol, but one could follow, for example, the easily implemented steps in Chapter 12 of Bailey (2024). Those steps provided evidence inconsistent with the null hypothesis that response is MAR and suggested that conventional survey estimates of Trump’s support among Midwesterners in 2019 were too low. I think all consumers of the 2023 ABC/Washington Post survey would love tests that could shed light on whether Trump’s 19-point lead among young people was real or due in part to non-ignorable nonresponse.

Another basis for resisting change in polling is to focus on Roderick Little's (2023) point that the existing literature does indeed present tools to fight non-ignorable nonresponse. The fourth section, "Approaches to Assessing Selection Bias," of his comment helpfully lists five strategies for estimating MNAR models. I agree that tools exist. My point is that they are underrecognized in the dominant way problems are defined and underutilized in the dominant toolkit. One of my goals in writing the original piece was to get more people to realize that we have more capacity to counter non-ignorable nonresponse than is commonly appreciated. I present a broadly similar list in Chapter 11 of Bailey (2024). My claim is that the dominant practice is to eschew such models, a claim that is buttressed by the paucity of examples where polls are analyzed with the tools described by Little.

Further, I would like to highlight as strongly as possible that my point is not simply that tools exist to address non-ignorable nonresponse, but that these tools require new and different types of data. Little (2023) makes this case when discussing the canonical Heckman model. He notes that "In practice, $\lambda$ is only properly estimable if restrictions are placed on the regressions of $R$ and/or $Y$ on $X$, by setting one or more regression coefficients to zero. Results are then highly sensitive to whether these assumptions are correct."

I couldn’t agree more. In Chapters 8 and 10 of Bailey (2024) I go to great lengths to argue that Heckman models without a valid instrument are prone to failure. However—and this is an absolutely central point—if we have a randomized response instrument, then these assumptions are likely to be correct and, as Little (2023) has shown, the model is ‘properly estimable.’ In other words, we have long had the models and now we need to push home the point that these models need the right data. (And, as I discuss shortly, we need continued examination of the underlying assumptions and limitations of these approaches as well.)
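To make the role of the instrument concrete, the sketch below implements a two-step Heckman-style estimator in which a randomized response instrument $Z$ supplies the exclusion restriction: it shifts response but is excluded from the outcome equation. The data-generating process, the coefficient values, and the choice of a two-step estimator rather than maximum likelihood are my own illustrative assumptions, not anything drawn from Little (2023) or the original article.

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import norm

rng = np.random.default_rng(0)
n = 20_000

X = rng.normal(size=n)                    # demographic covariate
Z = rng.binomial(1, 0.5, size=n)          # randomized response instrument
u, e = rng.multivariate_normal([0, 0], [[1, 0.6], [0.6, 1]], size=n).T

Y = 1.0 + 0.5 * X + e                     # outcome of interest
R = (-0.5 + 0.4 * X + 0.8 * Z + u > 0).astype(int)   # response indicator

# Step 1: probit of response on X and Z, then the inverse Mills ratio.
W = sm.add_constant(np.column_stack([X, Z]))
probit = sm.Probit(R, W).fit(disp=0)
imr = norm.pdf(W @ probit.params) / norm.cdf(W @ probit.params)

# Step 2: OLS of Y on X and the inverse Mills ratio, respondents only.
obs = R == 1
V = sm.add_constant(np.column_stack([X[obs], imr[obs]]))
b0, b1, _ = sm.OLS(Y[obs], V).fit().params

print("naive respondent mean of Y:       ", round(Y[obs].mean(), 3))
print("selection-corrected mean estimate:", round(b0 + b1 * X.mean(), 3))
print("true population mean of Y:        ", round(Y.mean(), 3))
```

In this simulation the naive respondent mean overstates the population mean because the errors in the response and outcome equations are positively correlated; the instrument-identified correction recovers an estimate close to the true mean.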

Once we have such data, my experience is that the results tend to be quite similar across the models chosen. For what it is worth, I find the MNAR inverse propensity weighting (IPW) model of Sun et al. (2018) that I apply in Bailey (2023) to be the easiest to interpret. I find the Heckman model the easiest to implement. Copula and MNAR imputation models tend to produce similar results but with a bit more complexity. Others may differ in matters of taste; I would like to see examples where substantive conclusions vary across these MNAR-type models.

Assumptions in Context

While Edelman (2023) finds little new in the new paradigm, Sharon Lohr (2023) finds much that is new and objectionable. In particular, she expresses skepticism about several assumptions used in models that measure and account for non-ignorable nonresponse.

Before discussing points of difference, I want to note that I am heartened that much of Lohr's (2023) commentary is consistent with my argument. In the section entitled “Advantages of a Probability Sample,” Lohr rewrites and explains the Meng equation. I find her version less clear than Meng's (2018) original articulation, but I am pleased that she and I share interest in the insights from it. Lohr's (2023) section entitled “Weighting and Nonresponse Modeling” argues for random contact—which is exactly my argument in Figure 2 and the related discussion. I’m puzzled why she does not note our agreement on this point.

The thrust of Lohr's (2023) commentary, however, is about assumptions: "Many ... conclusions and recommendations ... rest on strong implicit assumptions about participation mechanisms."

I agree! I am not sure if Lohr (2023) meant to be ironic in her title, "Assuming a Response Mechanism Does Not Make It True." The title aptly captures my central point: assuming MAR does not make it true. The MAR-style weighting that dominates survey research today assumes a specific response mechanism (that response is unrelated to content conditional on covariates). This MAR assumption is highly debatable—we know it is not true in some cases and strongly suspect it may not be true in others. Given that the random sampling paradigm that justifies MAR is an increasingly distant echo of what we actually do, the consequences in this context of sticking with or going back to the weighting advocated by Lohr are grim, as highlighted by Meng (2018). There is evidence in Bailey (2023) and Chapter 10 of Bailey (2024) that answers to turnout questions exhibit clear signs of non-ignorable nonresponse; I am wary of throwing away this intuitive and strong evidence in favor of the strong but potentially wrong MAR assumption.

Lohr (2023) helpfully pushes me to be more explicit in identifying assumptions and exploring their consequences, something that I attempt—probably imperfectly—in Chapter 11 of Bailey (2024). The MNAR approaches I describe certainly involve assumptions, assumptions that may be incorrect and could, therefore, lead us astray. In a moment, I make the case that they are arguably better assumptions than MAR. And even if one remains skeptical of that claim, I believe the field needs models based on different assumptions given the well-documented limitations of the MAR assumptions.

MNAR-based models posit a functional form for the dependence of the outcome of interest and response propensity. Sometimes the dependence is modeled via correlated errors, as in the Heckman model. Copula models allow for many other functional forms beyond bivariate normality (Gomes et al., 2019), as I discuss in Chapter 9 of Bailey (2024).

Alternatively, MNAR models can allow for $Y$ to directly affect $R$, as in the model I discuss in Bailey (2023) following Sun et al. (2018):

\begin{aligned}
\Pr(R=1 \mid Z, X, Y) = g(\gamma_0 & + \gamma_{Z}Z + \gamma_{X}X + \gamma_{Y}Y + \gamma_{ZX}ZX \\
& + \gamma_{XY}XY + \gamma_{ZY}ZY)
\end{aligned} \qquad \text{(2.1)}

where $R$ is a binary variable indicating response, $Z$ is a randomized response treatment, $X$ is a covariate (which for shorthand I’ll refer to as a demographic control), and $Y$ is the outcome of interest being measured in the survey (e.g., presidential approval).

Assumptions

\begin{aligned}
\text{MAR: } && \gamma_Y = \gamma_{XY} = \gamma_{ZY} = 0\\
\text{MNAR: } && \gamma_{ZY} = 0
\end{aligned}

The MAR model assumes that $\gamma_Y = \gamma_{XY} = \gamma_{ZY} = 0$. The MNAR models I use in Bailey (2023) assume only $\gamma_{ZY} = 0$. In other words, those using MAR make multiple assumptions: they assume away the possibility that $Y$ affects response (by assuming $\gamma_Y = 0$), that $Y$ could affect response differently by demographic group (by assuming that $\gamma_{XY} = 0$), and that the effect of the response instrument on response could vary with the outcome (by assuming that $\gamma_{ZY} = 0$). The substantial heterogeneity of effects allowed in MNAR is perhaps obscured in the notation and discussion by Professor Lohr (2023)—effects conveniently assumed away in the MAR-based weighting widely in use today.

I show in Bailey (2024) and Bailey (2023) that randomized response-based models sometimes provide evidence to support the MAR assumption and other times provide evidence of violations of MAR. Using these models to extrapolate to full population estimates requires humility, of course, but rejecting them and using MAR-based weighting instead rests on an equally strong, and arguably stronger, assumption about the response mechanism, an assumption that is inconsistent with the evidence I present.
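The following sketch illustrates that logic with a toy simulation (my own illustrative coefficient values, not estimates from Bailey [2023] or Bailey [2024]). Response follows Equation 2.1 with $\gamma_Y \ne 0$ and $\gamma_{ZY} = 0$; poststratifying on $X$ alone remains badly biased, while the gap between the $X$-adjusted respondent means in the two arms of the randomized instrument $Z$ flags the MAR violation.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500_000

def logistic(v):
    return 1 / (1 + np.exp(-v))

X = rng.binomial(1, 0.5, size=n)        # demographic covariate
Z = rng.binomial(1, 0.5, size=n)        # randomized response instrument
Y = rng.binomial(1, 0.35 + 0.2 * X)     # outcome (e.g., approval)

# Response follows Equation 2.1 with gamma_Y = 1 (non-ignorable) and gamma_ZY = 0.
R = rng.binomial(1, logistic(-2.0 + 1.0 * Z + 0.3 * X + 1.0 * Y))

def x_adjusted_mean(mask):
    """Poststratify respondents in `mask` to the population distribution of X."""
    return sum((X == x).mean() * Y[mask & (X == x)].mean() for x in (0, 1))

resp = R == 1
print("true population mean of Y:        ", round(Y.mean(), 3))
print("MAR-weighted estimate (all resp.):", round(x_adjusted_mean(resp), 3))
# Diagnostic: under MAR these two should match; a gap signals non-ignorability.
print("X-adjusted respondent mean, Z = 0:", round(x_adjusted_mean(resp & (Z == 0)), 3))
print("X-adjusted respondent mean, Z = 1:", round(x_adjusted_mean(resp & (Z == 1)), 3))
```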

Two particular concerns about assumptions loom large in Lohr's (2023) discussion. Lohr’s first concern is implicit in her discussion and may be difficult for a casual reader to identify. In her Figure 1 she presents hypothetical distributions that could explain her hypothetical data equally well. The important thing to recognize is that her scenarios in panels (a) and (b) reject the explicit assumption of the randomized response approach that the response instrument does not directly affect $Y$. Her figures illustrate why the problem is hard and why simply using the same covariates in selection and outcome models is problematic. Hence, I take her figures as further justification for the necessity of randomized response instruments that forcefully improve identification.

In other words, I agree that if the assumptions about randomized response instruments are not met, then the conclusions based on these assumptions will not follow.

The conversation then, must focus on whether these assumptions are reasonable. I think everyone accepts that there are protocols that can increase or decrease response rates; hence there is little debate about the plausibility of the assumption that randomized response instruments affect response.

The disagreement is about whether it is possible to affect response rates in a way that does not affect $Y$. I am more optimistic that it is possible to create a protocol that affects response propensity but not $Y$. As I note in Bailey (2024), pollsters have spent countless hours trying to lower nonresponse; if they believe it is never possible to affect nonresponse without changing $Y$, then these efforts were pointless from the start. In the new paradigm, pollsters should focus more on reducing the data defect correlation than on increasing the response rate.

Lohr (2023) is more skeptical of this assumption, focusing in particular on the assumption that the effect of the randomized response instrument $Z$ on $R$ does not depend on $Y$. Sun et al. (2018) formally explain why this assumption is a sufficient (but not necessary) condition for identifying MNAR models; Bailey (2023) illustrates the point by noting that the estimating equation approach used in MNAR weighting is not feasible if the effect of $Z$ varies with the value of $Y$.

Before responding directly to Lohr's (2023) concerns, we should be clear about the precise nature of the assumption. Lohr notes that "An incentive will often raise response rates more for some population subgroups than others." Her language here would benefit from additional precision. It is completely fine for the incentive to raise response rates more (or less) for population subgroups that are defined by measured covariates (e.g., education, age). Bailey (2024) and Bailey (2023) show examples of exactly this.

Lohr's (2023) concern, then, is narrowly targeted toward the assumption that $Y$ and $Z$ cannot interact in the response equation. We should of course note and recognize the possibility that this assumption is incorrect. But if one is going to throw out models based on concern about this, then I wonder how one can defend MAR-based models that assume $Y$ has no effect on $R$, including any effect that could be particular to a subgroup (e.g., $\gamma_{XY}$ above). I see some tension in rejecting that $Y$ has any effect on $R$ (as her defense of MAR implies), but then going to the mats to argue that $Y$ does indeed affect $R$ in a way that interacts with $Z$.

Instrument Properties

Sharon Lohr (2023) and Shiro Kuriwaki (2023) rightfully point out that response-instrument–based approaches to non-ignorable nonresponse rely on assumptions and require context.

Any response instrument is indeed ‘local,’ meaning that the effects identified can be particular to the form of the instrument and the individuals upon whom it has an effect. A different response instrument could pull a different set of individuals into the respondent pool. And Kuriwaki (2023) notes that response instruments may fail if they induce non-monotonic changes in nonresponse (e.g., some people who would respond under the control condition do not respond under treatment).

Clarifying these assumptions is useful, but I think several things should be kept in mind. First, as is clear in Equation 2.1, it is possible to allow for instruments to have effects that depend on covariates. Chapter 12 of Bailey (2024) explores subgroups (which is one way of allowing for heterogeneous effects) and finds that while the effect of $Z$ on response is generally similar, the effect of non-ignorable nonresponse varies in politically plausible ways.

Second, using multiple instruments can help us explore the landscape by providing multiple local estimates. This could be done within a single survey (although it would likely require a large sample size to be adequately powered) or by aggregating across multiple surveys. For example, multiple surveys using variations of randomized response instruments find that non-ignorable nonresponse inflates turnout estimates in MAR-weighted analyses (Bailey, 2024).

Third, once one has found local evidence of non-ignorable nonresponse, it is hard to put the genie back in the bottle. We may not know the full (nonlocal) nature of non-ignorable nonresponse, but the MNAR models can at least present evidence that the MAR assumption is locally incorrect. If we see such evidence, it seems hard to go back to MAR; at a minimum, we can take the results based on a randomized response instrument as a warning flag that MAR-based weighting may be leading us astray.

In fact, the methods I advocate could often support the use of the MAR-based weighting preferred by Lohr (2023) because they could show no evidence of non-ignorable nonresponse. That is, I think it is entirely likely that for most questions and most subpopulations we will find no evidence that response interest and content of response are correlated. Bailey (2024) argues that MAR-based weights are more efficient under these circumstances—and the decision to use them would be based on evidence rather than hope.

Finally, even as we should remain cautious about the assumptions needed for response instruments to be informative, we can use the results to provide a sense of the scale of effects. For example, if we assume that the local evidence of non-ignorable nonresponse found among Midwesterners for Trump approval in Chapter 12 of Bailey (2024) and in Bailey (2023) generalizes, then the estimated Trump approval was 50%, which was higher than the 40% approval estimated using MAR weights. Trump did indeed outperform polls in the Midwest in 2020. Perhaps Lohr would prefer the lower estimates produced by conventional weights, but it would be helpful for her to acknowledge that, at a minimum, there is evidence to suggest the conventional MAR assumption is at least locally inconsistent with the evidence.

The points raised by Lohr (2023) and Kuriwaki (2023) guide us away from some problematic applications of randomized response instruments. We should be mindful that the joint distribution of $R$ and $Y$ can vary depending on survey protocols. It is common for survey researchers to experiment with different modes by randomly assigning potential respondents to different modes (e.g., phone versus text versus mail). These modes often induce different response rates, perhaps raising hope that such studies could be used to assess non-ignorable nonresponse. However, given that the modes themselves may be pulling in different groups and inducing different response patterns, it is unclear whether we could, as required, exclude the mode assignment from the outcome equation. In my view, the ideal randomized response intervention for non-ignorable nonresponse will have a treatment that induces differential response rates but is otherwise minimally different across treatment and control groups.

The Role of Population Size

Lohr (2023) and Little (2023) express skepticism about the relationship between population size and survey error. This point is one of the central ways in which a paradigm built on the Meng equation differs from the current paradigm.

In Figure 1 in the original article, I illustrate how survey error can be larger in a large population than in a small one. Lohr (2023) interprets my heuristic example in Figure 1 as a general statement that the joint distribution of $Y$ and $R$ is immutable, writing "There is no reason to believe, however, that the distribution of the latent variables $R^*$ would be the same for the two samples. The two surveys can only achieve the same sample size if the response rate for the small population is $N_1/N_2$ times greater than the response rate for the large population."

First, it is important to note that Figure 1 is an illustrative example, not a general statement. The point is to show a case in which the counterintuitive connection between population size and survey error exists. I note in footnote 1 that the figures are not all-else-equal comparisons due to the complexities of the difference between $\rho_{R,Y}$ (the correlation of the binary $R$ and $Y$) and the correlation of the latent response propensity and $Y$, complexities noted by Little (2023) and that I come back to shortly. In fact, in order to keep the visuals the same, the $\rho_{R,Y}$ in the low-population panel is actually higher (which should lead to more survey error!). In other words, Figure 1 illustrates a case in which the small sample produces less error even when the data defect correlation is worse than in the large-population panel. I present a figure below based on simulations in which I hold constant the correlation of latent response propensity and $Y$ (which is equivalent to having the same ‘tilted fish’ figures as in my illustrative example in Figure 1 in the original article).

Second, I stand by the broader point that the relationship between population size and survey error is intuitive (even as it makes polling professionals raised on random sampling uncomfortable). For example, in Chapter 1 of Bailey (2024) I describe a classroom example that may resonate with many faculty in the field. First, I think it is reasonable to believe that voluntary response by students in a classroom is often non-ignorable: the more knowledgeable students are more likely to respond. Second, compare the following two circumstances. In one, you are teaching a 300-person class and ask for five volunteers to answer questions. In the second, you are teaching a 20-person class and ask for five volunteers to answer questions. If you think the five volunteers from the 300-person class will on average be more knowledgeable (and hence further from the class average) than the five volunteers from the 20-person class, you share the intuition that population size can affect survey error.
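A short simulation of the classroom example, under my own toy assumptions about how strongly willingness to volunteer rises with knowledge, makes the intuition concrete:

```python
import numpy as np

rng = np.random.default_rng(2)

def volunteer_gap(class_size, n_volunteers=5, n_sims=5_000):
    """Average gap between the volunteers' mean knowledge and the class mean."""
    gaps = []
    for _ in range(n_sims):
        knowledge = rng.normal(size=class_size)
        # Willingness to volunteer rises with knowledge (non-ignorable response).
        w = np.exp(1.5 * knowledge)
        pick = rng.choice(class_size, size=n_volunteers, replace=False, p=w / w.sum())
        gaps.append(knowledge[pick].mean() - knowledge.mean())
    return np.mean(gaps)

print("average gap, 20-person class: ", round(volunteer_gap(20), 2))
print("average gap, 300-person class:", round(volunteer_gap(300), 2))
```

The five volunteers from the larger class sit much further above their class average because a larger pool makes it possible for all five to come from the extreme right tail of the willingness distribution.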

Roderick Little’s (2023) commentary pushes back against the relevance of population size for survey error for different reasons. Little rewrites bias in a "more intuitive" way as a function of $(1-f)$ where $f = \frac{n}{N}$. What is intuitive is a matter of taste, but I needed to convert $1-f$ to $\frac{N-n}{N}$ to interpret it easily.

Little (2023) argues that sample size controls bias, not population size. He offers a random sampling example with populations of 100,000 and 100 million. Of course, he is correct for the random sampling case; this is beyond dispute.

The crux of the issue is whether survey bias increases with population size in nonrandom sampling. Little (2023) argues that in this case, bias is “relatively” independent of population size. The word “relatively” is doing a lot of work here and I think it is worth unpacking Little’s argument.

Little (2023) is making an important point related to the fact that the data defect correlation in the Meng equation is not easy to interpret and that "correlation usually measures association between continuous variables, and is not a natural measure of association when one of the variables, $R$, is binary." Among other things, this creates the difficulties I noted in footnote 1 and above in response to Lohr's (2023) concerns.

I find the following example to be a simple way to appreciate Little’s (2023) point. Suppose $R$ is binary and equals 1 if $R^*$ (latent propensity to respond) is greater than 0 and $R = 0$ otherwise. Suppose that $R^*$ and $Y$ are strongly correlated. In general, this will induce $R$ (the binary variable) and $Y$ to be correlated, but the magnitude of the correlation of $R$ and $Y$ will be a nonlinear function of the correlation of $R^*$ and $Y$ and response rates. If everyone in the population responds, the correlation of $R$ and $Y$ must be zero even if the correlation of $R^*$ and $Y$ is strong. In other words, because the Meng error decomposition is exact, the bias must be zero when we have observations from everyone in the population even if the association between response interest and $Y$ would cause problems in other response environments.

Figure 1 presents results from simple simulations that highlight how the point of the Meng equation persists even as we recognize the subtleties of the correlation term. For various population sizes indicated on the $x$-axis, I simulated a sample of 100 observations from a process in which the correlation of response propensity $R^*$ and $Y$ was 0.5. The dark red line shows the bias; as suggested by the Meng equation, bias rises with population size. This occurs even as the joint distribution of $R^*$ and $Y$ remains fixed with a constant correlation. However, consistent with Little’s (2023) comments, $\rho_{R,Y}$ (the correlation of the binary $R$ and $Y$) does indeed vary as $N$ changes. Pointing this out does not change the fundamental point that bias is rising with population size.

Figure 1. Simulated bias and correlation parameters for different population sizes.
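The figure itself is not reproduced here, but the simulation it summarizes can be re-created with a short sketch. The selection rule (the 100 units with the highest latent propensity respond) and the specific population sizes are my own choices for illustration; they are not necessarily the exact settings behind the article's figure.

```python
import numpy as np

rng = np.random.default_rng(3)
rho, n_sample, n_sims = 0.5, 100, 200

def simulate(N):
    biases, rho_ry = [], []
    for _ in range(n_sims):
        # Draw (R*, Y) jointly normal with corr(R*, Y) = rho, regardless of N.
        r_star, y = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=N).T
        # The n_sample units with the highest latent propensity respond.
        resp = np.argsort(r_star)[-n_sample:]
        R = np.zeros(N)
        R[resp] = 1
        biases.append(y[resp].mean() - y.mean())
        rho_ry.append(np.corrcoef(R, y)[0, 1])
    return np.mean(biases), np.mean(rho_ry)

for N in (200, 1_000, 10_000, 100_000):
    bias, ddc = simulate(N)
    print(f"N = {N:>7,}: bias = {bias:.3f}, corr(R, Y) = {ddc:.3f}")
```

As the output shows, the respondent-mean bias grows with $N$ while the correlation between the binary $R$ and $Y$ shrinks, which is precisely the interplay discussed above.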

Two things are clear in the figure. First, and most important to the paradigm I am advocating, bias increases with population size. This is a signature element of the new paradigm and runs counter to the random sampling intuition that population size does not matter. Second, Little’s (2023) contribution is to point out that the data defect parameter can be decreasing in population size, offsetting some of the effect of population size.

One can see both these points by examining Little’s (2023) bias equation (Equation 2 in his comment, which I rewrite slightly):

\begin{aligned}
\text{Bias} &= \frac{N-n}{N}\left(\overline{Y}_n - \overline{Y}_{N-n}\right)\\
&= \frac{N-n}{N}\Delta
\end{aligned}

where we let $\Delta = \overline{Y}_n - \overline{Y}_{N-n}$ and assume for this discussion that $\Delta > 0$. Note first that $\Delta$ varies with population size for a given sample size when there is non-ignorable nonresponse; in particular, the difference between the means of $Y$ in the sample and among the nonrespondents increases as the population goes up, meaning $\frac{\partial \Delta}{\partial N} \ge 0$. (Refer to Figure 1 in the original article to see a heuristic example; more generally, note that if we observe 100 people from a population of 200, the sample and the nonrespondents will not differ as much as if we observe a sample of 100 from 1 million people.)

Even in Little’s (2023) formulation, bias increases in population size:

\begin{aligned}
\frac{\partial\,\text{Bias}}{\partial N} = \frac{n}{N^2}\Delta + \frac{N-n}{N}\frac{\partial \Delta}{\partial N} > 0
\end{aligned}


So even as Little is entirely correct that $\rho_{R,Y}$ in the Meng equation depends on population and sample size, so too does $\overline{Y}_n - \overline{Y}_{N-n}$ in his formulation. And in both the Meng (2018) and Little (2023) formulations of bias, two points are clear:

  1. bias increases as population size increases when there is non-ignorable nonresponse

  2. the magnitude of the relationship between bias and $N$ depends on the extent of non-ignorable nonresponse (captured by $\rho_{R,Y}$ in the Meng formulation and $\overline{Y}_n - \overline{Y}_{N-n}$ in the Little formulation).
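To make these two points concrete, here is a small illustrative calculation with a fixed sample size of $n = 100$ and purely hypothetical values of $\Delta$ (chosen for arithmetic convenience, not estimated from any survey):

\begin{aligned}
N = 200,\ \Delta = 0.2: \quad \text{Bias} &= \frac{200-100}{200}(0.2) = 0.10\\
N = 100{,}000,\ \Delta = 0.5: \quad \text{Bias} &= \frac{100{,}000-100}{100{,}000}(0.5) \approx 0.50
\end{aligned}

The increase reflects both mechanisms: the leading factor $\frac{N-n}{N}$ alone would raise the bias from 0.10 to about 0.20 even if $\Delta$ stayed at 0.2, and the growth in $\Delta$ implied by $\frac{\partial \Delta}{\partial N} \ge 0$ accounts for the rest.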

I am grateful to Little (2023) for pushing for more clarity on this point. For now, I leave it as an open question as to which of the formulations best allows us to clarify these subtleties.

Never Responders Are Always a Problem, MAR or MNAR

Lohr (2023) and Kuriwaki (2023) note that never responders are a problem: that is, there are some people who may never respond, and it is possible that these individuals may systematically (and potentially dramatically) differ from the subset of people who potentially respond. Heuristically, we could use the ‘tilted fish’ imagery of Figure 1 but include a group on the far left (the least inclined to respond) who behave quite unlike the shared distribution depicted in the figures. Perhaps the least inclined to respond have very high values of $Y$ on average, for example.

Manski (1990) provides the canonical development of such thinking. In that and subsequent work, he and others identify the minimal inferences we can make in light of the possibility that the nonrespondents could, in theory, have any values of $Y$. The general conclusion is that the bounds on what we know are large—and very large in the context of the tiny response rates we have for random contact surveys and, conceivably, even larger for nonrandom sampling (which will depend on population size, not sample size).

As rich as the Manski bounds tradition is, it is not promising as a practical guide to inference—and it is far removed from the assumptions and frameworks used by those still relying on traditional MAR-based weighting. At the risk of repeating myself, I will note that MAR-based weighting has a similar and arguably more serious problem with never responders. A MAR-based model assumes the $Y$ values of these individuals are completely explainable via the responses of demographically similar respondents.

What, then, are the conditions under which we can generalize to a population if some people may be very unlikely to respond? To the extent that there are people who literally would never respond to the protocols of a survey, we will always be subject to Manski-type concerns. However, I argue that we can generalize if we posit that there are finitely many types of individuals defined by $X, Y, \tau$, where $X$ is a set of observable covariates, $Y$ is the outcome of interest, and $\tau$ is an unobservable (such as, in many contexts, social trust), and that all types have nonzero probabilities of responding. Suppose, for example, that each of these variables is discrete, allowing us to partition the population into cells of different types. If the probability of response is nonzero for each type, I believe that for a sufficiently large sample size we can produce general estimates. This may not be literally (or even approximately) true, but such is the nature of the assumption needed. (And, to be clear, MAR-based models need such assumptions as well.)

Measurement Error

Walter Dempsey (2023) presents a model in which survey error interacts with non-ignorable nonresponse in complicated, interesting, and potentially important ways. (Roderick Little [2023] also notes the absence of measurement error in the Meng equation.) The possibility of incorrect test results looms large in epidemiology (Yiannoutsos et al., 2021), and this is a good example of how a problem that plays a major role in one field may prove instructive in other fields. In my experience, political polling has not generally paid much attention to this issue even though it is, of course, possible that our instruments misfire. As noted by Dempsey, many (including me) would have assumed that measurement error ‘only’ increases standard errors and attenuates relationships between variables. The more insidious nature of measurement error explained by Dempsey (2023) is intriguing and a bit daunting.

My plea here is for Dempsey and colleagues to develop an accessible ‘story’ that illustrates why and how survey error can interact with non-ignorable nonresponse. This is consistent with John Eltinge’s (2023) welcome advocacy of clear, convincing, and nuanced communication about polling challenges. When I first saw the Meng equation I found the relation between population size and survey error to be puzzling, as this went against the random sampling-based survey theory I had grown up on. Without intuition, it was easy to suspect that some sorcery in the derivation had produced a relationship that was not real, or at least not as general as it was being presented. Upon further reflection, I came up with Figure 1 in my article that illustrated how a larger population size could produce larger error. (And, incidentally, the process of creating these figures helped me understand the point—raised by Lohr [2023], Little [2023], and others—that the data defect parameter in the Meng equation itself can depend on response rates. I tried to address this point in footnote 1, but it is subtle and perhaps deeper than I let on.) The panels in Figure 1 are not literally all-else-equal comparisons, but rather illustrate how a small sample (that actually has a higher data defect correlation) can produce a smaller error.

Cross- and Interdisciplinarity

The comments illustrate the advantages of inter- and cross-disciplinary work as scholars in biodiversity (Rob Boyd, 2023), physics (Aurore Courtoy and Pavel Nadolsky, 2023), public health (Walter Dempsey, 2023), medicine (Pavlos Msaouel, 2023), and ecology (Oliver Pescott, 2023) face similar challenges. Progress in dealing with nonresponse proceeds down different paths and at different rates across fields, and it is unlikely that we will solve the huge challenges of the field without an ensemble effort.

Consider, for example, efforts to estimate COVID-19 prevalence by testing a random sample of the population. The most obvious hurdle to randomized testing was its massive cost. But more than resources would be required, because even a well-resourced random testing effort would face substantial nonresponse. In Iceland, an early randomized testing effort yielded a 34% response rate (Gudbjartsson et al., 2020). In the Indiana case cited by Dempsey (2023), randomized testing yielded a 23% response rate (Yiannoutsos et al., 2021).

Non-ignorable nonresponse is concerning in randomized testing efforts. The people asked to test may be a random sample, but the respondents could be skewed because the people with symptoms or exposure may be more likely to agree to be tested. (In other contexts—such as HIV testing—people with symptoms/exposure may be less likely to agree to be tested.)

Public health officials have three options for dealing with nonresponse in randomized testing efforts. One is to put the resources into achieving perfect or nearly perfect compliance. The random testing efforts that yield 30% compliance are already prohibitively expensive for most locales. Getting to 80 or 90% compliance—let alone 100%—is for all intents and purposes infeasible. A second option is to give up and do no random testing. In general, this forgoes a potentially valuable source of information about individual and geographic associations with disease. (Testing of sewer waste may be a theoretically elegant—if practically distasteful—solution for monitoring COVID-19 prevalence.)

The third option is for public health officials to develop models that work with the reality they face: random contact combined with imperfect and potentially non-ignorable nonresponse. Chapter 13 of Bailey (2024) offers ideas on what this would entail. I am sure much more can be done. The tools used by Marra et al. (2017) and McGovern et al. (2015) for HIV random testing and by Bradley et al. (2021) for COVID-19 vaccination rates are great examples of new paradigm thinking in public health.

Conclusion

I wish I had been able to address all the points in the commentaries in the depth they deserve. For example, Andrew Gelman and Gustavo Novoa (2023) document the relationship between interest in politics and presidential preferences over time in American National Election Study (ANES) polling. I would love to see the patterns they identified mapped against polling error in each year. My visual inspection suggests that polling error is not easily predicted by these patterns. While I continue to find the ANES pattern in 2020 a good illustration of possible non-ignorable nonresponse in that year, the Gelman and Novoa analysis raises the possibility that political interest may not be a generally useful predictor of polling error. If that is indeed what the evidence shows, this would justify hesitation about observational response instruments and reinforce the case for randomized response instruments.

I very much appreciate the vigor and variety of the commentaries. I hope that even those who are reluctant to embrace my provocation that random sampling is dead will acknowledge that the gap between random sampling and surveys as practiced is large and growing. Sticking with the intuition and kluges that masked gaps between random sampling theory and practice has become less tenable as those gaps have grown.

There is much to learn. No one—certainly not me—has solved the challenge of inference from nonrandom samples. But we do have many promising theoretical and practical tools that more directly apply to our nonrandom samples. I eagerly await our progress.


Disclosure Statement

Michael A. Bailey has no financial or non-financial disclosures to share for this article.


References

Bailey, M. A. (2023). Doubly robust estimation of non-ignorable non-response models of political survey data [Paper presentation]. Summer meeting of the Society for Political Methodology, Stanford University, Stanford, CA.

Bailey, M. A. (2024). Polling at a crossroads: Rethinking modern survey research. Cambridge University Press.

Balz, D., Clement, S., & Guskin, E. (2023, September 24). Post-ABC poll: Biden faces criticism on economy, immigration and age. The Washington Post. https://www.washingtonpost.com/politics/2023/09/24/biden-trump-poll-2024-election/

Boyd, R. J. (2023). Is it time for a new paradigm in biodiversity monitoring? Lessons from opinion polling. Harvard Data Science Review, 5(3). https://doi.org/10.1162/99608f92.707dff70

Bradley, V. C., Kuriwaki, S., Isakov, M., Sejdinovic, D., Meng, X.-L., & Flaxman, S. (2021). Unrepresentative big surveys significantly overestimated US vaccine uptake. Nature, 600(7890), 695–700. https://doi.org/10.1038/s41586-021-04198-4

Courtoy, A., & Nadolsky, P. (2023). From data defects to response functions: A view from particle physics. Harvard Data Science Review, 5(3). https://doi.org/10.1162/99608f92.edaeee0c

Dempsey, W. (2023). Paradigm lost? Paradigm regained? Comment on “A New Paradigm for Polling” by Michael A. Bailey. Harvard Data Science Review, 5(3). https://doi.org/10.1162/99608f92.49b9b14a

Edelman, M. (2023). Not really a new paradigm for polling. Harvard Data Science Review, 5(3). https://doi.org/10.1162/99608f92.33fb61ba

Eltinge, J. (2023). Methodological context and communication on data quality: Discussion of Bailey (2023). Harvard Data Science Review, 5(3). https://doi.org/10.1162/99608f92.4e03a698

Gelman, A., & Novoa, G. (2023). Challenges in adjusting a survey that overrepresents people interested in politics. Harvard Data Science Review, 5(3). https://doi.org/10.1162/99608f92.d2379c69

Gomes, M., Radice, R., Brenes, J. C., & Marra, G. (2019). Copula selection models for non-Gaussian outcomes that are missing not at random. Statistics in Medicine, 38(3), 480–496. https://doi.org/10.1002/sim.7988

Groves, R. M., Fowler, F. J., Jr., Couper, M. P., Lepkowski, J. M., Singer, E., & Tourangeau, R. (2009). Survey methodology (2nd ed.). Wiley-Interscience.

Gudbjartsson, D. F., Helgason, A., Jonsson, H., Magnusson, O. T., Melsted, P., Norddahl, G. L., Saemundsdottir, J., Sigurdsson, A., Sulem, P., Agustsdottir, A. B., Eiriksdottir, B., Fridriksdottir, R., Gardarsdottir, E. E., Georgsson, G., Gretarsdottir, O. S., Gudmundsson, K. R., Gunnarsdottir, T. R., Gylfason, A., Holm, H., . . . Stefansson, K. (2020). Spread of SARS-CoV-2 in the Icelandic population. New England Journal of Medicine, 382(24), 2302–2315. https://doi.org/10.1056/nejmoa2006100

Kuhn, T. S. (1970). The structure of scientific revolutions. University of Chicago Press.

Kuriwaki, S. (2023). An instrumental variable for non-ignorable nonresponse. Harvard Data Science Review, 5(3). https://doi.org/10.1162/99608f92.cfb82819

Little, R. J. (2023). The “Law of Large Populations” does not herald a paradigm shift in survey sampling. Harvard Data Science Review, 5(3). https://doi.org/10.1162/99608f92.6b049957

Lohr, S. L. (2023). Assuming a nonresponse model does not make it true. Harvard Data Science Review, 5(3). https://doi.org/10.1162/99608f92.2b901b7f

Manski, C. F. (1990). Nonparametric bounds on treatment effects. The American Economic Review, 80(2), 319–323.

Marra, G., Radice, R., Bärnighausen, T., Wood, S. N., & McGovern, M. E. (2017). A simultaneous equation approach to estimating HIV prevalence with nonignorable missing responses. Journal of the American Statistical Association, 112(518), 484–496. https://doi.org/10.1080/01621459.2016.1224713

McGovern, M. E., Bärnighausen, T., Marra, G., & Radice, R. (2015). On the assumption of bivariate normality in selection models: A copula approach applied to estimating HIV prevalence. Epidemiology, 26(2), 229–237. https://doi.org/10.1097/ede.0000000000000218

Meng, X.-L. (2018). Statistical paradises and paradoxes in big data (1): Law of large populations, big data paradox, and the 2016 presidential election. The Annals of Applied Statistics, 12(2), 685–726. https://doi.org/10.1214/18-AOAS1161SF

Mercer, A. W., Lau, A., & Kennedy, C. (2018, January 26). For weighting online opt-in samples, what matters most? Pew Research Center. https://www.pewresearch.org/methods/2018/01/26/for-weighting-online-opt-in-samples-what-matters-most/

Msaouel, P. (2023). The role of sampling in medicine. Harvard Data Science Review, 5(3). https://doi.org/10.1162/99608f92.bc6818d3

Pescott, O. L. (2023). Seek a paradigm and distrust it? Statistical arguments and the representation of uncertainty. Harvard Data Science Review, 5(3). https://doi.org/10.1162/99608f92.a02188d0

Squire, P. (1988). Why the 1936 Literary Digest poll failed. Public Opinion Quarterly, 52(1), 125–133. https://doi.org/10.1086/269085

Sun, B., Liu, L., Miao, W., Wirth, K., Robins, J., & Tchetgen Tchetgen, E. J. (2018). Semiparametric estimation with data missing not at random using an instrumental variable. Statistica Sinica, 28, 1965–1983. https://doi.org/10.5705/ss.202016.0324

Yiannoutsos, C. T., Halverson, P. K., & Menachemi, N. (2021). Bayesian estimation of SARS-CoV-2 prevalence in Indiana by random testing. Proceedings of the National Academy of Sciences, 118(5), Article e2013906118. https://doi.org/10.1073/pnas.2013906118


©2023 Michael A. Bailey. This article is licensed under a Creative Commons Attribution (CC BY 4.0) International license, except where otherwise indicated with respect to particular material included in the article.
