Towards Principled Unskewing: Viewing 2020 Election Polls Through a Corrective Lens from 2016

We apply the concept of the data defect index (Meng, 2018) to study the potential impact of systematic errors on the 2020 pre-election polls in twelve Presidential battleground states. We investigate the impact under the hypothetical scenarios that (1) the magnitude of the underlying non-response bias correlated with supporting Donald Trump is similar to that of the 2016 polls, (2) the pollsters' ability to correct systematic errors via weighting has not improved significantly, and (3) turnout levels remain similar to 2016. Because survey weights are crucial for our investigations but are often not released, we adopt two approximate methods under different modeling assumptions. Under these scenarios, which may be far from reality, our models shift Trump's estimated two-party voteshare by a percentage point in his favor in the median battleground state, and roughly double the uncertainty around the voteshare estimate.


Polling Is Powerful, But Often Biased
Polling is a common and powerful method for understanding the state of an electoral campaign, and in countless other situations where we want to learn some characteristics of an entire population but can only afford to seek answers from a small sample of that population. However, the magic of reliably learning about the many from the very few depends on the crucial assumption that the sample is representative of the population. In layperson's terms, the sample needs to look like a miniature of the population. Probabilistic sampling is the only method with rigorous theoretical underpinnings to ensure this assumption holds (Meng, 2018). Its advantages have also been demonstrated empirically, and it is hence the basis for most sample surveys in practice (Yeager et al., 2011). Indeed, even non-probabilistic surveys use weights to mimic probabilistic samples (Kish, 1965).
Unfortunately, representative sampling is typically destroyed by selection bias, which is rarely harmless. The polls for the 2016 U.S. Presidential election provide a vivid reminder of this reality. Trump won the toss-up states of Florida and North Carolina, but also states like Michigan, Wisconsin, and Pennsylvania, against the projections of most polls. Poll aggregators erred similarly; in fact Wright and Wright (2018) suggest that the overconfidence was due more to aggregation than to the individual polling errors. In the midterm election two years later, the direction and magnitude of errors in polls of a given state were heavily correlated with their own errors in 2016 (Cohn, 2018b). Systematic polling errors were most publicized in the 2016 U.S. Presidential election, but comparative studies across time and countries show that they exist in almost any context (Jennings & Wlezien, 2018; Kennedy et al., 2018; Shirani-Mehr et al., 2018).
In this article we demonstrate how to incorporate historical knowledge of systematic errors into a probabilistic model for poll aggregation. We construct a probabilistic model to capture selection mechanisms in a set of 2016 polls, and then propagate these mechanisms to induce similar patterns of hypothetical errors in the 2020 polls. Our approach builds upon the concept of the data defect index (ddi; Meng, 2018), a quantification of non-representativeness, as well as the post-election push by commentators to "unskew" polls in various ways. The term "unskew" is used colloquially by some pollsters, loosely indicating a correction for a poll that over-represents a given party's supporters. Our method moves towards a statistically principled way of unskewing, taking past polling error as a starting point and decomposing the overall survey error into several interpretable components. The most critical term of this decomposition is an index of data quality, which captures the total survey error not accounted for in the pollster's estimate.
Put generally, we provide a framework to answer the question, "What do the polls imply if they were as wrong as they were in a previous election?" If one only cared about the point estimate of a single population, one could simply subtract off the previous error. Instead, we model a more complete error structure as a function of both state-specific and pollster-specific errors through a multilevel regression, and properly propagate the increase in uncertainty as well as the point estimate through a fully Bayesian model.
Our approach is deliberately simple, because we address a much narrower question than predicting the outcome of the 2020 election. In particular, we focus on unskewing individual survey estimates, not unskewing the entire forecasting machinery. Therefore our model must ignore non-poll inputs, such as economic conditions and historical election results, which we know would be predictive of the outcome (Hummel & Rothschild, 2014). Our model is also simplified (restricted) by using only point estimates reported by pollsters instead of their microdata, which generally are not made public. Since most readers focus on a state's winner, we model the voteshare only as a proportion of the two-party vote, implicitly netting out third-party votes and undecided voters. We largely avoid the complex issue of opinion change over the course of the campaign (Linzer, 2013), although we show that our conclusions do not change even when modeling a much shorter time period where opinion change is unlikely. We do not explicitly model the correlation of prediction errors across states, which would be critically important for reliably predicting any final electoral college outcome.
Finally, the validity of our unskewing is only as good as the hypothetical we posit, which is that the error structure has not changed between 2016 and 2020. For example, several pollsters have incorporated lessons from 2016 (Kennedy et al., 2018), such as adding education as a factor in their weighting (Skelley & Rakich, 2020). Such improvement may render our "no-change" hypothetical scenario overly cautious. On the other hand, it is widely believed that 2020 turnout will be significantly higher than in 2016. Because polling biases are proportionally magnified by the square root of the size of the turnout electorate (Meng, 2018), our use of 2016 turnout in place of 2020's may lead to systematic under-corrections. While we do not know which of these factors has the stronger influence on our modeling (a topic for further research), our approach provides a principled way of examining the current polling results (i.e., the now-cast), with a confidence that properly reflects the structure of historical polling biases.
The rest of the article is organized as follows. First, we describe two possible approaches to decompose the actual polling error into more interpretable and fundamental quantities. Next, we introduce key assumptions which allow us to turn these decompositions into counterfactual models, given that polling bias (to be precisely defined) in 2020 is similar to that in the previous elections. Finally, we present our findings for twelve key battleground states, and conclude with a brief cautionary note.

Capturing Historical Systematic Errors
Our approach uses historical data to assess systematic errors, and then uses them as scenarios (i.e., what if the errors persist?) to investigate how they would impact the current polling results. There are at least two kinds of issues we need to consider: (1) defects in the data; and (2) our failure to correct them. Below we start by using the concept of the "data defect correlation" of Meng (2018) to capture these considerations, and then propose a (less ideal) variation to address the practical issue that survey weights are generally not reported.

Data Defect Correlation.
Whereas it is clearly impossible to estimate the bias of a poll from the poll itself, the distortion caused by non-sampling error is a modelable quantity. Meng (2018) proposed a general framework for quantifying this distortion by decomposing the actual error into three determining factors: data quality, data quantity, and problem difficulty. Let I index a finite population with N individuals, and Y_I be the survey variable of interest (e.g., Y_I = 1 if the Ith individual plans to vote for Trump, and Y_I = 0 otherwise). We emphasize that here the random variable is not Y but I, because the sampling errors are caused by how opinions vary among individuals, as opposed to uncertainties in individual opinions Y_I for fixed I (at the time of survey). Following the standard finite-population probability calculation (e.g., see Kish, 1965), the population average Ȳ_N can then be written as the mean E(Y_I), where the expectation is with respect to the discrete uniform distribution of I over the integers {1, . . . , N}.
To quantify the data quality, we introduce the data recording indicator R_I, which is one if the Ith individual's value Y_I is recorded (i.e., observed) and zero otherwise. Clearly, the sample size (of the observed data) is then given by
$$n = \sum_{I=1}^{N} R_I.$$
Here R_I captures the overall mechanism that resulted in individual answers being recorded. If everyone responds to a survey (honestly) whenever they are selected, then R_I merely reflects the sampling mechanism. However, in reality there are almost always some non-respondents, in which case R_I captures both the sampling and response mechanisms. For example, R_I = 0 can mean that the Ith individual was not selected by the survey or was selected but did not respond. To adjust for non-response biases and other imperfections in the sampling mechanism, survey pollsters typically apply a weighting adjustment to redistribute the importance of each observed Y_I, forming an estimator for the population mean Ȳ_N of the form
$$\bar{Y}_{n,w} = \frac{\sum_{I=1}^{N} w_I R_I Y_I}{\sum_{I=1}^{N} w_I R_I}, \qquad (2.1)$$
where w_I indicates a calibration weight for respondent I to correct for known discrepancies between the observed sample and the target population. Typically weights are standardized to have mean 1, such that the denominator in (2.1) is n, a convention that we follow.
Let ρ = Corr(w*_I, Y_I), the population (Pearson) correlation between w*_I ≡ w_I R_I and Y_I (again, the correlation is with respect to the uniform distribution of I), and let σ be the population standard deviation of Y_I. Define n_w to be the effective sample size induced by the weights (Kish, 1965), that is,
$$n_w = \frac{n}{1 + s_w^2},$$
where s_w² is the sample (not population) variance of the weights (but with divisor n instead of n − 1). It is then shown in Meng (2018) that the actual estimation error is
$$e_Y \equiv \bar{Y}_{n,w} - \bar{Y}_N = \rho \times \sqrt{\frac{1 - f_w}{f_w}} \times \sigma, \qquad (2.2)$$
where f_w = n_w/N is the effective sampling fraction. Identity (2.2) tells us that (i) the larger ρ is in magnitude, the larger the estimation error, and (ii) the direction of the error is determined by the sign of ρ. Probabilistic sampling and weighting ensure that ρ is zero on average, with respect to the randomness introduced by R. But this is no longer the case when (a) R_I is influenced by Y_I because of individuals' selective response behavior, or (b) the weighting scheme fails to fully correct this selection bias. Hence, ρ measures the ultimate defect in the data due to (a) and (b) manifesting itself in the estimator Ȳ_{n,w}. Consequently, it is termed the data defect correlation (DDC).
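To make identity (2.2) concrete, here is a small simulation (all numbers hypothetical) that constructs a finite population with selective response, computes ρ, n_w, and f_w exactly as defined above, and checks that the identity reproduces the actual error:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical finite population: Y_I = 1 if individual I plans to vote for Trump.
N = 100_000
Y = (rng.random(N) < 0.48).astype(float)

# Selective response: supporters respond slightly less often (assumed mechanism).
R = (rng.random(N) < np.where(Y == 1, 0.009, 0.011)).astype(float)
n = int(R.sum())

# Calibration weights for respondents, standardized to mean 1 (so w*_I = w_I R_I).
w = np.zeros(N)
w[R == 1] = rng.uniform(0.5, 1.5, size=n)
w[R == 1] /= w[R == 1].mean()

rho = np.corrcoef(w * R, Y)[0, 1]      # data defect correlation
s2_w = w[R == 1].var()                 # weight variance (divisor n)
n_w = n / (1 + s2_w)                   # Kish effective sample size
f_w = n_w / N                          # effective sampling fraction
sigma = Y.std()

e_Y = (w * R * Y).sum() / (w * R).sum() - Y.mean()  # actual error
identity = rho * np.sqrt((1 - f_w) / f_w) * sigma   # right side of (2.2)
print(abs(e_Y - identity) < 1e-10)  # → True
```

The agreement is exact up to floating point: (2.2) is an algebraic identity, not an approximation, once the weights are standardized to mean 1 among respondents.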
There are two main advantages to using ρ to model the data defect, instead of directly shifting the current polls by the historical actual error e_Y (e.g., from the 2016 polls). First, it disentangles the more difficult systematic error from the well-understood probabilistic sampling error driven by the sample size (encoded in f_w) and the problem difficulty (captured by σ). Using e_Y to correct biases would therefore be mixing apples and oranges, especially for polls with varying sizes.

This can be seen more clearly by noting that (2.2) implies that the mean-squared error (MSE) of Ȳ_{n,w} can be written as
$$\mathrm{MSE}_R(\bar{Y}_{n,w}) = E_R(e_Y^2) = N \, E_R(\rho^2) \times \left[ \frac{(1 - f_w)\,\sigma^2}{n_w} \right],$$
where the expectation is with respect to the recording indicator R (which includes the sampling mechanism). We note that the term in the brackets is simply the familiar variance of the sample mean under simple random sampling (SRS) with sample size n_w (Kish, 1965). It is apparent then that the quantity E_R(ρ²), termed the data defect index (ddi) in Meng (2018), captures any increase (or decrease) in MSE beyond SRS per individual in the population (because of the multiplier N).
Second, in the case where the weights are employed only for ensuring equal probability sampling, ρ measures the individual response behavior, which is a more consequential and potentially more stable measure over time. The fact that ddi captures the design gain or defect per individual in population reflects its appropriateness for measuring individual behavior. This observation is particularly important for predictive modeling, where using quantities that vary less from election to election can substantively reduce predictive errors.
2.2. Weighting Deficiency Coefficient. Pollsters create weights through calibration to known population demographics, aiming to reduce sampling or non-response bias, especially for non-probability surveys (e.g., Caughey et al., 2020). The effective sample size n_w is a compact statistic that captures the impact of weighting in the DDC identity (2.2), but it is often not reported. If n_w does not vary (much) across polls, then its impact can be approximately absorbed by a constant term in our modeling, as we detail in the next section. However, this simplification can fail badly when n_w varies substantially between surveys, particularly in ways that are not independent of ρ.
To circumvent this problem, we consider an alternative decomposition of e_Y using the familiar relationship between correlation and regression coefficients, which leads to
$$e_Y = \eta \times \frac{\sigma^2}{f}, \qquad (2.3)$$
where f = n/N and η is the population regression coefficient from regressing w*_I on Y_I, i.e.,
$$\eta = \frac{\mathrm{Cov}(w^*_I, Y_I)}{\mathrm{Var}(Y_I)} = \rho \, \frac{\sigma_{w^*}}{\sigma}.$$
Identity (2.3) then follows from (2.2) and the fact that σ_{w*} = √(f(1 − f + s_w²)) = f √((1 − f_w)/f_w). Being a regression coefficient, the term η governs how much variability in w*_I can be explained by Y_I, and thus it expresses the association between the variable of interest and the weighted response indicator. Since the weights are intended to reduce this association, η can be interpreted as measuring how efficiently the survey weights correct for a biased sampling mechanism. This motivates terming η the weighting deficiency coefficient (WDC): the larger it is in magnitude, the less successful the weighting scheme is in reducing the data defect caused by the selection mechanism.
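A companion check for (2.3), self-contained and again with hypothetical numbers: the regression coefficient η of w*_I on Y_I, multiplied by σ²/f, recovers the same actual error as (2.2), without any reference to s_w or n_w.

```python
import numpy as np

rng = np.random.default_rng(0)

# Same style of hypothetical population as in the DDC illustration.
N = 100_000
Y = (rng.random(N) < 0.48).astype(float)
R = (rng.random(N) < np.where(Y == 1, 0.009, 0.011)).astype(float)
n = int(R.sum())
w = np.zeros(N)
w[R == 1] = rng.uniform(0.5, 1.5, size=n)
w[R == 1] /= w[R == 1].mean()

w_star = w * R
f = n / N

# eta: population regression coefficient of w*_I on Y_I (divisor-N moments).
eta = np.cov(w_star, Y, ddof=0)[0, 1] / Y.var()

e_Y = (w_star * Y).sum() / w_star.sum() - Y.mean()
print(abs(e_Y - eta * Y.var() / f) < 1e-10)  # → True
```

Note that the check needs only n, N, and the weighted estimate, which is exactly why the WDC route is attractive when s_w is unpublished.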
Therefore, if we assume that our ability to correct bias via weighting, captured by the regression coefficient η, has not substantially improved since 2016, then we can bypass the need to know n_w for either election when we conduct a scenario analysis. Importantly, we must recognize that this is different from assuming ρ remains the same. Since fixing either ρ or η is a new way of borrowing information from the past, we currently do not have empirical evidence to support which one is more realistic in practice. However, the fact that fixing η bypasses the need to know the effective sample size makes it practically more appealing when this quantity is not available.
We recognize that n_w can be recovered from a pollster's methodology report when the reported margin of error (MoE) of a poll estimate µ̂ is based on the standard formula that accounts for the design effect through the effective sample size n_w:
$$\mathrm{MoE} = 2\sqrt{\frac{\hat{\mu}(1-\hat{\mu})}{n_w}}, \qquad \text{in which case} \qquad n_w = \frac{4\hat{\mu}(1-\hat{\mu})}{\mathrm{MoE}^2}.$$
We therefore strongly recommend that pollsters report either how the MoE is computed or the effective sample size due to weighting.
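The inversion above is a one-liner; here is a sketch (the function name and the poll numbers are our own, for illustration):

```python
def effective_sample_size(mu_hat: float, moe: float) -> float:
    """Invert MoE = 2 * sqrt(mu_hat * (1 - mu_hat) / n_w) for n_w."""
    return 4 * mu_hat * (1 - mu_hat) / moe ** 2

# Hypothetical poll: a 48% estimate with a reported 3.5-point margin of error.
n_w = effective_sample_size(0.48, 0.035)
print(round(n_w))  # → 815
```

If the same poll advertises, say, 1,000 interviews, the gap between n and the recovered n_w ≈ 815 is precisely the design effect of the weighting.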

Regression Framework with Multiple Polls.
Whereas we cannot estimate ρ or η from a single poll, identities (2.2) and (2.3) suggest that we can treat them as regression coefficients on the design points X = √((1 − f_w)/f_w) σ and X = f^{−1} σ², respectively, when we have multiple polls. That is, the variations of f_w or s_w² across polls provide a set of design points X_k, where k indexes different polls, from which to extract information about ρ or η, if we believe ρ (or η) is the same across different polls. By invoking a Bayesian hierarchical model, we can also permit the ρ's or η's to vary across polls, with the degree of similarity between them controlled by a prior distribution.
As we pointed out earlier, a current practical issue with using ρ is that pollsters do not report s_w² (or n_w). In addition to switching to work with η, if the s_w²'s do not vary much across polls, then we can treat (1 − f + s_w²) as a constant when f ≪ 1, which is typically the case in practice. This leads to replacing X by X = f^{−1/2} σ as an approximate method for working with ρ (which we will denote by DDC*), since a constant multiplier of X can be absorbed by the regression coefficient. In contrast, when working with η, we can set X exactly as X = f^{−1} σ², which is the square of the approximate X for ρ.
For applications to predicting elections, especially those where the results are likely to be close, we can make a further simplification and replace σ = √(Ȳ_N(1 − Ȳ_N)) by its upper bound 0.5, which is achieved when Ȳ_N = 0.5. The well-known stability of p(1 − p) ≈ 1/4 for 0.35 ≤ p ≤ 0.65 makes this approximation applicable in most polling situations where predictive modeling is of interest.
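The stability claim is easy to verify numerically: over the entire range 0.35 ≤ p ≤ 0.65, σ = √(p(1 − p)) stays within about 5 percent of its upper bound 0.5.

```python
import numpy as np

p = np.linspace(0.35, 0.65, 61)
sigma = np.sqrt(p * (1 - p))

# Even at the endpoints p = 0.35 and p = 0.65, sigma is within ~5% of 0.5.
print(sigma.min() / 0.5)  # ≈ 0.954
```

Since σ enters the design points only as a multiplicative factor, a ~5 percent perturbation is negligible relative to the other uncertainties in the scenario analysis.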
Accepting these approximations, the only critical quantity in setting up the design points is f = n/N. This is in contrast to traditional perspectives of probabilistic sampling, where what matters is the absolute sample size n, not the relative sampling rate f = n/N (more precisely the fraction of the observed response). The fundamental reason for this change is precisely the identity (2.2), which tells us that as soon as we deviate from probabilistic sampling, the population size N will play a critical role in the actual error.
When we deal with one population, N is a constant across polls, and hence it can be absorbed into the regression coefficient. However, to increase the information available for assessing DDC or WDC, we will model multiple populations (e.g., states) simultaneously. Therefore, it is important to determine what N represents, and in particular, to consider "What is the population in question?" This is a challenging question for election polls because we seek to estimate the sentiment of people who will vote, but this population is unknowable prior to election day (Cohn, 2018a; Rentsch et al., 2019). Some pollsters use likely voter models to construct their sampling frames, and others use the population of registered voters. But these models are predictions: e.g., not all registered voters vote, and dozens of states allow same-day registration.
Because total turnout is unknown before the election, we use the 2016 votes cast for President in place of this quantity: N = N^pre. A violation of this assumption in the form of larger turnout in 2020 could affect our results, because our regression term includes N for both DDC* and WDC and we are putting priors on the bias in 2020. In the past six Presidential elections, a state's voter turnout (again, as a total count of votes rather than a fraction) has sometimes increased by as much as 20 percent. Early indicators suggest a surge in 2020 turnout (McCarthy, 2019). With higher turnout, we would rescale our point corrections by approximately √(N/N^pre) for DDC* and N/N^pre for WDC, and further increase the uncertainty. The problem of predicting voter turnout is important yet notoriously difficult (Rentsch et al., 2019), so incorporating it into our approach is a major area of future work. More generally, this underscores once again the importance of assumptions in interpreting our scenario analyses.
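The rescaling can be made concrete with a back-of-the-envelope sketch (a hypothetical 20 percent turnout increase; the factors follow from the design points X ∝ f^{−1/2} for DDC* and X ∝ f^{−1} for WDC):

```python
import math

N_ratio = 1.20  # hypothetical 2020 turnout / 2016 turnout (N / N_pre)

ddc_star_scale = math.sqrt(N_ratio)  # DDC* correction scales with sqrt(N/N_pre)
wdc_scale = N_ratio                  # WDC correction scales with N/N_pre

print(round(ddc_star_scale, 3), round(wdc_scale, 3))  # → 1.095 1.2
```

In other words, under the WDC model a 20 percent turnout surge would inflate the implied bias correction by a full 20 percent, versus about 10 percent under DDC*, which is why ignoring turnout growth leads to systematic under-correction.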

Building a Corrective Lens from the 2016 Election
To perform the scenario analysis as outlined in Section 1, we apply the regression framework to both 2016 and 2020 polling data. We use the 2016 results to build priors to inform and constrain the analysis for 2020 by reflecting the scenarios we want to investigate.
For any given election (2016 or 2020), our data include a set of poll estimates of Trump's two-party voteshare (against Clinton in 2016 and Biden in 2020). Because we use data from multiple states (indexed by i) and multiple pollsters (indexed by j), we denote each poll's weighted estimate as y_ijk (instead of Ȳ_{n,w}) for polls k = 1, . . . , K_ij, where K_ij is the total number of polls conducted by pollster j in state i. We allow K_ij to be zero, because not every pollster conducts polls in every state. Accordingly, we set X_ijk = f_ijk^{−1/2} (for DDC*) or X_ijk = f_ijk^{−1} (for WDC), where f_ijk = n_ijk/N_i. (For simplicity of notation, we drop the subscript w here since all y_ijk's use weights.)

3.1. Data. Various news, academic, and commercial pollsters run polls and make their weighted estimates public, if not their individual respondent-level data. For each poll, we consider its reported two-party estimate for Trump and its sample size. While some pollsters (such as Marquette) focus on a single state, others (such as YouGov) poll multiple states. We focus on twelve battleground states and 17 pollsters that polled in those states in both 2016 and 2020. When a poll reports estimates weighted both to registered voters and to likely voters, we use the former. We downloaded these topline estimates from FiveThirtyEight, which maintains a live feed of public polls. Table 1 summarizes the contours of our data. After subsetting to 17 major pollsters and limiting to pollsters with a FiveThirtyEight grade of C or above, we are left with 458 polls conducted in the three months leading up to the 2016 election (November 8, 2016), as well as 150 polls in 2020 taken from August 3, 2020 to October 21, 2020. Each of these battleground states was polled by 2 to 12 unique pollsters, totalling 4 to 21 polls.
Each state poll canvasses about 500 to 1,000 respondents, and the total number of 2020 respondents in a given state ranges from 5,233 in Iowa to 11,295 in Michigan.
Hindsight should allow us to calculate the DDC and WDC from 2016 (ρ^pre and η^pre, respectively) because we know the value of e_Y. However, because the s_w needed by (2.2) is not available, as discussed before, we set it to zero to obtain an upper bound in magnitude for the DDC, which we will denote by DDC*. This allows us to use the ddi R package (Kuriwaki, 2020) directly. We compute the past WDC of (2.3) exactly, because it was designed to bypass the need for s_w. Table 2 provides the simple averages of the actual errors, DDC*, and WDC from the 2016 polls we use.

Note: The "Actual" column in the first panel of Table 2 is Trump's actual two-party voteshare in 2016. "Error" is the difference between the average of the poll predictions and Trump's actual voteshare; hence a negative error indicates underestimation. DDC* is the upper bound of the data defect correlation ρ obtained from (2.2) by pretending n_w = n (i.e., assuming s_w = 0); WDC is the weighting deficiency coefficient η of (2.3). All values are simple averages across polls; hence they do not rule out the possibility that the average value of DDC* or WDC has a different sign from the average actual error.
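For readers without access to the ddi R package, the DDC* computation amounts to solving (2.2) for ρ with s_w set to zero. A minimal Python analogue, with hypothetical poll numbers (the function name is ours):

```python
import math

def ddc_star(est: float, actual: float, n: int, N: int) -> float:
    """Upper bound (in magnitude) on the data defect correlation rho,
    from identity (2.2) with s_w = 0, so n_w = n and f_w = f = n/N."""
    f = n / N
    sigma = math.sqrt(actual * (1 - actual))
    return (est - actual) / (math.sqrt((1 - f) / f) * sigma)

# Hypothetical state: 600 respondents, 4.8 million votes cast,
# and a poll that put Trump at 46% when his actual voteshare was 48%.
rho_star = ddc_star(est=0.46, actual=0.48, n=600, N=4_800_000)
print(round(rho_star, 5))  # → -0.00045
```

A 2-point polling miss thus corresponds to a DDC* of less than a twentieth of a percentage point, illustrating how tiny correlations between response behavior and vote intention generate large errors once magnified by the population size.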

Table 2 shows two dimensions along which the WDC (and DDC*) differ: states and pollsters. Polls in most battleground states underestimated Trump's voteshare, but state polls had relatively slight errors in Arizona and Georgia, and exhibited the opposite sign in Texas. The differences by state shown in part (A) may be due to varying demographics and population densities, which in turn systematically affect vote-reporting behavior and ease of sampling. Differences by pollster, shown in part (B), can arise from variation in a pollster's survey mode, sampling methods, and weighting methods. These differences can manifest themselves in systematic over- or under-estimates of a candidate's share, so-called "house effects." Thus, we consider these effects together when studying polling bias.
A simple inspection indicates that the average WDCs vary non-trivially by pollster, although all of them underestimated Trump's voteshare across states (individual results not shown). An F-test on the 458 values of WDC by state or pollster groupings rejects the null hypothesis that state-specific or pollster-specific means are the same (p < 0.001). In a two-way ANOVA, state averages comprise 41 percent of the total variation in WDC and pollster averages comprise 16 percent.
It is worth remarking that the values of DDC* in Table 2 are small (and hence the DDCs are even smaller), less than a tenth of a percentage point in magnitude. Meng (2018) found that the unweighted DDCs for Trump's vote centered at about −0.005, or half a percentage point, in the Cooperative Congressional Election Study. The values here are much smaller because they use estimates that already adjust, via weighting, for known discrepancies in sample demographics. Therefore, the resulting weighted DDC measures the data defects that remain after such calibrations.
3.2. Modeling Strategy. As we outlined in Section 2.3, our general approach is to use the 2016 data on DDC, or more precisely, DDC*, shown in Table 2 to model the structure of the error, and apply it to our 2020 polls via casting (2.2) as a regression. Similarly, we rely on (2.3) to use the structure of WDC (η) to infer the corrected (true) population average. Our Bayesian approach also provides a principled assessment of the uncertainty in our unskewing process itself.
Specifically, let µ_i be a candidate's (unobserved) two-party voteshare in state i, and µ^pre_i the observed voteshare for the same candidate (or party) in a previous election. In our example, µ^pre_i denotes Trump's 2016 voteshare, listed in Table 2 (A). A thorny issue regarding the current µ_i is that voters' opinions change over the course of the campaign (Gelman & King, 1993), though genuine opinion changes tend to be rare (Gelman et al., 2016). Whereas models for opinion change do exist (e.g., Linzer, 2013), our scenario analysis amounts to contemplating a "time-averaging" scenario, which can still be adequate for the purpose of examining the impact of historical lessons. However, we also re-examine our data over a much shorter timespan of three weeks, where underlying opinion change is much less likely (Shirani-Mehr et al., 2018). In any case, incorporating temporal variation could only increase our uncertainty above and beyond the results we present here.
We know from the study of uniform swing in political science that a state's voteshare in the past election is highly informative about the next election, especially in the modern era (Jackman, 2014). In every Presidential election since 1996, a state's two-party voteshare has been correlated with that of the previous election at 90% or higher. This motivates us to model a state's swing, δ_i = µ_i − µ^pre_i. Each poll k conducted in state i by pollster j constitutes an estimate of the swing: δ_ijk = y_ijk − µ^pre_i, where, as noted before, y_ijk is the weighted estimate from the kth poll. Then (2.3) can be cast as a regression:
$$\delta_{ijk} = \delta_i + \eta_{ijk} X_{ijk}, \qquad (3.1)$$
where η_ijk and f_ijk (entering through X_ijk) are respectively the realizations of η and f in (2.3) for y_ijk. Identity (3.1) then provides a regression-like setup for estimating δ_i as the intercept, once we put a prior on η_ijk. For our scenario analysis, we first use the 2016 data to fit a (posterior) distribution for η^pre_ijk, which is then used as the prior for the 2020 η_ijk. We then use this prior together with the 2020 individual poll estimates y_ijk to arrive at a posterior distribution for δ_i, and hence for µ_i.
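The regression reading of (3.1) can be sketched with simulated polls for a single state (all numbers hypothetical): because polls vary in size, the design points X_ijk = f_ijk^{−1} let even a simple least-squares fit separate the swing (the intercept) from the weighting deficiency (the slope).

```python
import numpy as np

rng = np.random.default_rng(1)

# One hypothetical state: 30 polls of varying size, population N_i of 4 million.
N_i = 4_000_000
n = rng.integers(600, 1200, size=30)
X = N_i / n                        # design points X_ijk = f_ijk^(-1)

delta_i, eta = 0.010, 2e-6         # assumed true swing and WDC (sigma^2 absorbed)
d = delta_i + eta * X + rng.normal(0, 0.005, size=30)  # swing estimates per (3.1)

slope, intercept = np.polyfit(X, d, 1)  # slope estimates eta, intercept delta_i
```

The Bayesian model in the text replaces this flat fit with state- and pollster-level priors on η_ijk, but the identification logic is the same: without variation in poll sizes, δ_i and η would be confounded.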
Our modeling strategies for ρ and η are the same, with the only difference being that we use a Beta distribution for ρ (because it is bounded) but the usual Normal model for η. We will present the more familiar Normal model for WDC in the main text, and relegate the technical details of the Beta regression to the appendix.

Assuming No Selection Bias.
As a simple baseline for an aggregator, we formulate the posterior distribution of the state's voteshare when E(η_ijk) = 0, i.e., when the only fluctuation in the polls is sampling variability. In this case, equation (3.1) implies that δ_ijk is centered around δ_i, with variability determined by its sample size n_ijk. Typically one would model its distribution as Normal, but to account for other potential uncertainties (e.g., uncertainty in the weights themselves), we adopt a more robust Laplace model, with scale 1/2 to match the 1/4 upper bound on the variance of a binary outcome. This leads to our baseline (Bayesian) model
$$\delta_{ijk} \mid \delta_i \sim \mathrm{Laplace}\!\left(\delta_i, \; \frac{1}{2\sqrt{n_{ijk}}}\right), \qquad \delta_i \overset{i.i.d.}{\sim} N(\mu_\delta, \tau_\delta^2), \qquad (3.2)$$
where µ_δ can be viewed as the swing at the national level. Using historical data, we found the priors τ_δ ∼ 0.1 · logNormal(−1.2, 0.3) and µ_δ ∼ N(0, 0.03²) to be reasonably weakly informative (e.g., this suggests it is rare for the swing to exceed 6 percentage points, which indeed has not happened since 1980; the average absolute swing during 2000-2016 was 2.7 percentage points).
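A toy illustration of this baseline model for one state, using the Laplace likelihood and Normal prior in the form reconstructed above (hypothetical polls; a grid-search MAP estimate stands in for full MCMC):

```python
import numpy as np

rng = np.random.default_rng(2)

# Eight hypothetical polls of one state's swing, with sizes 600-1000.
n_k = rng.integers(600, 1000, size=8)
delta_k = 0.015 + rng.laplace(0.0, 1 / (2 * np.sqrt(n_k)))

def neg_log_post(delta):
    # Laplace(delta_i, 1/(2*sqrt(n))) likelihood plus a N(0, 0.03^2) prior.
    like = np.sum(2 * np.sqrt(n_k) * np.abs(delta_k - delta))
    prior = delta ** 2 / (2 * 0.03 ** 2)
    return like + prior

grid = np.linspace(-0.10, 0.10, 2001)
map_est = grid[np.argmin([neg_log_post(d) for d in grid])]
```

Because the Laplace likelihood penalizes absolute rather than squared deviations, the MAP estimate behaves like a precision-weighted median of the polls, mildly shrunk toward zero swing by the prior.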
The Bayesian model will lead naturally to shrinkage estimation on cross-state differences in swings (from 2016 election to 2020 election). On top of being a catch-all for possible deviations from the simple random sample framework, the Laplace model has the interpretation of using a median regression instead of mean regression (Yu & Moyeed, 2001).

Modeling the WDC of 2016.
In order to show how historical biases can affect the results from the above (overconfident) model, we assume a parametric form for the η_ijk of polls conducted in 2016, and then incorporate it into (3.2). Given the patterns in Table 2, we model η_ijk via a multilevel regression with state-level and pollster-level effects for both the mean and the variance. We pool the variances on the log scale to transform the parameter to be unbounded. Specifically, we model
$$\eta_{ijk} \sim N(\theta_{ij}, \sigma_{ij}^2), \qquad \theta_{ij} = \gamma_0 + \gamma^{\mathrm{state}}_i + \gamma^{\mathrm{pollster}}_j, \qquad \log \sigma_{ij} = \phi_0 + \phi^{\mathrm{state}}_i + \phi^{\mathrm{pollster}}_j, \qquad (3.3)$$
where (the generic) γ^state, γ^pollster, φ^state, φ^pollster are four Normal random effects; that is, γ^state_i ∼ N(0, τ_γ²), and similarly for the other three. Each of the four prior variances τ² is itself given a weakly-informative, boundary-avoiding Gamma(5, 20) hyper-prior. The two intercepts γ_0 and φ_0 are given an (improper) constant prior.
The θ_ij can be interpreted as the mean WDC for pollster j conducting surveys in state i, while σ_ij determines the variation in WDC within that state-pollster pair. The shrinkage induced by (3.3), besides being computationally convenient, attempts to crudely capture correlations between states and pollsters when considering systematic biases, as was the case in the 2016 election.
3.5. Incorporating 2016 Lessons for 2020 Scenario Modeling. By (3.1), the setup in (3.3) naturally implies a model for estimating the swing from 2016 to 2020:
$$\delta_{ijk} \mid \delta_i, \eta_{ijk} \sim \mathrm{Laplace}\!\left(\delta_i + \eta_{ijk} X_{ijk}, \; \frac{1}{2\sqrt{n_{ijk}}}\right), \qquad \eta_{ijk} \sim N(\theta_{ij}, \sigma_{ij}^2). \qquad (3.4)$$
Although (3.4) mirrors the model for the 2016 data, namely (3.3), there are two major differences. First, the main quantity of interest is now δ_i, not η_ijk. Second, we replace the Normal random-effect models of (3.3) by informative priors derived from the posteriors we obtained using the 2016 data. Specifically, let γ_ν be any member of the collection of γ-parameters, denoted by {γ_0, γ^state_i, γ^pollster_j, i, j = 1, . . .}, and similarly let φ_ν be any member of the φ-parameters {φ_0, φ^state_i, φ^pollster_j, i, j = 1, . . .}. Then, for computational efficiency, we use Normal approximations as emulators of their actual posteriors derived from (3.3); that is, we assume
$$\gamma_\nu \sim N\big(E_{mc}(\gamma_\nu^{pre}), \mathrm{Var}_{mc}(\gamma_\nu^{pre})\big), \qquad \phi_\nu \sim N\big(E_{mc}(\phi_\nu^{pre}), \mathrm{Var}_{mc}(\phi_\nu^{pre})\big), \qquad (3.5)$$
where the "pre" superscripts denote the previous election, and E_mc and Var_mc indicate respectively the posterior mean and posterior variance obtained via MCMC (in this case with 10,000 draws).
We use the same weakly informative priors δ_i ∼ N(µ_δ, τ²_δ), i.i.d., as in the last line of (3.2) to complete the fully Bayesian specification.
In using the posterior from 2016 as our informative prior for the current election, we have opted to put priors on the random effects as opposed to the actual θ ij or σ ij . This has the appealing effect of inducing greater similarity between polls where either the state or pollster is the same, but an alternative implementation gives qualitatively similar results.

Just Accepted
3.6. More Uncertainties about the Similarity Between Elections. Overconfidence is a perennial problem for forecasting methods (Lauderdale & Linzer, 2015). As Shirani-Mehr et al. (2018) show using data similar to ours, the fundamental variance of poll estimates is larger than what an assumption of simple random sampling would suggest.
Here we relax our general premise that the patterns of survey error are similar between 2016 and 2020 by considering scenarios that reflect varying degrees of uncertainty about the relevance of the historical lessons. Specifically, we extend our model to incorporate additional uncertainty in our key assumption by introducing a (user-specified) inflation parameter λ that scales the variance of the γ_ν. Intuitively, λ reflects how strongly we believe in the similarity between the 2016 and 2020 election polls (with respect to DDC* and WDC). That is, we generalize the first normal model in (3.5) (which corresponds to setting λ = 1) to

$$\gamma_{\nu} \sim N\big(E_{mc}(\gamma_{\nu}^{pre}),\, \lambda\, Var_{mc}(\gamma_{\nu}^{pre})\big). \tag{3.6}$$

Although (3.6) does not explicitly account for improvement in 2020, by downweighting a highly unidirectional prior on the η we achieve a similar effect for our scenario analysis.
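Operationally, the λ-inflation rescales only the prior variance while leaving the prior mean untouched; a one-line sketch (the helper name `inflated_prior` is ours, for illustration):

```python
import numpy as np

def inflated_prior(draws_2016, lam):
    """Prior N(E_mc, lam * Var_mc) as in (3.6); lam = 1 recovers (3.5),
    and larger lam downweights the 2016 lessons."""
    return float(np.mean(draws_2016)), lam * float(np.var(draws_2016))

rng = np.random.default_rng(3)
draws = rng.normal(0.5, 0.2, size=10_000)  # stand-in for 2016 MCMC draws
mu1, v1 = inflated_prior(draws, lam=1.0)
mu8, v8 = inflated_prior(draws, lam=8.0)   # mean unchanged, variance x8
```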
We remark in passing that our use of λ has a similar effect to adopting a power prior (Ibrahim et al., 2015), which uses a fractional exponent of the likelihood from past data as a prior for the current study. Both methods reduce the impact of the past data to reflect our uncertainty about their relevance to the current study. We use a series of λ's to explore different degrees of this downweighting.

3.7. Sensitivity Check. There are many possible variations on the relatively simple framework presented here. In the Appendix we address two possible concerns with our approach: variation by pollster methodology and variation in the time window we average over. A natural concern is that our mix of pollsters might be masking heterogeneity in excess of that accounted for by performance in the 2016 election cycle. In Figure B2 we show separate results dividing pollsters into two groups, using FiveThirtyEight's pollster "grade" as a rough measure of pollster quality/methodology. We find that our model results do not vary by this distinction in most states, which is reasonable given that our overall model already accounts for house effects.
Separately, one might be concerned that averaging over a three-month period could mask considerable change in actual opinion. In Figures B3 and B4, we therefore implement our method on polls conducted only in the last three weeks of our 2020 data collection period, a window arguably narrow enough to capture a period in which opinion does not change. Overall, we draw qualitatively similar conclusions: models fit to data from the last three weeks have similar point estimates and only slightly larger uncertainty estimates.

A Cautionary Tale
Our key result of interest is the posterior distribution of µ i for each of the twelve battleground states. We compare our estimates with the baseline model that assumes no average bias, and inspect both the change in the point estimate and spread of the distribution.
Throughout, posterior draws are obtained using Hamiltonian Monte Carlo (HMC) as implemented in Rstan (Stan Development Team, 2020). For each model, Rstan produced 4 chains, each with 5,000 iterations, with the first 2,500 iterations discarded as burn-in, yielding 10,000 posterior draws for Monte Carlo estimation. Convergence was diagnosed by inspecting traceplots and the Gelman-Rubin diagnostic (Gelman & Rubin, 1992), with R̂ < 1.01. For numerical stability, we multiply the 2016 DDC* by 10^2 and the WDC by 10^4 when performing all calculations. As noted in our modeling framework, the WDC approach has practical advantages over using DDC*, so we present only those results in the main text; in Figure B2 in the Appendix, we also present the DDC* models and find that they are almost indistinguishable from the WDC models in most states. We caution that this is contingent on the assumption of turnout remaining the same: for example, in Minnesota, if the 2020 total number of votes cast were 5 percent higher than in 2016, our WDC estimates would shift further away from the baseline model to further compensate for the 2016 WDC.
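For reference, the Gelman-Rubin diagnostic used above can be computed directly from the chains; the following is a simplified (non-split) version for illustration, not the code used in our analysis:

```python
import numpy as np

def gelman_rubin(chains):
    """Simplified (non-split) R-hat for an array of shape (n_chains, n_iter)."""
    m, n = chains.shape
    B = n * chains.mean(axis=1).var(ddof=1)   # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()     # within-chain variance
    var_hat = (n - 1) / n * W + B / n         # pooled variance estimate
    return float(np.sqrt(var_hat / W))

# Four well-mixed chains of 2,500 post-burn-in draws each,
# mimicking the 4 x 2,500 configuration described above
rng = np.random.default_rng(4)
chains = rng.normal(0.0, 1.0, size=(4, 2500))
rhat = gelman_rubin(chains)  # close to 1 for well-mixed chains
```

Values of R̂ near 1 indicate that the between-chain and within-chain variances agree, which is the criterion (R̂ < 1.01) we applied.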

In general, as anticipated from our assumptions, the magnitude and direction of the shift roughly accord with the sample average errors observed in 2016 (Table 2). Midwestern states that saw significant underestimates in 2016 exhibit an increase in Trump's voteshare under our model, while states like Texas, where polling errors were small, tend to see a smaller shift from the baseline.
A key advantage of our results, as opposed to simply adding back the polling error as a constant, is that we capture uncertainty through a fully Bayesian model. The posterior distributions show that, in most states, the uncertainty of the estimate increases under our models compared to the baseline model. We focus on this pattern in Figure 2, where we show how much the posterior standard deviation of a state's estimate increases under our model over the baseline model, plotted against the state's population as used in our modeling. In the median state, the standard deviation of our WDC model output is about 2 times larger than that of the baseline. Figure 2 shows that this increase in uncertainty is systematically larger in higher-population states, such as Texas and Florida, than in lower-population states such as Iowa or New Hampshire. This is precisely what the Law of Large Populations suggests (Meng, 2018), because our regressor X includes the population size N in the numerator. It is important to stress that the uncertainties implied by our models capture only a particular sort of uncertainty: that based on the distribution of the DDC* or WDC varying at the pollster level and state level. There are many other sources of uncertainty that can easily dominate, such as a change in public opinion among undecided voters late in the campaign cycle, a non-uniform shift in turnout, and a substantial change in the DDC*/WDC of pollsters from 2016.

While these uncertainties are essentially impossible to model, we can easily investigate the uncertainty in our belief about the degree of recurrence of the 2016 surprise. Figure 3 shows the sensitivity of our scenario analysis to the tuning parameter λ introduced in Section 3.6, which indicates the relative weighting between the 2016 and 2020 data. For each value of λ ∈ {1, 1.5, ..., 8}, we estimate the full posterior in each state as before and report three statistics: the proportion of draws in which Trump's two-party voteshare exceeds 50 percent (top panel), the posterior mean (bottom-left panel), and the posterior standard deviation (bottom-right panel).
We see that, as λ increases, so does the standard deviation: an eight-fold increase in λ leads to roughly a doubling of the spread of the posterior. In most states (with the exceptions of Texas and Ohio), the increase in λ, which can be taken as a downweighting of the prior, also shifts the mean of the distribution against Trump by about 3-5 percentage points, and toward the baseline model. These trends consequently pull down Trump's chances in likely Republican states, and bring the results in Texas and Ohio toward tossups.
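The three reported statistics are simple functionals of the posterior draws; the sketch below computes them from a set of hypothetical voteshare draws (the helper name `summarize` and all numeric values are ours, for illustration):

```python
import numpy as np

def summarize(voteshare_draws):
    """The three per-state statistics reported for each lambda."""
    return {
        "p_trump_over_50": float(np.mean(voteshare_draws > 0.50)),
        "post_mean": float(np.mean(voteshare_draws)),
        "post_sd": float(np.std(voteshare_draws, ddof=1)),
    }

rng = np.random.default_rng(5)
draws = rng.normal(0.51, 0.02, size=10_000)  # hypothetical two-party voteshare
stats = summarize(draws)
```

Repeating this for each λ on the grid {1, 1.5, ..., 8} yields the curves plotted in Figure 3.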

Conclusion
Surveys are powerful tools for inferring unknown population quantities from a relatively small sample of data, but the bias of an individual survey is usually unknown. In this paper we show how, with hindsight, we can model the structure of this bias by incorporating fairly simple covariates such as population size. Methodologically, we demonstrate the advantage of using metrics such as the DDC or WDC to quantify bias in poll aggregation, rather than simply subtracting off point estimates of past error. In applying our approach to the 2020 U.S. Presidential election, we urge caution in interpreting a simple aggregation of polls. At the time of our writing (before election day), polls suggest a Biden lead in key battleground states. Our scenario analysis confirms such leads in some states while casting doubt on others, providing a quantified reminder of a more nuanced outlook on the 2020 election.

Figure B2 shows the DDC* models (in orange) in addition to the WDC models (in red). In the center and right columns, we also show results separating pollsters by FiveThirtyEight's pollster "grade." The grade is computed by FiveThirtyEight incorporating the historical polling error of the pollsters, and is listed in Table B1. When modeling the 2016 error, we retroactively assign these grades to the pollsters instead of using their 2016 grades.

B.2. Only Using Data 3 Weeks Out.
To address the concern that using data from 3 months prior to the election could smooth out actual changes in underlying voter support, we redid our analysis with the last 3 weeks of our data, from October 1 to 21, 2020. This resulted in less than half the number of polls (72 polls in the same twelve states, from 15 pollsters). Figure B3 shows the same posterior estimates as Figure 1 but using this subset. The model estimates are largely similar to those using the full 3 months of data. To show the key differences more clearly, we compare the summary statistics of the two sets of estimates directly in the scatterplot in Figure B4. This confirms that the posterior means of the estimates barely change, even with less than half of the data. The standard deviations of the estimates increase, as expected, but only slightly.

Figure B2. DDC* and WDC models. In the first column, we show estimates from all pollsters, as in the main text, for both the DDC* and WDC models; the legend ("Method for Aggregating 2020 Polls") distinguishes "Assuming No Bias," "Account for WDC," and "Account for DDC*." In the next two columns, we do the same but split the pollsters into two groups by their FiveThirtyEight grade.

Figure B4. Comparison of Summary Statistics with the 3-Week Subset. Each point represents a summary statistic (a mean or a standard deviation) for one state-model combination.