
Psephological Correlated Simulation Techniques With Decision Desk HQ: For the 2022 Midterms and Beyond

Published on Sep 01, 2022

Abstract

In today’s highly polarized political environment, there is a high level of interest in accurate forecasts of presidential and congressional elections, addressing how likely a given party is to win an election or a chamber of Congress. However, electoral outcomes at the federal level in the United States are strongly correlated from one state to another. Correctly capturing these race-level correlations is a major computational challenge that affects even the most well-known political forecasters, and there is no accepted way of handling the problem. This analysis presents a correlated simulation methodology for two-party and multiparty scenarios, and outlines a methodology for estimating congressional majority sizes in real time on election night. Taking the 2022 midterm election as an example, we construct a covariance matrix characterizing the degree of correlation from one U.S. geographic region to another, and use this matrix to forecast individual race outcomes. We discuss our simplification of the U.S. electoral landscape into key regions, allowing us to avoid the daunting and impractical task of separately estimating thousands of pairwise correlations between states and congressional districts. Our simplification does not reduce the dimensionality of the correlation structure, but permits an accurate and more computationally efficient simulation than the alternative. The multiparty scenario uses a Senate primary as an example, with discussion of the underlying distributions and correlation structure.

Keywords: elections, simulation, midterms, correlation


Media Summary

In recent years, many different election forecasts have become available online, attempting to accurately predict the results of an election. They project vote shares, seat counts, and ‘tipping-point’ districts. One challenge that arises when generating the results of a forecast is making the random variables in each simulation correlated, as, for example, with U.S. states in a presidential election. The initial values in a simulation are the forecasted values, but every good forecaster knows there is some uncertainty in their forecast, and this uncertainty is not without a pattern. This analysis demonstrates efficient ways to simulate two-party elections while properly accounting for the shape and size of this uncertainty. It also demonstrates a correlated simulation methodology for multiparty election scenarios, although the multiparty method is currently less efficient than the two-party simulations. However, the multiparty methodology we describe could be a starting point for faster and more accurate simulation methods. Aside from our preelection simulation methodology, we also demonstrate how we can update and improve our model’s predictions when the margins of some districts are known, with examples for both two-party and multiparty scenarios. Last, we show how we can model elections when only the outcome (an R or D win) is known, but the eventual margin is unknown. Through a creative mix of subsetting and rejection sampling, Decision Desk HQ is able to accurately simulate and improve its forecasts in real time, improving on some of the shortcomings of previous attempts to forecast elections up to and through election night.

Decision Desk HQ (DDHQ), founded in 2012, provides a full suite of services connected with election results and analytics. We provide rapid election night results and race calls for primary and general elections across all 50 states, sourced directly from state and county governments. We also supply real-time election night data via API to a range of major media organizations, including Vox, The Economist, NewsNation, BuzzFeed, Business Insider, and a host of private clients. DDHQ was one of only nine sources officially recognized by Twitter for election results in 2020, and was the first media source to project Joe Biden as the winner of the 2020 presidential election and Donald Trump as the winner of the 2016 election. DDHQ also provides an industry-standard forecast for presidential and congressional elections. In 2020, the DDHQ presidential forecast successfully predicted Joe Biden as the winner of the presidential election, projecting him to receive 318 total electoral votes, only 12 above his ultimate total of 306.


1. Introduction

There are multiple challenges that we address in this article. The first is creating an initial correlation and covariance matrix that is both plausible and usable across elections. The second is producing credible conditional estimates in both hypothetical and real-time scenarios. We discuss both challenges in detail in this section, present solutions for both two-party and multiparty (three or more parties) scenarios in the Methodology section, and provide illustrative examples in the Discussion section, with a short summary at the end. In general, we focus not so much on the creation of a forecasting model as on the execution of one: how forecasters should conduct simulations and make their forecasts versatile enough to be usable up to and during election night.

Creating a correlation structure for election-related scenarios has been discussed extensively in the literature, such as in Isakov and Kuriwaki (2020), Lauderdale and Linzer (2015), and Heidemanns et al. (2020). Other election forecasters have tried to estimate every correlation between all pairs of states (over 2,000 pairs) even though there are so few recent elections to train on. One particularly instructive piece is Gelman (2020), a blog post that analyzed Nate Silver’s 2020 election model. It is a great piece not only because it compares two well-known and respected models, but also because it shows how even a carefully designed correlation structure can be shown to be very implausible. Gelman mentioned the overestimation of tail probabilities as a possible issue (FiveThirtyEight uses a wide-tailed t distribution instead of a normal distribution; Silver, 2016), pointing out that Biden’s probability of winning Alabama should be much less than 2%. However, the main flaw in the FiveThirtyEight model that Gelman was picking out was that some pairs of states had very low pairwise correlation in Silver’s model. When running simulations of Silver’s model, Gelman found that the probabilities of Trump wins in Alaska and Pennsylvania, given a Trump win in New Jersey, were just 57% and 39%, respectively. When DDHQ tested the scenarios Gelman outlined on our own 2020 model, our probabilities of a Trump victory in Alaska and in Pennsylvania, given a Trump win in New Jersey, would have been about 99.96% and 74%, respectively. We elaborate on how we obtain those values in a later section. Even in Gelman’s 2020 model, the pairwise correlations were estimated and ended up being extremely variable, with $Corr(NV, NH) = 0.25$ and $Corr(OH, MI) = 0.88$, to name some examples (Heidemanns et al., 2020). Thus, we only try to estimate a few levels of correlation well, to avoid estimating thousands of correlations poorly. We also outline the basic properties that any correlation matrix should have, regardless of how the values are estimated.

2. Methodology

2.1. Workflow

Below is a chart outlining the workflow of our methodology and ideas. Our forecast contains a predicted margin for every Senate and House race in 2022, and this prediction comes from both fundamentals (fundraising, demographics, election history, redistricting) and a time-weighted, bias-adjusted polling average. Our House predictions are more reliant on fundamentals due to the difficulty of conducting accurate district-level polls, while our Senate predictions rely more on race-level polls because Senate races are less dictated by the national environment.

Figure 1. DDHQ methodology flowchart.

2.2. Discussion of Covariance Matrix Construction

To be both plausible and compatible with our simulation techniques, a covariance matrix must satisfy two conditions that are not always present. First, the matrix must be of full rank. This severely limits the possibility of estimating the matrix through calculating the covariance of predictors in a model, because the rank of the sample covariance is at most the minimum of the number of predictors and the number of observations. For the House of Representatives, one would need more than 435 predictors to obtain a full-rank covariance matrix, a near impossibility. Next, we restate the conditional distribution of a multivariate normal distribution, found through a Schur complement as shown in Haynsworth (1968). If $X \sim N(\mu, \Sigma)$ is partitioned into vectors $X_1, X_2$ ($X_2$ being the vector of known values), with the mean and covariance partitioned as $\mu_1, \mu_2$ and $\Sigma_{11}, \Sigma_{12}, \Sigma_{21}, \Sigma_{22}$, then the conditional distribution is as follows:

$$X_1 \mid X_2 = x_2 \sim N\left(\mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2),\; \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\right) \quad (2.1)$$


where $\mu_1, \mu_2$ are the means of the partitioned vectors, and $\Sigma_{11}, \Sigma_{12}, \Sigma_{21}, \Sigma_{22}$ are the partitioned blocks of the initial covariance matrix. This is an important result when we are trying to estimate the results of one state given the results in another state. We also use conditional distributions to calibrate our model on past elections. This is also why a full-rank covariance matrix is so important: if the covariance matrix is rank-deficient, then once enough variables are known, the $\Sigma_{22}$ block becomes singular, making it impossible to calculate a conditional distribution. It is extremely difficult to find more than 51 sufficiently independent variables to compute such a matrix for a presidential election (or more than 400 for the U.S. House), so alternative means must be used. Our resolution was to build a matrix with a national-level correlation coefficient $\rho_N$ and a regional-level correlation coefficient $\rho_R$, both calculated by training on past elections. This preserved both conditions from above, and when it was optimized to fit the 2016 election, it ended up fitting the 2020 results well. A major benefit of this technique is that only two correlation values are being estimated, rather than thousands of separate pairwise correlations. Fortunately, the simulations are robust to the choice of correlation values, so it is not necessary to estimate the national and regional correlations precisely.
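As a concrete illustration, Equation 2.1 can be computed in a few lines of R. The following is a minimal sketch; the function name and interface are our own illustration, not DDHQ production code.

```r
# Conditional multivariate normal via the Schur complement (Equation 2.1).
# mu: mean vector; Sigma: covariance matrix; known: indices of the observed
# components; x2: their observed values.
conditional_mvn <- function(mu, Sigma, known, x2) {
  u       <- setdiff(seq_along(mu), known)           # indices still unknown
  S12     <- Sigma[u, known, drop = FALSE]
  S22_inv <- solve(Sigma[known, known, drop = FALSE])
  list(
    mean = drop(mu[u] + S12 %*% S22_inv %*% (x2 - mu[known])),
    cov  = Sigma[u, u, drop = FALSE] - S12 %*% S22_inv %*% t(S12)
  )
}
```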

Second, the covariance matrix must be positive definite (containing only positive eigenvalues). This is because a Cholesky decomposition is necessary for generating multivariate normal random samples. The covariance matrix $\Sigma$ is decomposed into the form $A'A$, where $A$ is upper triangular and $A'$ is lower triangular. Then, by generating an $N \times P$ matrix $X$ of standard normal random variables, where $N$ represents the desired number of multivariate normal samples and $P$ represents the dimension of the distribution, the multivariate random variables are generated as the matrix product $XA$ (plus the mean vector). A multivariate normal distribution without a positive-definite covariance matrix will have a degenerate probability density function (pdf), and will not be useful in simulating elections.
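A minimal sketch of this sampling scheme follows. (MASS::mvrnorm itself uses an eigendecomposition rather than chol, but the principle is the same.)

```r
# Generate N draws from N(mu, Sigma) via Cholesky decomposition.
# chol() returns an upper-triangular A with Sigma = t(A) %*% A, so X %*% A
# has covariance Sigma when X holds independent standard normals.
rmvn_chol <- function(N, mu, Sigma) {
  A <- chol(Sigma)                              # fails unless Sigma is positive definite
  X <- matrix(rnorm(N * length(mu)), nrow = N)  # N x P standard normals
  sweep(X %*% A, 2, mu, "+")                    # add the mean to each row
}
```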

2.2.1. Simplified Method

Fast simulation for large multivariate random variables is necessary when simulating House elections. The mvrnorm function in the MASS package (Venables & Ripley, 2002) is very efficient, but for larger multivariate distributions, the large matrix multiplication becomes a major drag on simulation speed. Matrix multiplication does not scale well with size, so it becomes necessary to use a more efficient operation, like matrix addition. For large electoral bodies like the U.S. House, a much more efficient technique can be utilized. However, it must be emphasized that this faster method only works for forecasts that assume normality rather than a fat-tailed distribution. Also, this method relies on simplifying a correlation matrix to two levels of correlation. Given two levels of correlation (like a national-level and regional-level correlation), one can express a House seat’s margin as follows:

$$\text{MARGIN} = N + R + D \quad (2.2)$$


where $N \sim N(0, \sigma_N^2)$ denotes a national effect, $R \sim N(0, \sigma_R^2)$ denotes a regional effect, and $D \sim N(\mu_D, \sigma_D^2)$ is a district-level effect. The distribution of the final margin is thus $\text{MARGIN} \sim N(\mu_D, \sigma_N^2 + \sigma_R^2 + \sigma_D^2)$, with $\mu_D$ being the initially forecasted two-party margin in a district. In order to update this prediction based on new information, one takes the Schur complement stated in Equation 2.1 above, with $\Sigma$ being the original covariance matrix. Now, consider a couple of examples of correlation between districts. For two districts $D_i, D_j$ in different regions of the country $R_1, R_2$, the correlation is given by

$$\rho_N = \frac{\mathrm{Cov}(N + R_1 + D_i,\, N + R_2 + D_j)}{\sqrt{\mathrm{Var}(N + R_1 + D_i)\,\mathrm{Var}(N + R_2 + D_j)}} = \frac{\sigma_N^2}{\sigma_N^2 + \sigma_R^2 + \sigma_D^2}. \quad (2.3)$$


Likewise, for two districts in the same region of the country, the correlation is $\rho_R = \frac{\sigma_N^2 + \sigma_R^2}{\sigma_N^2 + \sigma_R^2 + \sigma_D^2}$. Each of $\sigma_N^2, \sigma_R^2, \sigma_D^2$ can be derived from the total variance $\sigma^2$ and the correlation values as follows:

$$\sigma_N^2 = \sigma^2 \rho_N, \quad \sigma_R^2 = \sigma^2 (\rho_R - \rho_N), \quad \sigma_D^2 = \sigma^2 (1 - \rho_R). \quad (2.4)$$


The national- and regional-level correlations $\rho_N$ and $\rho_R$ can be set to desired values, and from there, simulating elections becomes simpler and much faster. As an example, the DDHQ 2022 House model has $\rho_N = 0.3$ and $\rho_R = 0.6$. We show a map of the House regions we used in our model in the next subsection. Instead of conducting a Cholesky decomposition and then multiplying by a matrix of normally distributed random variables, one only needs to generate a matrix of normal random variables and add it to a normally distributed vector. The benefit of this technique comes from avoiding large matrix multiplication entirely. In testing, this simplified technique sped up simulation by several times compared with the mvrnorm technique discussed earlier.
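The following R sketch implements this additive scheme with the correlations above; the function name and the toy margins, regions, and variances in the usage lines are our own illustration.

```r
# Fast additive simulation of correlated margins (Equation 2.2), avoiding
# Cholesky decomposition and large matrix multiplication entirely.
simulate_house <- function(n_sims, mu, region, sigma2, rho_N = 0.3, rho_R = 0.6) {
  P    <- length(mu)
  regs <- match(region, unique(region))             # region index per district
  s_N  <- sqrt(sigma2 * rho_N)                      # component sds (Equation 2.4)
  s_R  <- sqrt(sigma2 * (rho_R - rho_N))
  s_D  <- sqrt(sigma2 * (1 - rho_R))
  Z_N  <- rnorm(n_sims)                                 # shared national shock
  Z_R  <- matrix(rnorm(n_sims * max(regs)), n_sims)     # one shock per region
  Z_D  <- matrix(rnorm(n_sims * P), n_sims)             # district-level noise
  margins <- outer(Z_N, s_N) +
    Z_R[, regs, drop = FALSE] * rep(s_R, each = n_sims) +
    Z_D * rep(s_D, each = n_sims)
  sweep(margins, 2, mu, "+")                            # add forecasted margins
}

# Toy usage: three districts, two regions, unequal variances.
sims <- simulate_house(10000, mu = c(2.5, -4.0, 6.1),
                       region = c("A", "A", "B"), sigma2 = c(25, 25, 40))
```

Even with unequal district variances, this construction reproduces the target correlations exactly: the covariance between districts in different regions is $s_{N,i} s_{N,j} = \sigma_i \sigma_j \rho_N$, so the correlation remains $\rho_N$ as desired.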

2.3. Preelection Modeling

There is an important distinction to be made between setting the correlation and covariance structure in a forecast, and adjusting the forecast in the presence of new information. This section focuses on the construction of a plausible and effective correlation structure.

2.3.1. Correlation Matrix

When there are simultaneous district-level races taking place, their margins are correlated. However, as was explained in the last section, great care must be taken to ensure that the resulting matrix is full-rank and positive definite. As stated in the last section, we use a national and regional-level correlation value that we optimize using past elections. Shown below is a 2022 House map broken down by ‘region.’

Figure 2. DDHQ House ‘regions.’

In our House model, every congressional district is 30% correlated with every district in a different region, and 60% correlated with districts in the same region. We arrived at those numbers by testing on previous elections, and we split our regions based on demographics rather than location. The benefits of this simplification do come with some costs, however. Although we assigned districts to regions based on numerical cutoffs, the selection of the criteria that determined each region was somewhat subjective. In future research, one could use k-means clustering on district demographics and geography as a more objective way to determine regions for the House, as in the sketch below.
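As a sketch of that future direction (not our current methodology), the base R kmeans function could produce such regions; the demographic features below are invented placeholders.

```r
# Hypothetical k-means regioning; 'district_demos' is a made-up data frame
# of district-level features standing in for real demographics and geography.
set.seed(2022)
district_demos <- data.frame(pct_college = runif(435),
                             pct_urban   = runif(435),
                             latitude    = runif(435, 25, 49))
features <- scale(district_demos)                     # common scale for features
regions  <- kmeans(features, centers = 8, nstart = 25)$cluster  # say, 8 regions
table(regions)                                        # district counts per region
```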

2.3.2. Covariance Matrix

Every covariance matrix depends on two things: correlation and variance. One might assume that all races have the same variance, in which case one would only need to scale the correlation matrix by the desired variance. However, not all races have the same variance. States that are not considered battlegrounds are seldom polled, and the polling in those states often understates the dominant party’s strength. As such, the margins in those states are more variable, even if the winner is not in doubt. We outline the importance of using nonconstant variance in the next section with what we call the ‘Kentucky Problem.’ Finding the right variance for each race can be done by examining conditional distributions. For example, one can ask how much a prediction for Ohio should change in a scenario where Trump does 10 points better than the model prediction in Kentucky, and adjust the variances for the states accordingly until the conditional distributions are plausible in different scenarios.
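For instance, using the conditional_mvn sketch from earlier in this section, one can test the Ohio-given-Kentucky scenario just described; all numbers here are invented for illustration.

```r
# How much should the Ohio prediction move if Trump beats his Kentucky
# forecast by 10 points? Means, variances, and correlation are invented.
mu    <- c(OH = 4, KY = 22)
Sigma <- matrix(c(60, 0.6 * sqrt(60 * 120),
                  0.6 * sqrt(60 * 120), 120), 2, 2)    # KY variance inflated
post <- conditional_mvn(mu, Sigma, known = 2, x2 = 32) # KY reports at +32
post$mean   # updated Ohio margin, roughly Trump +8
```

Shrinking the Kentucky variance toward that of a battleground state makes the same surprise drag Ohio much further, which is exactly the behavior the variance inflation is designed to prevent.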

2.4. Election Night Updating

Our modeling and methodology are very versatile. Unlike most models, ours is used not only to forecast an election before election night, but also to forecast elections during election night, in real time as results come in.

2.4.1. Two-Party Case

We use the conditional normal distribution stated above to update our predictions when the margins in one or more states are known, producing a new forecast conditional on the known margins. This process is also extremely useful when building a correlation matrix, as one can use the conditional distributions to test what a forecast would predict in probable and improbable events, like Trump winning California or Biden winning West Virginia. It is equally applicable when several states are known. Perhaps the neatest element of these conditional distributions is that if the starting distribution is multivariate normal, the conditional distributions are multivariate normal, and the marginal distribution for each state is still a normal distribution, with the mean and variance updated by the observed data. The conditional mean and variance are found by taking the Schur complement, as described in Equation 2.1. Therefore, one can easily calculate the win probability and expected margin in any possible situation, for any state. This can lead to much better estimation of covariance matrices, and would allow flaws in FiveThirtyEight’s and The Economist’s models to be fixed before one writes a blog post about the other’s model.

2.4.2. Multiparty Case

Our modeling is not limited to the two-party system of U.S. general elections, where the two-party margin is the only outcome variable. We have designed modeling that is capable of working in scenarios where there are three or more viable parties. We use different distributions to model proportions, but still use the same correlation structure and the same idea of conditional distributions.

The Dirichlet distribution is a multivariate generalization of the beta distribution. To simulate a Dirichlet random variable, take $n$ independent Gamma random variables $X_1, \ldots, X_n$ (one for each major candidate or party), where $X_i$ has shape parameter $\alpha_i$ and all share a common fixed scale parameter, and let $S = \sum_{i=1}^n X_i$. Then, $(\frac{X_1}{S}, \ldots, \frac{X_n}{S}) \sim \text{Dirichlet}(\alpha_1, \ldots, \alpha_n)$. The idea of using a Dirichlet distribution to model elections with several viable parties is nothing new; Hanretty (2021), among others, used a Dirichlet model to predict the results of by-elections (special elections) in the U.K. House of Commons. In that model, the errors were comparable to those of polls, despite the model not using polls as an input. For our purposes, we add another element by extending the distribution to scenarios where there are many concurrent and correlated elections, like a standard election night. Because of the ease of generating Gamma random variables in most programming languages, the Dirichlet distribution is also excellent for realistic and reasonably efficient simulation.
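A minimal sketch of this construction in R (the function name is ours; base R’s rgamma is all that is needed):

```r
# Dirichlet draws from independent Gamma variables with a common scale;
# the scale cancels in the ratio, so any fixed value works (we use 1).
rdirichlet_one <- function(alpha) {
  x <- rgamma(length(alpha), shape = alpha, scale = 1)
  x / sum(x)
}
rdirichlet_one(c(7, 4, 3))  # strengths 7, 4, 3 -> shares near 50%, 29%, 21%
```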

We can model the ‘strength’ of a candidate throughout a state by using a multivariate Gamma distribution. When we use a Gamma distribution to represent the strength of each candidate in each county, it follows that each county’s vote shares are Dirichlet distributed. We set the distribution for each candidate to have a mean that is highest around the candidate’s home, and a spatially dependent covariance matrix. We use a Gaussian copula so that we can more easily calculate conditional distributions once some constituencies have reported. The conditional distributions allow us to improve our prior estimate, and when we simulate from the conditional distribution, we can get approximate probabilities of each candidate coming out on top. This methodology is performed in an attached R file. The simulations in this scenario are much slower than for other types of elections, since each ‘simulation’ requires generating a Gamma random variable for every candidate in every county. We outline a detailed case study in the subsection ‘Indiana Example.’
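The sketch below shows one way to implement this under stated assumptions: a Gaussian copula with the exponential-decay spatial correlation used in the Indiana example below, transformed to Gamma margins. The distance matrix D and prior-strength matrix alpha are hypothetical inputs; the attached R file, not this sketch, is our working implementation.

```r
library(MASS)

# One simulated primary: correlated candidate 'strengths' across counties via
# a Gaussian copula, converted to Gamma margins; each county's vote shares
# are then Dirichlet distributed. alpha: counties x candidates strengths;
# D: county-to-county distance matrix.
simulate_primary_once <- function(alpha, D) {
  C <- exp(-D / max(D))                       # spatial correlation between counties
  strength <- matrix(NA_real_, nrow(alpha), ncol(alpha))
  for (j in seq_len(ncol(alpha))) {
    z <- MASS::mvrnorm(1, mu = rep(0, nrow(alpha)), Sigma = C)  # copula draw
    strength[, j] <- qgamma(pnorm(z), shape = alpha[, j], scale = 1)
  }
  strength / rowSums(strength)                # county-level vote shares
}
```

Repeating this draw many times approximates the prior, and conditioning on reported counties proceeds through the Gaussian copula layer, where the conditional normal machinery from Section 2.2 applies.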

2.4.3. Seat Calculator

We outline how a real-time model can be built, using the outcomes of called races to estimate the number of seats each party will eventually win. FiveThirtyEight attempted to build something similar in the 2018 midterm elections. Their real-time model had mixed results: it initially showed Democrats as the clear favorite to take control of the House, then quickly moved toward a toss-up, before slowly converging to the actual result. Their estimator used the live margins from the states and districts that had already been called. The model’s early move toward Republicans was most likely caused by the fact that it did not account for the ‘blue shifts’ that occurred after election night, a relatively new phenomenon at the time.

DDHQ has made some improvements to this line of thinking. We recognize that the reported margins will not be settled on election night, especially with the rise of mail-in and absentee voting and the blue (or red) shifts that can take place after election night. For that reason, the reported margins in called races are not considered in this model. We rely only on our preelection forecast and the races that we have called on election night. Our model recognizes that the margin in a called race is still a random variable, but one that we know will be greater than zero for the party the race is called for. By treating the called races as random variables, we avoid underestimating the variance in our updates. Consequently, our model will not make any undue assumptions, and it will be able to steadily converge to the eventual outcome of the House. We outline this with a flowchart and example in Section 3.3.

3. Discussion/Examples

This section has important examples that help to illustrate our methodology and show the benefits of adopting certain techniques.

3.1. Kentucky Problem

Kentucky is usually one of the first states to report results on election night. If the result in Kentucky ends up being substantially different from what was predicted, the estimates for the other states in the region will likewise move substantially with Kentucky. From a forecasting point of view, this can flip a model from predicting a Biden victory to predicting a Trump victory if not handled properly. The biggest reason that non-swing states like Kentucky can throw off a model’s estimate is that reliable polling data is scarce and polls intrinsically tend to underestimate the margin in such states. Consider FiveThirtyEight’s 2016 election model, which was mostly a polls-based forecast (Silver, 2016). That forecast greatly underestimated the actual margins in non-battleground states for both Trump and Clinton. Shown below is a plot of FiveThirtyEight’s residuals (actual − predicted Trump margin) versus actual margin, for both the 2016 and 2020 elections.

Figure 3. Evaluation of FiveThirtyEight 2016 model.

Figure 4. Evaluation of FiveThirtyEight 2020 model.

The FiveThirtyEight model in 2016, being mostly polling based, severely underestimated Trump’s margin in traditionally Republican states, while underestimating Clinton’s margin in traditionally Democratic states. Although FiveThirtyEight partially resolved this in 2020 by giving more weight to fundamentals, the difficulty of polling non-swing states remains a major source of model inaccuracy, causing even the best-calibrated models to underestimate the eventual margins in non-battleground states. DDHQ’s 2020 election model had a similar issue. The way DDHQ resolved the ‘Kentucky Problem’ in simulations is by inflating the variances of non-swing states, so that a margin much higher than expected will only marginally shift the estimates for other states, even if the correlation is relatively high. This variance inflation for noncompetitive races is also a key improvement over our 2020 forecast (Williams et al., 2020) that we made for the 2022 midterms.

As far as compatibility is concerned, scaling the variances of a correlation matrix does not pose a risk to the matrix’s full-rank status, as the scaling can be expressed as multiplication by a full-rank diagonal matrix $V$. By Sylvester’s rank inequality, the product of two full-rank square matrices is always full-rank. So long as the initial correlation matrix is full-rank and positive definite, scaling is a safe way to improve a covariance matrix. We can express this transformation as $V \Sigma V$, where $\Sigma$ is the initial correlation matrix and $V$ is a diagonal matrix of standard deviations. However, transformation through matrix multiplication is computationally inefficient for large matrices, so we simply scale the rows and the columns by the standard deviations to achieve a quick transformation. On a paradoxical note, when swing-state outcomes are thought of as Bernoulli random variables, they have high variance, but when their final two-party margins are thought of as normally distributed random variables, they actually have the lowest variance, since those states are heavily polled.
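A sketch of this fast scaling in R (the function name is ours):

```r
# Turn a correlation matrix R into the covariance matrix V %*% R %*% V
# without forming V: scale rows and columns by the standard deviations.
scale_to_covariance <- function(R, sds) {
  R * tcrossprod(sds)   # element [i, j] becomes sds[i] * R[i, j] * sds[j]
}
```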

We present a table showing the model adjustments for the first 10 states called in 2020, in terms of expected electoral votes for Trump. The constant variance model (Model 1) assumes a variance of 50 for all states, and the nonconstant variance model (Model 2) assumes a variance of $16 + 2|M|$, where $M$ is the forecasted margin. Both assume correlations of 0.6.

Table 1. Model adjustment for the expected electoral votes for Trump in the first 10 states called in 2020.

| State | Model 1 | Model 2 |
| --- | --- | --- |
| Pre-election | 201.0 | 198.1 |
| Kentucky | 245.3 | 234.3 |
| Vermont | 203.1 | 202.2 |
| Indiana | 219.8 | 218.4 |
| West Virginia | 256.8 | 245.1 |
| Virginia | 250.0 | 239.6 |
| South Carolina | 250.1 | 241.2 |
| Alabama | 249.4 | 240.1 |
| Mississippi | 244.2 | 235.3 |
| Tennessee | 245.2 | 235.9 |
| Oklahoma | 248.8 | 238.0 |

The nonconstant variance model stays closer to Trump’s eventual electoral vote count of 232. This model better addresses the Kentucky Problem and produces predictions that track the eventual result more closely.

3.2. Indiana Example

We will use the example of a primary that took place in 2018. The Republican senatorial primary in Indiana was contested among three viable candidates: Rep. Luke Messer, Rep. Todd Rokita, and the eventual senator, Mike Braun. We know from our experience covering primaries that a candidate’s strength is most concentrated around where they live. From that, we can build a spatially weighted distribution centered around their locations. A candidate’s home county is where we would expect them to perform the strongest, and the further away a county is, the weaker they would be expected to perform.

In this hypothetical example, we model the prior strength of each candidate as follows: Let $\Delta_{max}$ be the maximum possible distance between two counties and $\Delta$ be the distance from the candidate’s home county. Then, we model the strength $\phi$ as $\phi = \frac{8\Delta + 2(1 - \Delta)}{\Delta_{max}}$. Say the prior strengths of the candidates are estimated at 7, 4, and 3 in a given county; then the prior estimates for their vote shares would be 50%, 29%, and 21%, respectively. As for the initial correlation matrix, the correlation between any two counties $A, B$ is $\mathrm{Corr}(A, B) = \exp(-\frac{\Delta_{A,B}}{\Delta_{max}})$, and this is used for each candidate. One can also add in demographics and district boundaries to enhance the ‘priors,’ but for the purposes of this example, we stick to location. We use the R package sf (Pebesma, 2018) to read the shapefiles and produce the maps shown below. The first map shows initial estimates based solely on the candidates’ locations, assuming the three candidates are evenly matched.

Figure 5. Initial prediction of Indiana Primary.

Then, once some counties have reported, we can update our previous estimate by taking the conditional distribution for the strength of each candidate. We simulated draws from this conditional distribution to estimate the probability of each candidate winning; in this example, Braun won in about 99% of the draws. Shown below is the predicted map after 20 randomly selected counties have reported, compared side by side with the actual results.

Figure 6. Prediction of Indiana primary with 20 counties known.

Figure 7. Actual Results of Indiana Primary.

Finally, we present a table comparing the 95% intervals from our simulations with the actual results. This represents the prediction for the other 72 counties given the results of 20 randomly selected counties.

Table 2. Simulated 95% intervals versus actual results in the Indiana primary.

| Candidate | Low | Median | High | Actual |
| --- | --- | --- | --- | --- |
| Braun | 37.2% | 41.0% | 45.2% | 41.2% |
| Messer | 23.9% | 26.7% | 30.4% | 28.8% |
| Rokita | 28.3% | 32.1% | 36.3% | 30.0% |

3.3. House Seat Calculator

First, we outline our thought process with the flowchart shown below:

Figure 8. Live seat methodology flowchart.

Also, it is important to note that this methodology is perfectly applicable to Senate seats, and is actually much simpler to carry out, since there are only around 35 Senate races per cycle compared with all 435 House seats.

3.3.1. Subsetting

In the early stages of election night, DDHQ will likely be able to call many races that are rated as solidly Democratic or solidly Republican. Most of our simulations from our forecast will have these outcomes, so we can simply take the subset of simulations that have the desired outcome. Shown below is a simplified example of our process. Suppose it is 7:15 p.m. on election night, and DDHQ has called IN-06 for Republicans and IN-07 for Democrats.

| Simulation | IN-05 | IN-06 | IN-07 | IN-08 |
| --- | --- | --- | --- | --- |
| 1 | R+2.3 | R+9.8 | D+18.9 | R+25.3 |
| 2 | D+1.5 | D+0.6 | D+18.6 | R+17.0 |
| 3 | R+6.9 | R+9.2 | D+22.9 | R+16.3 |
| 4 | R+10.8 | R+17.1 | R+1.3 | R+25.5 |
| 5 | D+1.8 | R+2.9 | D+9.9 | R+10.1 |

Since Simulations 1, 3, and 5 are the only ones that match both calls, we would include only those simulations in the new subset. The new subset also excludes the called seats, since their results are no longer relevant. This strategy is effective when only the safest seats are involved, but it cannot be sustained once we start to call the toss-up races, because even with an accurate forecast, we will quickly run out of simulations to subset. In the next section, we discuss how we overcome this curse of dimensionality through rejection sampling.
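In R, this subsetting step can be as simple as the sketch below, where sims is a simulations-by-districts matrix of margins like the table above; the names and the positive-margins-are-Republican sign convention are our own.

```r
# Keep only simulations consistent with the called races, then drop the
# called seats, whose outcomes are now fixed. calls: named vector of "R"/"D".
subset_sims <- function(sims, calls) {
  keep <- rep(TRUE, nrow(sims))
  for (d in names(calls)) {
    keep <- keep & (if (calls[[d]] == "R") sims[, d] > 0 else sims[, d] < 0)
  }
  sims[keep, setdiff(colnames(sims), names(calls)), drop = FALSE]
}
# With the table above, subset_sims(sims, c("IN-06" = "R", "IN-07" = "D"))
# keeps Simulations 1, 3, and 5.
```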

3.3.2. Late-Game Simulator

This model saves the most recent subset after each run. Our model has a ‘pivot point,’ reached when the last subset had over 10,000 simulations and the new subset would have fewer than 10,000. Once we reach this point on election night, the model finds the sample mean and covariance of the old subset. This old subset, as well as the new subset, will be multivariate normally distributed (proof and simulation study in the Appendix). The multivariate normal property is important, especially once we simulate more draws. Then, using the sample mean and covariance as the parameters, we simulate extra draws from this distribution until we get 10,000 draws that have the desired outcome in the newly called seat (we ignore the previously called seats, because the old subset is already derived from their outcomes). There is no risk of the sample covariance matrix not being positive definite, since there will always be at least 10,000 samples from the initial distribution. This is a form of multivariate rejection sampling: we sample until we get 10,000 samples that have the desired outcome in the newly called seats. Figures 9 and 10 below illustrate the concept.

Figure 9. Pre-rejection sampling distribution.

Figure 10. Post-rejection sampling distribution.

In the extremely rare event that the outcome of the next race does not occur in our subset, we cap the number of added simulations at 10,000,000 and simply take the 10,000 simulations closest to the desired outcome. So far in our testing, we have not needed this fallback, even in the most extreme wave elections where a party wins 350 seats. A minimal sketch of the rejection-sampling step appears below.
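Under the same sign convention as the subsetting sketch (positive margins denote Republican wins), the rejection-sampling step looks roughly like this; the cap and closest-simulation fallback just described are omitted for brevity, and the names are illustrative.

```r
library(MASS)

# Refit a multivariate normal to the last subset, then draw until 10,000
# simulations match the newly called seat.
refill_subset <- function(old_subset, seat, outcome, want = 10000, batch = 50000) {
  mu   <- colMeans(old_subset)               # sample mean of the old subset
  S    <- cov(old_subset)                    # sample covariance of the old subset
  kept <- old_subset[0, , drop = FALSE]      # empty matrix, same columns
  while (nrow(kept) < want) {
    draws <- MASS::mvrnorm(batch, mu, S)
    colnames(draws) <- colnames(old_subset)
    ok   <- if (outcome == "R") draws[, seat] > 0 else draws[, seat] < 0
    kept <- rbind(kept, draws[ok, , drop = FALSE])
  }
  kept[seq_len(want), , drop = FALSE]
}
# e.g., new_subset <- refill_subset(old_subset, seat = "IN-05", outcome = "R")
```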

4. Summary

Although we described several possible improvements to election forecasting in this article and in Williams et al. (2020), our main purposes here were to improve the mechanics of modeling correlated outcomes and to extend that methodology to a wide variety of scenarios. There are many different forecasts out there, but all forecasts should consider the mechanics of making outcomes correlated, and should thoroughly test the probability of joint outcomes through conditional distributions, regardless of how they predict individual race margins and probabilities.

We recognize that other forecasters often try to do too much with too little information, like estimating thousands of separate correlation values with just a few previous elections. We decided that the best resolution was to simplify the map into regions, and to just use one correlation value for districts in different regions, and another for districts in the same region. Our approach may seem somewhat simplistic, but the covariance in our model is highly plausible and is calibrated on both preelection and election-night scenarios.

It was also important to show that our modeling is usable when some actual information is present. Through conditional distributions, models can be improved with results from any combination of districts, and will steadily converge to the eventual outcome. When building and testing a forecast, this approach also allows us to go back and revise our model in case our conditional distributions seem unrealistic. We use multivariate normal distributions to model the dependence in our forecast, but the methodology for those who want to use a multivariate t distribution is outlined in Appendix C.

This conditional distribution methodology can even be generalized to multiparty scenarios, so election night forecasting need not be limited to U.S. general elections every 2 years. There are dozens of primary nights in every 2-year cycle in the United States, where several candidates have viable shots at winning. Further, the multiparty methodology is applicable for elections outside the United States, since most European and Asian democracies work under parliamentary systems. As such, the multiparty methodology makes it possible to use such tools on a far more frequent basis.

It was also necessary to consider a more complicated but more realistic election-night scenario. Through our subsetting and rejection sampling process, it is now possible for forecasters to build accurate models that take into account election-night results. We are also able to overcome the dreaded curse of dimensionality in a very efficient manner, while only relying on called races, rather than trying to use margins when they are not determined yet. We skipped some important mathematical steps in that subsection, but we outline the math in detail with a case study in the appendices. Our live seat calculator makes fewer assumptions than previous models, and is very robust to the blueshifts and redshifts that can occur in the weeks after election night.

As a final note, some important proofs and footnotes are included in the appendices below. They include a proof of the multivariate normality of a subset, a simulation study furthering the proof, a statement of code and packages used, and a quick discussion of using a t distribution to model outcomes instead of a normal distribution.


Disclosure Statement

Sydney Louit, Mukul Ram, Kiel Williams, Alex Alduncin, Patrick McCaul, and Scott Tranter have no financial or non-financial disclosures to share for this article.


References

Ding, P. (2016). On the conditional distribution of the multivariate t distribution. The American Statistician, 70(3), 293–295. https://doi.org/10.1080/00031305.2016.1164756

Gelman, A. (2020). Reverse-engineering the problematic tail behavior of the FiveThirtyEight presidential election forecast. Statistical Modeling, Causal Inference, and Social Science. https://statmodeling.stat.columbia.edu/

Hanretty, C. (2021). Forecasting multiparty by-elections using Dirichlet regression. International Journal of Forecasting, 37(4), 1666–1676. https://doi.org/10.1016/j.ijforecast.2021.03.007

Haynsworth, E. V. (1968). On the Schur complement. Basel Mathematical Notes, BMN 20, 17. University of Basel.

Heidemanns, M., Gelman, A., & Morris, G. E. (2020). An updated dynamic Bayesian forecasting model for the US presidential election. Harvard Data Science Review, 2(4). https://doi.org/10.1162/99608f92.fc62f1e1

Isakov, M., & Kuriwaki, S. (2020). Towards principled unskewing: Viewing 2020 election polls through a corrective lens from 2016. Harvard Data Science Review, 2(4). https://doi.org/10.1162/99608f92.86a46f38

Lauderdale, B. E., & Linzer, D. (2015). Under-performing, over-performing, or just performing? The limitations of fundamentals-based presidential election forecasting. International Journal of Forecasting, 31(3), 965–979. https://doi.org/10.1016/j.ijforecast.2015.03.002

Pebesma, E. (2018). Simple features for R: Standardized support for spatial vector data. The R Journal, 10(1), 439–446. https://doi.org/10.32614/RJ-2018-009

R Core Team. (2021). R: A language and environment for statistical computing. R Foundation for Statistical Computing. https://www.R-project.org/

Silver, N. (2016). A user’s guide to FiveThirtyEight’s 2016 general election forecast. https://fivethirtyeight.com/features/a-users-guide-to-fivethirtyeights-2016-general-election-forecast/

Venables, W. N., & Ripley, B. D. (2002). Modern applied statistics with S (4th ed.). Springer. https://doi.org/10.1007/978-0-387-21706-2

Williams, K., Ram, M., Shor, M., Jarugula, S., DeRemigi, D., Alduncin, A., & Tranter, S. (2020). Forecasting the 2020 U.S. elections with Decision Desk HQ: Methodology for modern American electoral dynamics. Harvard Data Science Review, 2(4). https://doi.org/10.1162/99608f92.9663befd


Appendices

Appendix A. Proof of Multivariate Normality of Subset

The purpose of this proof is to show that the distribution of $X_2 \mid \mathrm{sign}(X_1)$ is still multivariate normal, though the mean and covariance of this distribution have no closed form. We start with an $n$-dimensional multivariate normal distribution with components $X_1, \ldots, X_n$ and density $f(x_1, \ldots, x_n)$. Then we take the slice where $x_1 > 0$. The density of $X_2, \ldots, X_n \mid X_1 > 0$ is proportional to $\int_0^\infty f(x_1, \ldots, x_n)\,dx_1$ (or to $\int_{-\infty}^0 f(x_1, \ldots, x_n)\,dx_1$ when conditioning on $X_1 < 0$). Then, we use the principle that every integral is the limit of a Riemann sum. The components of this Riemann sum are all multivariate normally distributed (each component is the conditional distribution $X_2, \ldots, X_n \mid X_1 = x_1$ over the domain of $x_1$). From this, we apply the stable property of multivariate normal distributions to conclude that the resulting integral is a multivariate normal distribution (though the mean and covariance have no closed form and must be approximated through rejection sampling). Finally, this argument extends to multivariate integrals because the components of the Riemann sum are still multivariate normal (conditional distributions with multiple variables known), as shown through a visual example and simulation study in the next appendix.

Appendix B. Simulation Study of Multivariate Normality Proof

Let’s consider the Senate results in Arizona, Georgia, Nevada, and Pennsylvania. All four races are considered to be very close contests at the time of this writing (Summer 2022). Let’s assume that the races are multivariate normally distributed with the marginal distribution for each race as $N(\mu = 0, \sigma = 5)$, and are all 60% correlated with each other. Then, consider the scenario where Democrats win in both Arizona and Nevada (negative GOP margins in both races). We run 100,000 simulations in this example, and we take only the simulations that have Democratic wins in Arizona and Nevada. Shown below is a contour plot of the sampled outcomes in Georgia and Pennsylvania given the above information. As can be seen in the plot, the joint distribution of the margins in Georgia and Pennsylvania is bivariate normal, with $\hat{\mu}_{GA} = -3.4$, $\hat{\sigma}_{GA} = 4.2$, $\hat{\mu}_{PA} = -3.4$, $\hat{\sigma}_{PA} = 4.2$, and $\hat{\rho} = 0.43$.

Figure B1. Multivariate normality of subset under multiple conditions.
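For readers who wish to reproduce this study, a minimal R sketch follows; the seed is arbitrary, and the estimates will match the figures above only approximately.

```r
library(MASS)

# Four 60%-correlated N(0, 25) Senate margins, conditioned on Democratic
# wins (negative GOP margins) in Arizona and Nevada.
set.seed(1)
Sigma <- 25 * (0.6 + 0.4 * diag(4))           # sd 5, pairwise correlation 0.6
sims  <- MASS::mvrnorm(100000, mu = rep(0, 4), Sigma = Sigma)
colnames(sims) <- c("AZ", "GA", "NV", "PA")
sub <- sims[sims[, "AZ"] < 0 & sims[, "NV"] < 0, ]
colMeans(sub[, c("GA", "PA")])                # roughly -3.4 each
apply(sub[, c("GA", "PA")], 2, sd)            # roughly 4.2 each
cor(sub[, "GA"], sub[, "PA"])                 # roughly 0.43
```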

Appendix C. Assessment of Multivariate t Distribution

While DDHQ prefers to use a multivariate normal distribution, some other forecasters may want to run simulations with a multivariate t distribution, since it has wider tails and thus can better account for extreme scenarios. Because multivariate normal distributions do not have tail dependence, they may fail to account for truly extreme scenarios in which multiple states show extreme results. The past few elections have had some surprising results, but none so extreme as to necessitate a t distribution. FiveThirtyEight used t distributions with 10 degrees of freedom to model states in its 2016 forecast (Silver, 2016), explaining that with the small number of elections to train on, extreme results are still possible. To generate observations from a multivariate t distribution with a given mean and covariance matrix, one can use the mvtnorm package in R. As for the calculation of conditional multivariate t distributions, we restate a result found in Ding (2016).

First, let $X \sim t_p(\mu, \Sigma, \nu)$, with location $\mu$, scale matrix $\Sigma$, degrees of freedom $\nu$, and length $p$. The conditional multivariate t distribution is as follows:

$$X_2 \mid X_1 \sim t_{p_2}\left(\mu_{2|1},\ \frac{\nu + d_1}{\nu + p_1}\Sigma_{22|1},\ \nu + p_1\right) \quad (\text{C.1})$$


where $d_1 = (X_1 - \mu_1)'\Sigma_{11}^{-1}(X_1 - \mu_1)$ is the squared Mahalanobis distance between $X_1$ and $\mu_1$, $\Sigma_{22|1}$ is the conditional scale matrix of $X_2$ given $X_1$, and $\mu_{2|1}$ is the conditional location parameter of $X_2$ given $X_1$. Conditional distributions are most useful when some state margins are known and others are not, as one can update model estimates for districts yet to report using either a conditional normal or a conditional t. However, the downside is that t distributions are generally not stable, so the fast method described in the ‘Preelection Modeling’ section cannot be used under a t-distribution assumption.
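A minimal R sketch of Equation C.1 using the notation above; the function name and interface are our own illustration.

```r
# Conditional multivariate t per Ding (2016). mu: location; Sigma: scale
# matrix; nu: degrees of freedom; known: indices of X1; x1: observed values.
conditional_mvt <- function(mu, Sigma, nu, known, x1) {
  u       <- setdiff(seq_along(mu), known)
  S11_inv <- solve(Sigma[known, known, drop = FALSE])
  S21     <- Sigma[u, known, drop = FALSE]
  d1      <- drop(t(x1 - mu[known]) %*% S11_inv %*% (x1 - mu[known]))  # Mahalanobis^2
  p1      <- length(known)
  list(
    location = drop(mu[u] + S21 %*% S11_inv %*% (x1 - mu[known])),
    scale    = (nu + d1) / (nu + p1) *
               (Sigma[u, u, drop = FALSE] - S21 %*% S11_inv %*% t(S21)),
    df       = nu + p1
  )
}
```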


Data Repository/Code

R (R Core Team, 2021) is used for all calculations and simulations mentioned in this article, and the R package MASS (Venables & Ripley, 2002) is used for generation of correlated multivariate random variables. In the MASS package, the mvrnorm function is used to generate samples from a multivariate normal distribution. We use the sf package (Pebesma, 2018) to produce the maps shown in the Indiana example. Our code is publicly available on Harvard Dataverse at the link https://dataverse.harvard.edu/dataverse/ddhq-dataverse.


©2022 Sydney Louit, Mukul Ram, Kiel Williams, Alex Alduncin, Patrick McCaul, and Scott Tranter. This article is licensed under a Creative Commons Attribution (CC BY 4.0) International license, except where otherwise indicated with respect to particular material included in the article.
