In today’s highly polarized political environment, there is a high level of interest in accurate forecasts of presidential and congressional elections, addressing how likely a given party is to win an election or a chamber of Congress. However, electoral outcomes at the federal level in the United States are strongly correlated from one state to another. Correctly capturing these race-level correlations is a major computational challenge that affects even the most well-known political forecasters, and there is no accepted way of handling the problem. This analysis presents a correlated simulation methodology for two-party and multiparty scenarios, and outlines a methodology for estimating congressional majority sizes in real time on election night. Taking the 2022 midterm election as an example, we construct a covariance matrix characterizing the degree of correlation from one U.S. geographic region to another, and use this matrix to forecast individual race outcomes. We discuss our simplification of the U.S. electoral landscape into key regions, allowing us to avoid the daunting and impractical task of separately estimating thousands of pairwise correlations between states and congressional districts. Our simplification does not reduce the dimensionality of the correlation structure, but permits an accurate and more mathematically efficient simulation than the alternative. The multiparty scenario uses a Senate primary as an example, with discussion of the underlying distributions and correlation structure.
Keywords: elections, simulation, midterms, correlation
In recent years, many different election forecasts have become available online, attempting to accurately predict the results of an election. They will project vote shares, seat counts, and ‘tipping-point’ districts. One challenge that arises when generating the results of a forecast is making the random variables in each simulation correlated, as, for example, with U.S. states in a presidential election. The initial values in a simulation are the forecasted values, but every good forecaster knows there is some uncertainty in their forecast, and this uncertainty is not without a pattern. This analysis will demonstrate efficient ways to simulate two-party elections, while still properly accounting for the shape and size of this uncertainty. It will also demonstrate a correlated simulation methodology for multiparty election scenarios, although the multiparty method is currently less efficient than the two-party simulations. However, the multiparty methodology we describe could be a starting point for faster and more accurate simulation methods. Aside from our preelection simulation methodology, we will also demonstrate how we can update and improve our model’s predictions when the margins of some districts are known. We show examples for both two-party and multiparty scenarios. Last, we even show how we can model elections when only the outcome (R or D win) is known, but the eventual margin is unknown. Through a creative mix of subsetting and rejection sampling, Decision Desk HQ is able to accurately simulate and improve its forecasts in real-time, and will improve on some of the shortcomings of previous attempts to forecast elections up to and through election night.
Decision Desk HQ (DDHQ), founded in 2012, provides a full suite of services connected with election results and analytics. We provide rapid election night results and race calls for primary and general elections across all 50 states, sourced directly from state and county governments. We also supply real-time election night data via API interface to a range of major media organizations, including Vox, The Economist, NewsNation, BuzzFeed, Business Insider, and a host of private clients. DDHQ was one of only nine sources officially recognized by Twitter for election results in 2020, and DDHQ was the first media source to project Joe Biden as the winner of the 2020 presidential election, and Donald Trump as the winner of the 2016 election. DDHQ also provides an industry-standard forecast for presidential and congressional elections. In 2020, the DDHQ presidential forecast successfully predicted Joe Biden as the winner of the presidential election, projecting him to receive 318 total electoral votes—only 12 votes above his ultimate total of 306.
There are multiple challenges that we address in this article. The first challenge is creating an initial correlation and covariance matrix that is both plausible and can be used across elections. The second challenge is producing credible conditional estimates in both hypothetical and real-time scenarios. We provide detailed discussions of both challenges in this section and discuss solutions in both two-party and multiparty (three plus) scenarios in the Methodology section. We also provide some interesting examples in our Discussion section, with a final short summary at the end. In general, we are not focusing so much on the creation of a forecasting model, but rather the execution of one: how forecasters should conduct simulations and make their forecasts versatile enough to be usable up to and during election night.
Creating a correlation structure for election-related scenarios has been discussed extensively in the literature, such as in (Isakov & Kuriwaki, 2020), (Lauderdale & Linzer, 2015), and (Heidemanns et al., 2020). Other election forecasters have tried to estimate every correlation between all pairs of states (over 2,000 pairs), even though there are very few recent elections to train on. In particular, one excellent piece is (Gelman, 2020), a blog post that analyzed Nate Silver’s 2020 election model. This is a great piece not only because it compares two well-known and respected models, but also because it shows how even well-designed correlation structures can turn out to be very implausible. Gelman mentioned the overestimation of tail probabilities as a possible issue (FiveThirtyEight uses a wide-tailed t distribution instead of a normal distribution; Silver, 2016), pointing out that Biden’s probability of winning Alabama should be much less than 2%. However, the main flaw that Gelman identified in the FiveThirtyEight model was that some pairs of states had very low pairwise correlation. When running simulations of Silver’s model, Gelman found that the probabilities of Trump wins in Alaska and Pennsylvania given a Trump win in New Jersey were just 57% and 39%, respectively. When DDHQ tested the scenarios Gelman outlined on our own 2020 model, our probabilities of a Trump victory in Alaska and Pennsylvania given a Trump win in New Jersey would have been about 99.96% and 74%, respectively. We will elaborate more on how we get those values in a later section. Even in Gelman’s 2020 model, the pairwise correlations were estimated, and ended up being extremely variable, with
Below is a chart outlining the workflow of our methodology and ideas. Our forecast contains a predicted margin for every Senate and House race in 2022, and this prediction comes from both fundamentals (fundraising, demographics, election history, redistricting), and a time-weighted, bias-adjusted polling average. Our House predictions are more reliant on fundamentals due to the difficulty of conducting accurate district-level polls, while our Senate predictions rely more on race-level polls because Senate races are less dictated by the national environment.
To compute a covariance matrix that is both plausible and compatible with the simulation techniques, it must satisfy two conditions that are not always present. First, the matrix must be of full rank. This severely limits the possibility of estimating the matrix through calculating the covariance of predictors in a model, because the rank of the sample covariance is at most the minimum of the number of predictors and the number of observations. For the House of Representatives, one would need more than 435 predictors to obtain a full-rank covariance matrix, a near impossibility. Next, we restate the conditional distribution of a multivariate normal distribution, found through a Schur complement as shown in (Haynsworth, 1968). If

$$\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \sim N\left(\begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}\right),$$

then

$$x_1 \mid x_2 = a \;\sim\; N\left(\mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(a - \mu_2),\; \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\right). \tag{2.1}$$
Second, the covariance matrix must be positive definite (containing only positive eigenvalues). This is because Cholesky decomposition is necessary for generating multivariate normal random samples. The covariance matrix
Fast simulation for large multivariate random variables is necessary when simulating House elections. The mvrnorm function in the MASS package (Venables & Ripley, 2002) is very efficient, but for larger multivariate distributions, the large matrix multiplication becomes a major drag on simulation speed. Matrix multiplication does not scale well with size, so it becomes necessary to use a more efficient operation, like matrix addition. For large electoral bodies like the U.S. House, a much more efficient technique can be utilized. However, it must be emphasized that this faster method only works for forecasts that assume normality rather than a fat-tailed distribution. Also, this method relies on simplifying a correlation matrix to two levels of correlation. Given two levels of correlation (like a national-level and regional-level correlation), one can express a House seat’s margin as follows:
Likewise, for two districts in the same region of the country, the correlation is
The national and regional-level correlations
There is an important distinction to be made between setting the correlation and covariance structure in a forecast, and adjusting the forecast in the presence of new information. This area will focus on the construction of a plausible and effective correlation structure.
When there are simultaneous district-level races taking place, their margins are correlated. However, as was explained in the last section, great care must be taken to ensure that the resulting matrix is full-rank and positive definite. As stated in the last section, we use a national and regional-level correlation value that we optimize using past elections. Shown below is a 2022 House map broken down by ‘region.’
In our House model, every congressional district is 30% correlated with every district in a different region, and 60% correlated with districts in the same region. We arrived at those numbers by testing on previous elections, and splitting our regions based on demographics rather than location. The benefits of this simplification do come with some costs, however. Although we came up with the districts based off numerical cutoffs, the selection of the criteria that determined each region was somewhat subjective. In future research, one could use k-means clustering with the district demographics and geography as a more objective way to determine regions for the House.
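The two-level structure described above can be sketched with shared random shocks instead of a full matrix product. This is a minimal illustration, not the authors' code (which uses R): it assumes each standardized margin is a weighted sum of one national shock, one regional shock, and one district shock, with weights chosen so that cross-region correlation is 0.30 and same-region correlation is 0.60. The function names, the three example districts, their means, and their standard deviations are all hypothetical.

```python
import math
import random


def simulate_margins(means, sds, regions, rho_national=0.30, rho_regional=0.60, rng=random):
    """Return one simulated margin per district, correlated through shared shocks."""
    # Squared loadings sum to 1, so each district keeps its own variance sd**2.
    load_national = math.sqrt(rho_national)
    load_regional = math.sqrt(rho_regional - rho_national)
    load_district = math.sqrt(1.0 - rho_regional)
    z_national = rng.gauss(0.0, 1.0)  # one shock shared by every district
    # One shock per region; sorted() keeps the draw order reproducible under a seed.
    z_regional = {r: rng.gauss(0.0, 1.0) for r in sorted(set(regions))}
    margins = []
    for mean, sd, region in zip(means, sds, regions):
        z_district = rng.gauss(0.0, 1.0)  # district-specific noise
        shock = (load_national * z_national
                 + load_regional * z_regional[region]
                 + load_district * z_district)
        margins.append(mean + sd * shock)
    return margins


def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical forecast: districts 0 and 1 share region "A", district 2 is in region "B".
rng = random.Random(2022)
means = [3.0, -1.0, 5.0]
sds = [6.0, 6.0, 6.0]
regions = ["A", "A", "B"]
draws = [simulate_margins(means, sds, regions, rng=rng) for _ in range(20000)]
corr_same_region = pearson([d[0] for d in draws], [d[1] for d in draws])
corr_cross_region = pearson([d[0] for d in draws], [d[2] for d in draws])
print(round(corr_same_region, 2), round(corr_cross_region, 2))
```

Each simulation needs only one normal draw per district plus one per region and one national draw, so the cost grows linearly with the number of seats rather than with the square of it.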
Every covariance matrix depends on two things: correlation and variance. One may want to assume that all races have the same variance, in which case one would only need to scale the correlation matrix by the desired variance. However, not all races have the same variance. States that are not considered battlegrounds are seldom polled, and the polling in those states often understates the dominant party’s strength. As such, the margins in those states are more variable, even if the winner is not in doubt. We will outline the importance of using nonconstant variance in the next section with what we call the ‘Kentucky Problem.’ Finding the right values for the variance of each race can be done through taking the conditional distribution. For example, one can ask how much a prediction for Ohio should change in a scenario where Trump does 10 points better than the model prediction in Kentucky, and adjust the variance for the states accordingly until the conditional distributions are plausible in different scenarios.
Our modeling and methodology are very versatile. Unlike most models, it is used not only to forecast an election before election night, but to forecast elections during election night, in real time as results come in.
We use the conditional normal distribution that we stated above to update our predictions when the margins in a state or some states are known. We can update a new forecast conditional on the known margins. This process is also extremely useful when building a correlation matrix, as one can use the conditional distributions to test what their forecast would predict in probable and improbable events, like Trump winning California or Biden winning West Virginia. It is also applicable with several states known. Perhaps the neatest element of these conditional distributions is that if the starting distribution is multivariate normal, the conditional distributions are multivariate normal, and the marginal distribution for each state is still a normal distribution, with the mean and variance being affected by the observed data. The conditional mean and variance are found through taking the Schur complement, as described in Equation 2.1. Therefore, one can easily calculate the win probability and expected margin in any possible situation, for any state. This can lead to much better estimation of covariance matrices, and will allow flaws in FiveThirtyEight and The Economist’s models to be fixed before one writes a blog post about the other’s model.
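For two states, the conditional update reduces to scalar arithmetic, which makes it easy to check by hand. The sketch below applies the standard conditional normal formulas to one unreported state given one reported state. The function names and every number in the example (forecast means, standard deviations, correlation, observed margin) are hypothetical and are not taken from the DDHQ model.

```python
from statistics import NormalDist


def condition_on_observed(mu_a, sd_a, mu_b, sd_b, rho, observed_b):
    """Conditional mean and sd of state A's margin once state B's margin is known."""
    cov_ab = rho * sd_a * sd_b
    cond_mean = mu_a + cov_ab / sd_b ** 2 * (observed_b - mu_b)
    cond_var = sd_a ** 2 - cov_ab ** 2 / sd_b ** 2
    return cond_mean, cond_var ** 0.5


def win_probability(mean, sd):
    """Probability that a normally distributed margin is above zero."""
    return 1.0 - NormalDist(mean, sd).cdf(0.0)


# Hypothetical inputs: A forecast at +2, B forecast at -1, both with sd 6, correlation 0.6.
# B then reports a margin of +4, five points better than forecast.
prior_win = win_probability(2.0, 6.0)
cond_mean, cond_sd = condition_on_observed(2.0, 6.0, -1.0, 6.0, 0.6, 4.0)
posterior_win = win_probability(cond_mean, cond_sd)
print(round(cond_mean, 2), round(cond_sd, 2), round(prior_win, 3), round(posterior_win, 3))
```

In this example the conditional mean moves by 0.6 × 5 = 3 points and the conditional standard deviation shrinks from 6 to 4.8, so the win probability rises for both reasons.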
Our modeling is not limited to the two-party system of U.S. general elections, where the two-party margin is the only outcome variable. We have designed modeling that is capable of working in scenarios where there are three or more viable parties. We use different distributions to model proportions, but still use the same correlation structure and the same idea of conditional distributions.
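The Dirichlet distribution is a multivariate generalization of the beta distribution. To simulate a Dirichlet random variable, take n independent Gamma random variables $X_i \sim \text{Gamma}(\alpha_i, 1)$, $i = 1, \dots, n$, and divide each by their sum $S = \sum_{i=1}^{n} X_i$; then $(X_1/S, \dots, X_n/S) \sim \text{Dirichlet}(\alpha_1, \dots, \alpha_n)$.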
We can model the ‘strength’ of a candidate throughout a state by using a multivariate Gamma distribution. When we use a Gamma distribution to represent the strength of each candidate in each county, it follows that each county is Dirichlet distributed. We set the distribution for each candidate to have a mean that is highest around the candidate’s home, and a spatially dependent covariance matrix. We use a Gaussian copula, so that we can more easily calculate conditional distributions once some constituencies have reported. The conditional distributions allow us to improve our prior estimate, and when we simulate from the conditional distribution, we can get approximate probabilities of each candidate coming out on top. This methodology is performed in an attached R file. The simulations in this scenario are much slower than for other types of elections, since for each ‘simulation’ it is necessary to generate a Gamma distribution for every candidate and for every county. We outline a detailed case study in the subsection ‘Indiana Example.’
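The Gamma-to-Dirichlet step for a single county can be sketched as follows. This is only the independent, one-county building block: it does not include the Gaussian copula or the spatially dependent covariance described above. The function name and the three concentration parameters are hypothetical.

```python
import random


def dirichlet_draw(alphas, rng=random):
    """Draw vote shares by normalizing independent Gamma(alpha_i, 1) 'strengths'."""
    strengths = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(strengths)
    return [s / total for s in strengths]


# Hypothetical strengths for three candidates in one county.
rng = random.Random(2018)
alphas = [8.0, 6.0, 5.0]
n_sims = 20000
share_draws = [dirichlet_draw(alphas, rng) for _ in range(n_sims)]

# Average simulated share and simulated probability of finishing first, per candidate.
mean_shares = [sum(d[i] for d in share_draws) / n_sims for i in range(len(alphas))]
win_counts = [0] * len(alphas)
for d in share_draws:
    win_counts[d.index(max(d))] += 1
win_probs = [c / n_sims for c in win_counts]
print([round(m, 3) for m in mean_shares], [round(p, 3) for p in win_probs])
```

The average share of each candidate should settle near its concentration parameter divided by the sum of all three, which gives a quick sanity check on the simulation.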
We outline how a real-time model can be built, using the outcomes of called races to estimate the number of seats each party will eventually win. FiveThirtyEight attempted to build something similar to this, in the 2018 midterm elections. Their real-time model had mixed results, as it initially started out with Democrats as the clear favorite to take control of the House, then quickly moved toward a tossup, before slowly converging to the actual result. Their estimator used the live margins from the states and districts that had been called already. Their model’s early move toward Republicans was most likely caused by the fact that it did not account for ‘blueshifts’ that occurred after election night, a relatively new phenomenon at the time.
DDHQ has made some improvements to this line of thinking. We recognize that the reported margins will not be settled on election night, especially with the rise of mail-in and absentee voting, and the blue (or red) shifts that can take place after election night. For that reason, the reported margins in called races will not be considered in this model. We will only rely on our preelection forecast, and the races that we have called on election night. Our model recognizes that the margin in a called race is still a random variable, but one that we know will be greater than zero for the party that the race is called for. By treating the called races as random variables, we will not be underestimating the variance in our updates. Consequently, our model will not be making any undue assumptions, and it will be able to steadily converge to the eventual outcome of the House. We outline this with a flowchart and example in Section 3.3.
This section has important examples that help to illustrate our methodology and show the benefits of adopting certain techniques.
Kentucky is usually one of the first states to report results on election nights. If the result in Kentucky ends up being substantially different than what was predicted, the estimates for the other states in the region will likewise move substantially with Kentucky. From a forecasting point of view, this can make predicting a Biden victory turn into predicting a Trump victory if not handled properly. The biggest reason that non-swing states like Kentucky can throw off a model’s estimate is because reliable polling data is scant and an intrinsic tendency of polls is to underestimate the margin in such states. Consider FiveThirtyEight’s 2016 election model, which was mostly a polls-based forecast (Silver, 2016). In Silver’s forecast, he greatly underestimated the actual margins in non-battleground states for both Trump and Clinton. Shown below is a plot of FiveThirtyEight’s residuals (actual − predicted Trump margin) versus actual margin, for both the 2016 and 2020 elections.
The FiveThirtyEight model in 2016, being mostly polling based, severely underestimated Trump’s margin in traditionally Republican states, while underestimating Clinton’s margin in traditionally Democratic states. Although FiveThirtyEight partially resolved this in 2020 by giving more weight to fundamentals, the issue of polling non-swing states is still a major source of model inaccuracy, causing even the best-calibrated models to underestimate the eventual margins in non-battleground states. DDHQ’s 2020 election model had a similar issue. DDHQ resolved the ‘Kentucky Problem’ in simulations by inflating the variances of non-swing states so that a margin much higher than expected will only marginally shift the estimates for other states, even if the correlation is relatively high. This variance inflation for noncompetitive races is also a key improvement from 2020 that we made in our forecast for the 2022 midterms (Williams et al., 2020).
As far as compatibility is concerned, scaling the variances of a correlation matrix does not pose a risk to the matrix’s full-rank status, as the scaling itself can be expressed as a full-rank diagonal matrix V. By Sylvester’s Rank Inequality, the product of two full-rank square matrices is always full-rank. So long as the initial correlation matrix is full-rank and positive-definite, scaling is a safe way to improve a covariance matrix. We can express this transformation as
We present a table showing the model adjustments for the first 10 states called in 2020, in terms of expected electoral votes for Trump. The constant variance model (Model 1) assumes a variance of 50 for all states, and the nonconstant variance model (Model 2) assumes a variance of
The nonconstant variance model stays closer to Trump’s eventual electoral vote count of 232. This more powerful model better addresses the Kentucky Problem, and leads to better predictions that match up well with the eventual result.
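The effect of variance inflation on the Kentucky Problem can be seen from the conditional mean alone: the shift applied to another state is the correlation, times the ratio of the two standard deviations, times the size of the surprise. The sketch below compares a constant variance of 50 for both states with a case where the early-reporting state's variance is inflated. The correlation of 0.6, the inflated variance of 200, and the 10-point surprise are hypothetical illustration values, not DDHQ's settings.

```python
import math


def implied_shift(rho, sd_observed, sd_target, surprise):
    """Change in the target state's conditional mean after a surprise in the observed state."""
    return rho * sd_target / sd_observed * surprise


rho = 0.6                   # hypothetical Kentucky-Ohio correlation
surprise_kentucky = 10.0    # observed margin minus forecast margin, in points
sd_ohio = math.sqrt(50.0)   # target state keeps a variance of 50 in both cases

# Constant variance: Kentucky also has variance 50.
shift_constant = implied_shift(rho, math.sqrt(50.0), sd_ohio, surprise_kentucky)
# Inflated variance: Kentucky's variance is raised to a hypothetical 200.
shift_inflated = implied_shift(rho, math.sqrt(200.0), sd_ohio, surprise_kentucky)
print(round(shift_constant, 2), round(shift_inflated, 2))
```

Quadrupling the variance of the non-swing state halves the shift it imposes on its neighbor while leaving the correlation itself untouched.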
We will use an example of a primary that took place in 2018. The Republican senatorial primary in Indiana was contested among three viable candidates: Rep. Luke Messer, Rep. Todd Rokita, and the eventual senator, Mike Braun. We know from our experience covering primaries that a candidate’s strength is most concentrated around where they live. From that, we can build a spatially weighted distribution centered around their locations. Their home county is where we would expect them to perform the strongest, and the further away a county is, the weaker they would be expected to perform.
In this hypothetical example, we model the prior strength of each candidate as follows: Let
Then, once some counties have reported, we can update our previous estimate by taking the conditional distribution for the strength of each candidate. We simulated draws from this conditional distribution to estimate the probability of each candidate winning. In this example, Braun won about 99% of them. Shown below is the predicted map after 20 randomly selected counties have reported, compared side-by-side with the actual results.
Finally, we present a table comparing the 95% confidence intervals of our simulations with the actual result. This represents the prediction for the other 72 counties given the results of 20 randomly selected counties.
First, we outline our thought process with the flowchart shown below:
Also, it is important to note that this methodology is perfectly applicable for Senate seats, and is actually much simpler to carry out, since there are only around 35 Senate races per cycle compared with all 435 House seats.
In the early stages of election night, DDHQ will likely be able to call many races that are rated as solidly Democratic or solidly Republican. Most of our simulations from our forecast will have these outcomes, so we can simply take the subset of simulations that have the desired outcome. Shown below is a simplified example of our process. It is 7:15 pm on election night. DDHQ has called IN-06 for Republicans and IN-07 for Democrats.
Since Simulations 1, 3, and 5 are the only ones that match the results of both calls, we would include only those simulations in the new subset. The new subset will also exclude the called seats, since their result is no longer relevant. This strategy is effective when only the safest seats are involved, but it cannot be sustained once we start to call the toss-up races, because even with an accurate forecast, we will quickly run out of simulations to subset. In the next section, we discuss how we can overcome the curse of dimensionality through rejection sampling.
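The subsetting step can be sketched as a filter over stored simulations. The example below mirrors the IN-06 and IN-07 calls described above, but the five sets of margins, the extra uncalled seat IN-05, and the sign convention (positive margin means a Republican win) are hypothetical choices made for illustration.

```python
def subset_by_calls(simulations, calls):
    """Keep simulations whose winners match every call, and drop the called seats."""
    kept = {}
    for sim_id, margins in simulations.items():
        matches = all((margins[seat] > 0) == (winner == "R")
                      for seat, winner in calls.items())
        if matches:
            kept[sim_id] = {seat: m for seat, m in margins.items() if seat not in calls}
    return kept


# Hypothetical simulated margins; positive values are Republican wins.
simulations = {
    1: {"IN-05": 3.0, "IN-06": 30.0, "IN-07": -25.0},
    2: {"IN-05": 8.0, "IN-06": 28.0, "IN-07": 1.0},     # IN-07 goes Republican: dropped
    3: {"IN-05": -2.0, "IN-06": 35.0, "IN-07": -20.0},
    4: {"IN-05": -9.0, "IN-06": -1.0, "IN-07": -30.0},  # IN-06 goes Democratic: dropped
    5: {"IN-05": 6.0, "IN-06": 33.0, "IN-07": -22.0},
}
calls = {"IN-06": "R", "IN-07": "D"}

subset = subset_by_calls(simulations, calls)
kept_ids = sorted(subset)
# Updated Republican win probability in the uncalled seat, from the surviving simulations.
prob_r_in05 = sum(1 for m in subset.values() if m["IN-05"] > 0) / len(subset)
print(kept_ids, round(prob_r_in05, 3))
```

Only the winner of each called race enters the filter; the reported margins are never used, which is what keeps the update robust to late-counted ballots.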
This model saves the most recent subset after each run. Our model has a ‘pivot point,’ which is when the last subset had over 10,000 simulations, and the new subset will have fewer than 10,000. Once we reach this point on election night, the model will find the sample mean and covariance of the old subset. This old subset, as well as the new subset, will be multivariate normally distributed (proof and simulation study in the Appendix). The multivariate normal property is important, especially once we simulate more draws. Then, using the sample mean and covariance as the parameters, we simulate extra draws from this distribution until we get 10,000 draws that have the desired outcome in the next seat (we ignore the previous seats, because the old subset is already derived from the outcomes of the previous seats). There is no risk of the sample covariance matrix not being positive-definite, since there will always be at least 10,000 samples from the initial distribution. This is a form of multivariate rejection sampling. We sample until we get 10,000 samples that have the desired outcome in the newly called seats. Shown below in Figures 9 and 10 is a pair of plots that illustrates the concept.
In the extremely rare event that the outcome of the next race does not occur in our subset, we cap the number of added simulations at 10,000,000, and will just take the 10,000 simulations closest to the desired outcome. So far in our testing, we have not needed to employ this scenario, even in the most extreme wave elections where a party wins 350 seats.
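The pivot-point procedure can be sketched on a small scale as follows. This is an illustration under stated assumptions rather than the production model: it uses three seats, a target of 500 accepted draws, and a cap of 100,000 proposals instead of 10,000 and 10,000,000; it keeps the called seat in the output for simplicity; and the helper names and the synthetic old subset are hypothetical. Sampling uses a hand-written Cholesky factor so that the block depends only on the standard library.

```python
import math
import random


def sample_mean_cov(rows):
    """Sample mean vector and covariance matrix of a list of equal-length rows."""
    n, k = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(k)]
    cov = [[sum((r[i] - mean[i]) * (r[j] - mean[j]) for r in rows) / (n - 1)
            for j in range(k)] for i in range(k)]
    return mean, cov


def cholesky(a):
    """Lower-triangular factor of a positive-definite matrix."""
    k = len(a)
    low = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1):
            s = sum(low[i][p] * low[j][p] for p in range(j))
            low[i][j] = math.sqrt(a[i][i] - s) if i == j else (a[i][j] - s) / low[j][j]
    return low


def draw_mvn(mean, low, rng):
    """One multivariate normal draw: mean plus the Cholesky factor times standard normals."""
    z = [rng.gauss(0.0, 1.0) for _ in mean]
    return [m + sum(low[i][p] * z[p] for p in range(i + 1)) for i, m in enumerate(mean)]


def rejection_resample(old_subset, seat, winner_positive, target, cap, rng):
    """Propose draws from the fitted normal; keep those matching the newly called seat."""
    mean, cov = sample_mean_cov(old_subset)
    low = cholesky(cov)
    accepted, rejected = [], []
    for _ in range(cap):
        draw = draw_mvn(mean, low, rng)
        if (draw[seat] > 0) == winner_positive:
            accepted.append(draw)
            if len(accepted) == target:
                return accepted
        else:
            rejected.append(draw)
    # Fallback once the cap is hit: top up with the rejected draws closest to the call.
    rejected.sort(key=lambda d: abs(d[seat]))
    return accepted + rejected[: target - len(accepted)]


# Synthetic old subset: three positively correlated seats (hypothetical means and scale).
rng = random.Random(7)
old_subset = []
for _ in range(2000):
    shared = rng.gauss(0.0, 1.0)
    old_subset.append([m + 5.0 * (0.7 * shared + math.sqrt(0.51) * rng.gauss(0.0, 1.0))
                       for m in (-1.0, 2.0, -3.0)])

# Seat 0 is called for the party with positive margins.
new_subset = rejection_resample(old_subset, seat=0, winner_positive=True,
                                target=500, cap=100000, rng=rng)
old_mean_seat1 = sum(r[1] for r in old_subset) / len(old_subset)
new_mean_seat1 = sum(r[1] for r in new_subset) / len(new_subset)
print(len(new_subset), round(old_mean_seat1, 2), round(new_mean_seat1, 2))
```

Because the seats are positively correlated, accepting only draws in which seat 0 goes to the called party also pulls the expected margin in seat 1 toward that party, without ever using a reported margin.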
Although we described several possible improvements to election forecasting in this article and in (Williams et al., 2020), our main purposes in this article were to improve the mechanics of modeling correlated outcomes and to extend that methodology to a wide variety of scenarios. There are many different forecasts out there, but all forecasts should be considering the mechanics of making outcomes correlated and thoroughly testing the probability of joint outcomes through conditional distributions, regardless of how they predict individual race margins and probabilities.
We recognize that other forecasters often try to do too much with too little information, like estimating thousands of separate correlation values with just a few previous elections. We decided that the best resolution was to simplify the map into regions, and to just use one correlation value for districts in different regions, and another for districts in the same region. Our approach may seem somewhat simplistic, but the covariance in our model is highly plausible and is calibrated on both preelection and election-night scenarios.
It was also important to show that our modeling is usable when some actual information is present. Through conditional distributions, models can be improved with results from any combination of districts, and will steadily converge to the eventual outcome. When building and testing a forecast, this approach also allows us to go back and revise our model in case our conditional distributions seem unrealistic. We use multivariate normal distributions to model the dependence in our forecast, but the methodology for those who want to use a multivariate t distribution is outlined in Appendix C.
This conditional distribution methodology can even be generalized to multiparty scenarios, so election night forecasting need not be limited to U.S. general elections every 2 years. There are dozens of primary nights in every 2-year cycle in the United States, where several candidates have viable shots at winning. Further, the multiparty methodology is applicable for elections outside the United States, since most European and Asian democracies work under parliamentary systems. As such, the multiparty methodology makes it possible to use such tools on a far more frequent basis.
It was also necessary to consider a more complicated but more realistic election-night scenario. Through our subsetting and rejection sampling process, it is now possible for forecasters to build accurate models that take into account election-night results. We are also able to overcome the dreaded curse of dimensionality in a very efficient manner, while only relying on called races, rather than trying to use margins when they are not determined yet. We skipped some important mathematical steps in that subsection, but we outline the math in detail with a case study in the appendices. Our live seat calculator makes fewer assumptions than previous models, and is very robust to the blueshifts and redshifts that can occur in the weeks after election night.
As a final note, some important proofs and footnotes are included in the appendices below. They include a proof of the multivariate normality of a subset, a simulation study furthering the proof, a statement of code and packages used, and a quick discussion of using a t distribution to model outcomes instead of a normal distribution.
Sydney Louit, Mukul Ram, Kiel Williams, Alex Alduncin, Patrick McCaul, and Scott Tranter have no financial or non-financial disclosures to share for this article.
Ding, P. (2016). On the conditional distribution of the multivariate t distribution. The American Statistician, 70(3), 293–295. https://doi.org/10.1080/00031305.2016.1164756
Gelman, A. (2020). Reverse-engineering the problematic tail behavior of the FiveThirtyEight presidential election forecast. https://statmodeling.stat.columbia.edu/
Hanretty, C. (2021). Forecasting multiparty by-elections using Dirichlet regression. International Journal of Forecasting, 37(4), 1666–1676. https://doi.org/10.1016/j.ijforecast.2021.03.007
Haynsworth, E. V. (1968). On the Schur complement. Basel Mathematical Notes (University of Basel), BMN 20, 17.
Heidemanns, M., Gelman, A., & Morris, G. E. (2020). An updated dynamic Bayesian forecasting model for the US presidential election. Harvard Data Science Review, 2(4). https://doi.org/10.1162/99608f92.fc62f1e1
Isakov, M., & Kuriwaki, S. (2020). Towards principled unskewing: Viewing 2020 election polls through a corrective lens from 2016. Harvard Data Science Review, 2(4). https://doi.org/10.1162/99608f92.86a46f38
Lauderdale, B. E., & Linzer, D. (2015). Under-performing, over-performing, or just performing? The limitations of fundamentals-based presidential election forecasting. International Journal of Forecasting, 31(3), 965–979. https://doi.org/10.1016/j.ijforecast.2015.03.002
Pebesma, E. (2018). Simple features for R: Standardized support for spatial vector data. The R Journal, 10(1), 439–446. https://doi.org/10.32614/RJ-2018-009
R Core Team. (2021). R: A language and environment for statistical computing. R Foundation for Statistical Computing. https://www.R-project.org/
Silver, N. (2016). A user’s guide to FiveThirtyEight’s 2016 general election forecast. https://fivethirtyeight.com/features/a-users-guide-to-fivethirtyeights-2016-general-election-forecast/
Venables, W. N., & Ripley, B. D. (2002). Modern applied statistics with S (4th ed.). Springer. https://doi.org/10.1007/978-0-387-21706-2
Williams, K., Ram, M., Shor, M., Jarugula, S., DeRemigi, D., Alduncin, A., & Tranter, S. (2020). Forecasting the 2020 U.S. elections with Decision Desk HQ: Methodology for modern American electoral dynamics. Harvard Data Science Review, 2(4). https://doi.org/10.1162/99608f92.9663befd
The purpose of this proof is to show that the distribution of
Let’s consider the Senate results in Arizona, Georgia, Nevada, and Pennsylvania. All four races are considered to be very close contests at the time of this writing (Summer 2022). Let’s assume that the races are multivariate normally distributed with the marginal distribution for each race as
While DDHQ prefers to use a multivariate normal distribution, some other forecasters may want to run simulations with a multivariate t distribution, since it has wider tails and thus can better account for extreme scenarios. Because multivariate normal distributions do not have tail dependence, they may fail to account for the truly extreme scenarios where multiple states are showing extreme results. The past few elections have had some surprising results, but none extreme enough as to necessitate a t distribution. FiveThirtyEight used t distributions with 10 degrees of freedom to model states in their 2016 forecast, found at (Silver, 2016), explaining that with the small number of elections to train on, extreme results are still possible. In order to generate observations of a multivariate t distribution with a given mean and covariance matrix, one can use the mvtnorm package in R. As for the calculation of conditional multivariate t distributions, we restate a result found in (Ding, 2016).
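For readers without R at hand, the wider joint tails of a multivariate t can be illustrated with the standard construction: a correlated normal draw divided by the square root of an independent chi-square variable over its degrees of freedom. The sketch below does this for two states and compares tail frequencies against the normal case. It is not the mvtnorm implementation; the function name, the correlation of 0.6, and the 3-unit threshold are hypothetical, and the 10 degrees of freedom follow the FiveThirtyEight setting mentioned above.

```python
import math
import random


def draw_pair(rho, rng, df=None):
    """Standardized draw for two correlated states; df=None gives the normal case."""
    z1 = rng.gauss(0.0, 1.0)
    z2 = rho * z1 + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
    if df is None:
        return z1, z2
    # A chi-square with df degrees of freedom is Gamma(shape=df/2, scale=2).
    chi2 = rng.gammavariate(df / 2.0, 2.0)
    w = math.sqrt(df / chi2)  # one shared multiplier fattens both tails together
    return z1 * w, z2 * w


rng = random.Random(10)
n_sims = 40000
threshold = 3.0  # in units of each state's scale parameter

normal_draws = [draw_pair(0.6, rng) for _ in range(n_sims)]
t_draws = [draw_pair(0.6, rng, df=10) for _ in range(n_sims)]

# Share of simulations in which the first state misses its forecast by more than the threshold.
tail_normal = sum(1 for a, _ in normal_draws if abs(a) > threshold) / n_sims
tail_t = sum(1 for a, _ in t_draws if abs(a) > threshold) / n_sims
print(round(tail_normal, 4), round(tail_t, 4))
```

Note that the scale used here is not the covariance: for df greater than 2, the covariance of a multivariate t equals its scale matrix multiplied by df / (df − 2), so a forecaster matching a target covariance would need to shrink the scale accordingly.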
R (R Core Team, 2021) is used for all calculations and simulations mentioned in this article, and the R package MASS (Venables & Ripley, 2002) is used for generation of correlated multivariate random variables. In the MASS package, the mvrnorm function is used to generate samples from a multivariate normal distribution. We use the sf package (Pebesma, 2018) to produce the maps shown in the Indiana example. Our code is publicly available on Harvard Dataverse at the link https://dataverse.harvard.edu/dataverse/ddhq-dataverse.
©2022 Sydney Louit, Mukul Ram, Kiel Williams, Alex Alduncin, Patrick McCaul, and Scott Tranter. This article is licensed under a Creative Commons Attribution (CC BY 4.0) International license, except where otherwise indicated with respect to particular material included in the article.