
Election Night Forecasting With DDHQ: A Real-Time Predictive Framework

2024 Election Theme
Published on Oct 18, 2024

Abstract

The Decision Desk HQ (DDHQ) team has built an innovative Live Primary Model for real-time prediction of primary election outcomes. This article provides a comprehensive overview of the model’s methodology, which combines live vote reporting, geospatial data, and demographic information to estimate candidates’ winning probabilities and expected vote shares. The model focuses on completed and nearly completed voting collection units (VCUs), utilizing a combination of geographic and demographic components to generate predictions. It employs a variety of sophisticated statistical techniques, including generalized estimating equation regression and copula approaches, to ensure accurate and consistent results. The model’s adaptability is enhanced through adjustable exogenous parameters, allowing for fine-tuning based on specific election dynamics. By leveraging the DDHQ Application Programming Interface (API) for continuous data updates, the Live Primary Model provides a robust framework for analyzing and forecasting primary election outcomes, offering an accurate depiction of election results in real time.

Keywords: elections, forecasting, modeling, politics


Media Summary

The Decision Desk HQ Live Primary Model introduces a scalable approach to real-time election night analysis of primary elections. The techniques used eliminate the need to set manual priors in each county, making the model highly adaptable and efficient across many primary elections. The Live Primary Model uses sophisticated analysis of geospatial and demographic data to predict candidate performance in outstanding votes.

The model’s foundation rests on a curated analysis of completed and nearly completed voting collection units (VCUs); near completion is defined in Section 2.2. In most states, VCUs are counties, while in the six New England states, the VCUs used are townships. By focusing on VCUs with sufficient votes to ensure stable topline results, the model avoids potential inaccuracies stemming from partial reports, enhancing its reliability.

The model initially creates prior estimates for each candidate using available polling data. These numbers are then refined into regional “strength” estimates by leveraging the understanding that candidates typically perform better in their home regions, providing preelection predictions. As votes begin to be reported, the model works to understand the profile of communities where each candidate performs better or worse, incorporating key demographic variables from the U.S. Census and American Community Survey. These include critical factors such as race, income, education, and partisanship, allowing for a more comprehensive analysis of voting patterns.

The Live Primary Model’s dynamic nature is one of its standout features. As election night progresses and more data becomes available, the model continuously adjusts the balance between its geographic and demographic components based on patterns from previous elections.

To account for the inherent uncertainty in election outcomes, the model runs 10,000 simulations for each model update. This method provides a range of potential outcomes and communicates the underlying uncertainty that is so important in elections forecasting.

Each primary election has unique dynamics due to the specific matchup of candidates and their varying racial and ideological differences. To account for this, the model considers three adjustable parameters that allow experts to fine-tune the model for each election’s particular matchup and characteristics.

The model’s integration with the DDHQ API ensures a constant stream of up-to-date data, further enhancing its real-time forecasting capabilities. This seamless data flow enables the model to provide timely and accurate predictions as results unfold.


1. Introduction

The Live Primary Model is a predictive tool that provides real-time predictions of primary election outcomes. The model leverages reported vote data to accurately estimate each candidate’s probability of winning and expected vote share, employing an innovative approach that incorporates geospatial and demographic data to formulate predictions.

Enhancing real-time election coverage, the model provides consumers with a more comprehensive and contextual analysis of results, addressing the limitations of reporting raw vote counts. Furthermore, the model serves as an invaluable tool for race call desks, enabling them to triage their analysis more effectively and make public race calls with greater accuracy and efficiency. The model directly incorporates our data collection, leveraging real-time results and proprietary turnout estimates. Its scalable architecture accommodates both geospatial and demographic factors, making it adaptable to various electoral contexts, including primaries and general elections even when little is known about the background of the race. This integration ensures consistency and allows for rapid, live updates, enhancing the model’s responsiveness and accuracy.

The Live Primary Model is designed to handle statewide races, such as presidential, gubernatorial, and Senate primaries. Our model best handles states with a large number of VCUs, requiring at least 15 for implementation. As the number of VCUs increases, the model’s accuracy and confidence intervals improve. This is because some VCUs report near-complete results while others are still counting, providing a more representative sample of the overall vote distribution.

The model remains up-to-date throughout election night by integrating a continuous feed of vote and turnout data from the DDHQ API. As new data is fed in, the model automatically produces new predictions, allowing users to track the evolving dynamics of the race in real time. Candidates tend to perform similarly in communities that are located near each other and share a similar demographic composition. By analyzing areas where votes are reported, the model can predict a candidate’s performance in unreported VCUs and aggregate these predictions to estimate the final expected margin.

Correct implementation relies on the availability of nearly completed county-level results on election night. This assumption does not hold in certain states (often in the western half of the United States), such as Washington, Oregon, California, Nevada, Arizona, Utah, Colorado, and Maryland, where mail-in voting is prevalent, and results in any VCUs will not be fully tabulated on election night. In these cases, the model’s accuracy may be compromised due to the potential differences in candidate vote share between early- and late-reporting mail-in ballots. While it is possible to adjust the model’s parameters to account for these states, serious caution should be exercised when interpreting the results.

After the model determines expected vote shares and confidence intervals for candidates in each county, it then runs thousands of simulations to model election outcome uncertainty. Millions of Americans follow election results in real time each year, often relying on limited information. The Live Primary Model offers a more comprehensive view than simply tracking the current vote count leader. By incorporating known data about outstanding votes, it provides crucial context.

The model is not only designed to assist the general public by providing real-time contextual vote updates but also to offer critical insights to political analysts. In recent years, allegations of election fraud after leaders switched in key races have become increasingly prominent, eroding trust in the electoral process. The Live Primary Model not only enables viewers to observe vote counts as they occur but also offers interpretive insights regarding the status of the election based on the votes already tallied. Americans are going to consume partial election results: Nielsen reports indicate that approximately one-third of American households tune into election-night programming, along with skyrocketing social media coverage (Nielsen, 2020). This increasing demand and heightened audience engagement reflect the public’s desire for a deeper understanding of the electoral processes in which they participate.

Moreover, vulnerabilities in reporting election night results (e.g., correcting for errors, erroneous exit polling numbers, differences in Election Day votes vs. mail-in ballot, etc.) have frequently led to mistrust and confusion among voters and even election officials (Pettigrew & Stewart, 2020). Several studies examining uncertainty related to how election results are reported outline opportunities for news outlets to reduce the public uncertainty (Beckers, 2020; Cai & Kay, 2024; Witzenberger & Diakopoulos, 2023). These opportunities to correct for uncertainty include improving the public’s awareness of how and when different votes come in, the errors associated with the different types of votes, and depicting these vote trends in readable graphs or tables. Our model can explicitly address these issues by displaying to the public how votes are likely to come in and what differences on election night voters should expect.

Live election coverage should prioritize accurate, insightful analysis over raw vote counts, which can be misleading. Accurately estimating final election results earlier benefits not only politically engaged individuals eager for quick results, but also less engaged voters who may follow partial returns without understanding the context—such as where outstanding votes are located and how they are expected to break. Providing clearer estimates helps prevent misinterpretation and improves public understanding of the electoral process.

In addition to providing valuable insights to election-night viewers, the model offers political analysts critical data for the strategic improvement of election processes. The Live Primary Model allows users not only to track how quickly counties report votes, but also evaluate how different reporting speeds affect the public’s understanding of election results. Currently, election reporting efficiency varies widely across counties and states. By identifying counties that delay result clarity, the model helps improve reporting processes, ensuring faster and clearer election outcomes for the public.

Decision Desk HQ runs a major call desk, and its race calls receive a large amount of media attention on every Election Day. The Live Primary Model is a useful tool for the call desk, but is a separate process. Even if the model reaches a 99% (or >99%) win probability for a candidate, the call team will wait for more certainty, focusing on rare edge cases that could unexpectedly swing the election, such as extreme turnout or regional vote shifts. Although the Race Call team will never call elections purely from the results of any model, the model provides another useful data point for internal analysis.

The Live Primary Model is particularly valuable due to its ability to accurately forecast primary races, which are generally considered more challenging to predict than general elections. This complexity arises from the less partisan nature of primaries, where candidates from the same party compete against one another. In general elections, election centers often rely on the partisan composition of a district or VCU as a primary indicator of a candidate’s likely vote share. However, in primary elections, such partisan metrics are far less predictive. Further, primary elections tend to experience lower voter turnout, which leads to greater variability in outcomes compared to general elections. Lower public interest in primaries also means there is often a lack of other highly predictive data, such as comprehensive polling and substantial fundraising figures, further complicating predictions. The Live Primary Model addresses these challenges by providing nuanced insights into the dynamics of these unpredictable races.

Despite the difficulties of predicting primary election outcomes, DDHQ’s model has demonstrated high effectiveness in back-testing. For the 2022 midterms, it accurately predicted winners in all 10 key statewide races before both DDHQ and Associated Press (AP) call desks, without any false calls or assigning >95% win probability to losing candidates. In 2024, the model outperformed both the DDHQ and AP call desks in seven key races, including the West Virginia GOP gubernatorial primary and several Republican presidential primaries (New Hampshire and second place in Iowa), despite fewer competitive statewide primaries.

2. Data Ingestion

2.1. Overview of Election Night Analysis

On election nights, political analysts nationwide strive to forecast race outcomes in real time by extrapolating the anticipated final vote tally for each candidate based on the reported data. The Live Primary Model formalizes the process employed by race call teams at DDHQ, the AP, and television networks.

To predict final race results, election night analysts aim to answer two key questions:

  1. Given fully reported data in a VCU, what is the expected composition of the fully reported data in other VCUs?

  2. Given partially reported data in a VCU, what is the expected composition of the fully reported data in that VCU?

2.2. Focus on Completed and Nearly Completed Counties

The Live Primary Model focuses solely on answering the first question, exclusively utilizing data from nearly completed and completed counties. A county is considered completed when all votes from election night have been reported, acknowledging that additional votes, from provisional and overseas ballots, may trickle in over the following week. The model also incorporates nearly completed counties, defined as counties where a sufficiently high proportion of the vote has been reported, such that the overall results are expected to change only minimally thereafter. Different states have varying laws and norms surrounding “vote type,” which includes absentee (AB), early voting (EV), and Election Day (ED) voting. These differences influence the methods by which people vote and when those votes are reported. As a result, the threshold at which a county is considered nearly completed varies by state. In some states, a significant portion of the vote may be cast through AB/EV, and these votes are often counted and reported before or on election night. In such cases, a higher percentage of the total vote count may be required before the county is considered “nearly completed,” since these early returns may not accurately represent the preferences of the electorate as a whole. For example, Connecticut townships that have reported 30% or more of their votes are considered nearly completed, while Ohio counties need to reach a 70% reporting threshold to be classified as nearly completed. This discrepancy is due to the differences in voting patterns and reporting practices between the two states.

After a county is nearly completed, it is added to the geographic and demographic models and affects conditional distributions/predictions for the rest of the uncompleted counties. By concentrating on data from these two categories of counties, the Live Primary Model provides a more reliable and accurate prediction of the final race outcomes, minimizing the potential for distortions caused by incomplete or unrepresentative data.

2.3. Challenges With Partially Reported Data

The second question is significantly harder to answer than the first, particularly in the current political climate where vote-type analysis is crucial for election night data analysis (Li et al., 2022). Different populations of American voters tend to use different voting methods, which are then tabulated, counted, and reported at varying times by many states (Absher & Kavanagh, 2023). Many states and counties do not specify which types of votes have been reported and which are outstanding, which makes it even more difficult to determine. In recent elections, these uncertainties were especially pronounced, complicating the ability of the model to answer Question 2. Even in counties with a single vote type, partially reported data can be misleading. Precincts within a county often have drastically different political compositions, making it inappropriate to assume that partially reported county data is a random sample (Baltz et al., 2022).

2.4. Rationale for Model Approach

Given the above issues with partially reported county data, it is relatively unproductive to attempt to answer Question 2 through modeling: the confidence intervals are wide, and it is difficult to estimate even a remotely accurate probability of whether the outstanding vote is ED or AB/EV. This is especially true since there are likely dozens of nearly completed or completed counties across the state that can be compared to historical precedent in a much more straightforward manner.

Completed or nearly completed county data remains the most reliable source for constructing sophisticated models to forecast election results. Counties within the same state often report at vastly different rates. Therefore, our model excludes less reliable, partially reported county data. Inclusion of such data can confound the higher-confidence, lower-variance imputations derived from completed or nearly completed counties.

Due to the aforementioned factors, the Live Primary Model focuses on completed and nearly completed counties in most statewide races. This approach helps to mitigate the potential for misleading conclusions that may arise from the geographical and political patterns present in partially reported data, as precincts within a county can have drastically different political compositions.

3. Geographic Model

The geographic component of the Live Primary Model predicts live election results by incorporating geographically dependent priors with election data from nearly completed counties. Before polls close, the model constructs a prior “strength” for each candidate in each VCU, based on the candidate’s home VCU and the distance between each VCU and their home VCU. The estimated strengths are then normalized to reflect an estimate of vote shares. Candidates are expected to obtain stronger support in their home regions, and consequently, the estimated strength decreases as the distance from the candidate’s home increases. The priors are adjusted based on a candidate’s polling average. If DDHQ/The Hill provides a polling average, it is used; otherwise, averages from OurCampaigns.com (Our Campaigns, n.d.) are used. In the absence of any polling data, priors allocate an equal vote share to each candidate before making the previously described geographic adjustment.

Eventually, these distributions are updated with actual election data to form conditional distributions, which provide refined predictions for each candidate in each VCU. The computation of the conditional distribution involves conditioning the priors on the known results and utilizing a pre-determined $n \times n$ covariance matrix that captures the spatial correlation between the $n$ VCUs. The covariance matrix is constructed by evaluating the distances between VCUs and adjusting for the estimated vote shares. In cases where a candidate who currently represents or has previously represented a district is running for a statewide office, the covariance between VCUs within the same U.S. House district is increased. As more results become available, the updating process integrates the priors with observed results, progressively giving greater weight to the observed data.
The covariance matrix $\bm{\Sigma}$ is constructed using the following equation for VCUs:

$$\Sigma_{ij} = \exp\left(-\frac{d_{ij}}{x}\right) \cdot \begin{cases} 1.5 & \text{if } i \text{ and } j \text{ are in the same district} \\ 1 & \text{otherwise} \end{cases}$$

where $d_{ij}$ is the pairwise distance between VCUs $i$ and $j$, and $x$ is the maximum pairwise distance. The covariance is then adjusted based on the estimated vote shares:

$$\Sigma_{ij} \leftarrow \Sigma_{ij} \cdot (1 - \text{est\_share}_i) \cdot (1 - \text{est\_share}_j).$$
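As a concrete illustration, the two covariance equations above can be sketched in a few lines of NumPy. The function name and arguments here are hypothetical; the production implementation and its tuning are proprietary.

```python
import numpy as np

def build_covariance(dists, est_shares, same_district, district_boost=1.5):
    """Sketch of the geographic covariance construction described above.

    dists: (n, n) pairwise distances between VCU centroids
    est_shares: (n,) estimated vote shares for one candidate
    same_district: (n, n) boolean, True where VCUs i and j share a House district
    """
    x = dists.max()  # maximum pairwise distance
    sigma = np.exp(-dists / x)
    # Boost covariance for VCUs sharing a U.S. House district
    sigma = np.where(same_district, district_boost * sigma, sigma)
    # Dampen covariance where the candidate's estimated share is high
    damp = 1.0 - est_shares
    sigma *= np.outer(damp, damp)
    return sigma
```

The `np.outer` call applies the share adjustment to every pair $(i, j)$ at once, which is equivalent to looping over the matrix entries.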

Next, we consider the strengths of the candidates. Let $i \in [1, n]$ denote each VCU and $m \in [1, M]$ denote the index of the candidates. The prior strength for candidate $m$ in VCU $i$ is calculated using the following equation:

$$\text{prior}_{im} = \begin{cases} \text{loc}_m & \text{if candidate } m \text{ is from unit } i \\ \text{loc}_m - (\text{loc}_m - \text{glo}_m) \cdot \left(\frac{d_{im}}{\max(d)}\right)^{0.1} & \text{otherwise} \end{cases}$$

where $\text{loc}_m$ and $\text{glo}_m$ are the local and global prior strengths, respectively, for candidate $m$, and $d_{im}$ is the distance between unit $i$ and candidate $m$’s home unit. Backtesting has found the best results with $\text{loc}_m$ set to 0.7 and $\text{glo}_m$ set to 0.3, but this is an area ripe for more exploration in the future.
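A minimal sketch of the prior-strength formula, assuming distances are measured from each VCU to the candidate's home VCU (so the home VCU has distance 0 and both cases of the piecewise definition collapse into one expression):

```python
import numpy as np

def prior_strength(d_home, loc=0.7, glo=0.3):
    """Prior strength for one candidate across all VCUs (hypothetical helper).

    d_home: (n,) distances from each VCU to the candidate's home VCU,
            with 0 for the home VCU itself.
    """
    decay = (d_home / d_home.max()) ** 0.1
    # At the home VCU the decay term is 0, leaving exactly loc;
    # at the most distant VCU it is 1, leaving exactly glo.
    return loc - (loc - glo) * decay
```

The slow 0.1 exponent makes strength fall off quickly near the home VCU and then flatten toward the global floor.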

The mean and covariance of the conditional distribution of vote strengths in the remaining VCUs are obtained using the following equations:

$$\mu_{\text{new}} = \mu_1 + \Sigma_{12} \Sigma_{22}^{-1} (y_2 - \mu_2)$$
$$\Sigma_{\text{new}} = \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21}$$

where $\mu_1$ and $\mu_2$ are vectors representing the prior means for the unknown and known units, respectively; $\Sigma_{11}$, $\Sigma_{12}$, $\Sigma_{21}$, and $\Sigma_{22}$ are the corresponding blocks of the covariance matrix $\bm{\Sigma}$; and $y_2$ represents the vote shares in observed VCUs.

These equations allow the model to update the priors based on the observed results and the spatial correlation captured in the covariance matrix. The resulting conditional distribution provides updated estimates of the vote shares for each candidate in each VCU.
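The update is standard Gaussian conditioning and can be sketched directly from the block equations. The helper below is illustrative only; the model's actual code is not public.

```python
import numpy as np

def condition_gaussian(mu, sigma, obs_idx, y_obs):
    """Conditional mean/covariance for unobserved VCUs given observed shares.

    mu: (n,) prior means; sigma: (n, n) covariance
    obs_idx: indices of observed VCUs; y_obs: their observed vote shares
    """
    n = len(mu)
    unk = np.setdiff1d(np.arange(n), obs_idx)
    s11 = sigma[np.ix_(unk, unk)]
    s12 = sigma[np.ix_(unk, obs_idx)]
    s22 = sigma[np.ix_(obs_idx, obs_idx)]
    # Solve against s22 rather than forming its inverse, for numerical stability
    k = s12 @ np.linalg.solve(s22, np.eye(len(obs_idx)))
    mu_new = mu[unk] + k @ (y_obs - mu[obs_idx])
    sigma_new = s11 - k @ s12.T
    return mu_new, sigma_new
```

As observed shares pull away from their priors, the conditional means of correlated unobserved VCUs shift with them, while the conditional covariance shrinks.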

4. Demographic Model

While a county’s location significantly influences voter choice, incorporating voter demographics into predictions is also highly effective for modeling voter behavior in outstanding counties. Recent research (Stauffer & Fisk, 2022) has shown that voters tend to choose candidates more similar to them demographically. The demographic component of the Live Primary Model operates without priors and does not contribute to the final predictions until data from at least six counties are reported.

The model’s demographic component incorporates several key variables from the 2021 American Community Survey (ACS) 5-year estimates at the county or township levels. These variables include racial composition (percentages of Black, Hispanic, Asian, and White residents), income, educational attainment, and partisanship (e.g., percentage of the vote won by the Democratic nominee in the most recent presidential election). The specific variables included in the demographic regression vary depending on the characteristics of the electorate in a given election, with a typical range of three to five variables per race.

For GOP primaries, the most informative variables are often related to population density, partisanship, and bachelor’s degree attainment rate. In GOP primary electorates with a significant proportion of Hispanic or Black voters, these demographic variables should also be included in the model to capture their potential influence on voting patterns. Since a higher proportion of Democratic voters are racial minorities, racial data is more predictive in Democratic primaries.

These conclusions are based on numerous analyses conducted by our team. To maintain the model’s integrity and competitive advantage, the specific values and weightings assigned to each variable are proprietary and dynamically adjusted for each election.

Once the number of completed counties exceeds the number of predictors by three, the demographic model is activated. This demographic model employs a generalized estimating equation (GEE) regression (Hardin & Hilbe, 2002) with a binomial family to predict the proportion of votes won by each candidate within grouped data. The demographic regression is trained on data from counties that are either nearly or fully completed, as ingested into the model. In cases where statewide candidates previously represented a U.S. House district and are now running for higher office, the model groups the data by their former House district. GEEs are especially applicable for this situation, where considering correlations between these groups and finding robust standard errors are important (Ziegler & Vens, 2010). This grouping has an especially large influence on the demographic model when few counties have reported and the specific district may be over- or underrepresented. After training, the model predicts the vote share for each candidate in the counties not yet included in the model. To ensure that the predicted vote shares are consistent with the expected total votes, the model normalizes these values so that the sum of all candidates’ vote shares is 100%.
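With an independence working correlation, a GEE with a binomial family reduces to a weighted binomial GLM, which can be fit by iteratively reweighted least squares (IRLS). The sketch below illustrates that reduced case with hypothetical county-level inputs; DDHQ's actual regression, grouping structure, and variable weights are proprietary.

```python
import numpy as np

def fit_binomial_glm(X, y, n_trials, iters=25):
    """Minimal IRLS fit of a binomial GLM with a logit link -- a stand-in
    sketch for the GEE regression described above (independence case).

    X: (k, p) county-level demographic predictors, with an intercept column
    y: (k,) observed vote share for the candidate in each county
    n_trials: (k,) total votes counted in each county (binomial denominators)
    """
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        eta = X @ beta
        mu = 1.0 / (1.0 + np.exp(-eta))          # fitted proportions
        w = n_trials * mu * (1.0 - mu)           # IRLS weights
        z = eta + (y - mu) / (mu * (1.0 - mu))   # working response
        wx = X * w[:, None]
        beta = np.linalg.solve(X.T @ wx, wx.T @ z)
    return beta

def predict_share(X_new, beta):
    """Predicted vote share for counties not yet ingested."""
    return 1.0 / (1.0 + np.exp(-(X_new @ beta)))
```

A full GEE additionally models within-group correlation (here, former House districts) and reports robust standard errors, which the simplified fit above omits.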

5. Prediction and Simulation

County-level predictions from the geographic and demographic models are combined into final county-level predictions through a weighted average of the two. The standard deviation of this combined prediction was estimated through backtesting, which determined how correlated the predictions of the two models tend to be and how much combining them decreases their respective standard errors.

At the beginning of the night, when few counties are completed, the geographic model is a far better predictor and is weighted much more heavily than the demographic model in the weighted average. The geographic model incorporates a more informative prior, while the demographic model is susceptible to overfitting and high-variance predictions due to the relatively low number of observations compared to predictors. As more counties report, the demographic model’s predictive power increases and its variance decreases. Meanwhile, the geographic model may produce more biased estimates by omitting crucial information. This leads the aggregation method to give greater weight to the demographic model. The geographic model remains an integral component of the overall model, even as the number of included counties grows, since the optimal prediction leverages both the available geographic and demographic data. The model weights the demographic and geographic predictions based on the number of reported counties. The demo_multiplier parameter determines the relative importance of demographics, with higher values favoring demographics over geography for a given number of reported counties.

To ensure the geographic model maintains a minimum contribution, the weighting function employs a threshold mechanism. Let $x$ be the number of reported counties and $t$ be the weighting factor. The calculation of $t$ uses $100 \wedge x$ (the smaller of 100 and $x$), maintaining a lower bound on the geographic model’s weight even when the number of reported counties is large. This threshold is particularly relevant for states with numerous counties or townships.

To determine the weighting of the different components of the Live Primary Model, the model performs the following calculation:

$$\lambda = f(x, \text{demo\_multiplier}) = \text{logit}^{-1}\left( \alpha \cdot \log(x) + \beta \cdot x + \gamma + \log(\text{demo\_multiplier}) + \log(\delta) \right) \quad (5.1)$$

where:

  • $x$ is the number of nearly completed counties

  • $\text{demo\_multiplier}$ is the exogenous demographic multiplier parameter

  • $\alpha$, $\beta$, $\gamma$, and $\delta$ are constants that can be adjusted to fit the specific model

  • $\lambda$ is increasing with respect to both $x$ and $\text{demo\_multiplier}$

The final prediction for the vote share of each candidate in the county, denoted by $p$, is calculated using the following formula:

$$p = \text{demo} \cdot \lambda + \text{geo} \cdot (1-\lambda) \quad (5.2)$$

where $\text{demo}$ represents the demographic prediction and $\text{geo}$ represents the geographic prediction. This formula combines the demographic and geographic predictions, weighted by the proportion $\lambda$, to produce the final vote share prediction for each candidate in the county.
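Equations 5.1 and 5.2 can be sketched as follows. The constant values below are placeholders chosen for illustration, since the calibrated constants are proprietary, and the assumption that the county-count cap enters before the logit is ours.

```python
import numpy as np

def inv_logit(z):
    return 1.0 / (1.0 + np.exp(-z))

def blend_weight(x, demo_multiplier, alpha=0.5, beta=0.01, gamma=-3.0, delta=1.0):
    """Equation 5.1: weight lambda placed on the demographic prediction.
    alpha/beta/gamma/delta are illustrative, not DDHQ's calibrated values."""
    x = min(100, x)  # cap on reported counties, keeping a floor on geography
    return inv_logit(alpha * np.log(x) + beta * x + gamma
                     + np.log(demo_multiplier) + np.log(delta))

def blend_prediction(demo, geo, lam):
    """Equation 5.2: final county-level vote share prediction."""
    return demo * lam + geo * (1.0 - lam)
```

With these placeholder constants, the demographic weight rises monotonically as more counties report and as demo_multiplier increases, matching the stated properties of $\lambda$.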

The constants in the expressions for $t$ and $\lambda$ have been determined through extensive backtesting on historical elections to optimize the predictive performance of the model; refining these constants further is a promising direction for future analysis.

By adjusting the demo_multiplier parameter, the model can be tuned to place more or less emphasis on the demographic prediction relative to the geographic prediction. This flexibility allows for the incorporation of domain knowledge and the ability to adapt the model to different contexts or election scenarios.

After determining the estimated mean and standard deviation for a candidate’s vote proportion in each county, the model performs 10,000 simulations. Each simulation begins by generating a turnout distribution based on the county turnout estimates from the DDHQ system. The expected vote shares by county (each county’s expected share of the total statewide vote) are used as the concentration parameter for a Dirichlet distribution. The Dirichlet distribution is particularly well-suited for turnout modeling because it incorporates uncertainty and variability in the county-level turnout estimates while ensuring that the simulated proportions are consistent with the overall estimated vote shares (Lin, 2016). By simulating various vote distributions across counties with the Dirichlet distribution, the model accurately represents the possible outcomes for each county’s share of the total vote.
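The turnout draw can be sketched with NumPy's Dirichlet sampler. The `concentration` scaling is a hypothetical knob standing in for the model's unpublished variance calibration; larger values concentrate draws more tightly around the expected shares.

```python
import numpy as np

def simulate_turnout_shares(expected_shares, concentration=500.0,
                            n_sims=10_000, seed=0):
    """Draw each county's share of the total statewide vote from a
    Dirichlet distribution centered on the expected shares.

    concentration: hypothetical scale; higher = less turnout variability.
    """
    rng = np.random.default_rng(seed)
    alpha = np.asarray(expected_shares) * concentration
    return rng.dirichlet(alpha, size=n_sims)  # each row sums to 1
```

Because every simulated row sums to one, each draw is a valid allocation of the statewide vote across counties, which is the property that motivates the Dirichlet choice.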

The model applies a copula approach (Sklar, 1959) to ensure that the range of generated candidate vote share values by county is realistic and consistent with the expected proportions. This involves transforming the generated values using the cumulative distribution function (CDF) of the standard normal distribution and then applying the inverse CDF of the gamma distribution to obtain the final proportions for each candidate in the unknown counties.
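A minimal sketch of the described copula step, assuming per-county gamma marginals with hypothetical shape/scale parameters and using SciPy for the CDF transforms:

```python
import numpy as np
from scipy import stats

def gaussian_gamma_copula(corr, shape, scale, n_sims=10_000, seed=0):
    """Correlated gamma draws via a Gaussian copula.

    corr: (k, k) correlation matrix between the unknown counties
    shape, scale: (k,) gamma parameters per county (hypothetical values)
    """
    rng = np.random.default_rng(seed)
    z = rng.multivariate_normal(np.zeros(corr.shape[0]), corr, size=n_sims)
    u = stats.norm.cdf(z)                       # correlated uniforms in (0, 1)
    return stats.gamma.ppf(u, a=shape, scale=scale)
```

The normal CDF maps correlated normals to correlated uniforms, and the inverse gamma CDF maps those uniforms onto realistic, positively supported vote-share draws while preserving the dependence structure.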

The model then combines the simulated proportions for the unknown counties with the known proportions from the other counties and the turnout modeling to calculate the final vote totals for each candidate in the full race for each simulation. The process runs thousands of times, capturing the uncertainty and variability in the election outcomes and determining the probability density function (PDF) for each candidate’s vote total at a race-wide level.

6. Exogenous Model Parameters

To enhance the Live Primary Model’s adaptability to each election’s unique dynamics, it incorporates three adjustable parameters: “demo_multiplier,” “flexibility,” and “nearly_comp.” These exogenous parameters allow experts to fine-tune the model’s performance based on their knowledge of each race’s specific characteristics and circumstances.

6.1. Demographic Component Multiplier

The demo_multiplier parameter determines the relative weight given to the demographic component of the model compared to the geographic component, given a constant number of nearly completed counties. Higher values assign greater weight to the demographic component of the model relative to the geographic component, reflecting assumptions about the specific primary election. The precise method for tuning the model, along with the formula incorporating the demo_multiplier, is presented in Equation 5.1. The default value for demo_multiplier is 1.5 for GOP primaries and slightly higher for Democratic primaries. This parameter should be higher in presidential primaries since candidates are less likely to have geographic support bases within the state (Panagopoulos et al., 2017). Conversely, if a candidate represents a U.S. House district, the parameter should be slightly lower, as the geographic model accounts for district-level effects. The optimal value for demo_multiplier depends on significant ideological or racial differences between candidates. In primaries where candidates diverge notably regarding ideology or race, the demo_multiplier should be set higher to reflect the increased influence of demographic factors on voter preferences.

6.2. Strictness of Assumptions

The flexibility parameter calibrates the model’s confidence in its assumptions about county turnout and candidate vote shares in unreported counties. A lower flexibility value increases this confidence. Mathematically, this parameter affects two key aspects of the model:

  • Candidate strength covariance: Lower flexibility values increase the covariance between candidate strength in different counties. This reflects the assumption that in more typical elections, competitive candidates maintain a higher minimum level of support across all VCUs. Conversely, higher flexibility values decrease this covariance, allowing for more extreme geographic splits in less typical elections.

  • Turnout variation: The flexibility parameter also influences the expected variation in turnout across VCUs. Lower values constrain turnout to more closely match the expected proportion of total votes for each VCU. Higher values allow for greater deviations, accommodating more extreme local effects on turnout in less typical elections.

Lower flexibility values are appropriate for elections with greater media attention, higher spending, and higher expected turnout. Higher values suit elections with unusual circumstances, such as inclement weather or low media attention and candidate spending.
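The two effects above can be sketched as follows. The scaling rules here (inverse coupling on the off-diagonal covariance, a Dirichlet-style concentration for turnout) are illustrative assumptions chosen to show the direction of each effect, not the paper's exact mathematics; the factor of 100 is arbitrary.

```python
import numpy as np

def flexibility_effects(base_cov, expected_turnout_props, flexibility):
    """Illustrative (not the paper's exact math) effects of `flexibility`.

    Lower flexibility -> stronger cross-county coupling of candidate
    strength and tighter turnout around the expected proportions.
    """
    # Lower flexibility -> larger off-diagonal covariance (more coupling).
    cov = np.asarray(base_cov, dtype=float) / flexibility
    np.fill_diagonal(cov, np.diag(base_cov))  # keep marginal variances fixed

    # Lower flexibility -> higher concentration (less turnout spread);
    # the 100.0 scale factor is an arbitrary illustrative choice.
    concentration = np.asarray(expected_turnout_props) * (100.0 / flexibility)
    return cov, concentration

base = np.array([[1.0, 0.3], [0.3, 1.0]])
props = [0.7, 0.3]
cov_tight, conc_tight = flexibility_effects(base, props, flexibility=0.5)
cov_loose, conc_loose = flexibility_effects(base, props, flexibility=2.0)
print(cov_tight[0, 1], cov_loose[0, 1])  # 0.6 vs 0.15
```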

6.3. Threshold for Data Ingestion

The nearly_comp parameter sets the percentage of votes that must be reported in a VCU for it to be considered nearly completed and ingested into the model. When a VCU is marked as nearly completed, the model assumes that the remaining vote from the VCU will have the same breakdown as the reported vote rather than relying on the prior estimate formed from geographic and demographic factors. Additionally, the votes in the nearly completed VCU are used to update the geographic and demographic models, influencing the predictions for other unreported counties. The appropriate value for nearly_comp depends on the reliability of the initial data and the expected shift between absentee/early voting (AB/EV) and Election Day (ED) voting. We typically set this parameter between 0.5 and 0.9. For Democratic primaries, we use values on the lower end of this range. When a Black candidate faces a White candidate with a significant proportion of votes coming from urban areas, we assign a higher value, as Black candidates tend to perform better in urban election districts. The higher the urban vote share, the greater the adjustment. We take into account factors specific to the state reporting system and the type of election. In cases of uncertainty, we err on the side of caution by setting nearly_comp higher to minimize potential bias in the model’s output.
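The thresholding step can be sketched as a simple partition of VCUs by reporting fraction. The dictionary shape and county names are illustrative only, not DDHQ's internal data model.

```python
def classify_vcus(vcus, nearly_comp=0.7):
    """Split VCUs by the nearly_comp reporting threshold.

    `vcus` maps VCU name -> fraction of expected vote reported. VCUs at
    or above the threshold are ingested (remaining votes assumed to match
    the reported breakdown); the rest keep their geo/demo prior.
    Illustrative sketch; not DDHQ's schema.
    """
    ingested = {k: v for k, v in vcus.items() if v >= nearly_comp}
    pending = {k: v for k, v in vcus.items() if v < nearly_comp}
    return ingested, pending

reporting = {"Kanawha": 0.92, "Monongalia": 0.55, "Berkeley": 0.71}
done, waiting = classify_vcus(reporting, nearly_comp=0.7)
print(sorted(done))     # ['Berkeley', 'Kanawha']
print(sorted(waiting))  # ['Monongalia']
```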

7. Case Study

The Republican primary for West Virginia’s 2024 gubernatorial race was a close contest between Attorney General Patrick Morrisey and former state lawmaker Moore Capito (son of West Virginia Senator Shelley Moore Capito). Morrisey narrowly won the nomination, securing 33% of the vote compared to Capito’s 28%. The primary was one of the most expensive in state history, with Morrisey and his supporters spending over $19 million, surpassing the $12 million spent by Chris Miller’s camp and the $4 million spent by Capito and his backers.

While Chris Miller and Mac Warner each garnered more than 10% of the vote, this Live Primary Model case study focuses solely on the margin between Patrick Morrisey and Moore Capito, the top two contenders in the Republican primary. Polling for this race was remarkably accurate, with preelection projections closely matching the final outcome. Polls predicted Morrisey would win by 4.9%, and he ultimately prevailed by 5.7%. This level of accuracy is unusual for primary elections, which typically see larger discrepancies between polls and results. Primary polls often miss by double digits due to the high number of swing voters who make their decisions late in the race.

Figure 1 shows the accuracy of county-level predictions from the Live Primary Model in this election as more votes were reported. Capito initially overperformed in northern West Virginia, particularly in Monongalia County. These early results were misleading, as they did not accurately reflect the overall trend. Unusually, preelection polling proved more reliable than initial returns. This led the model to overestimate Capito’s strength in central West Virginia, slightly reducing its accuracy as more results came in (as seen in Figure 1.2).

Figure 2 illustrates the improvement rate of the model’s predictions over time. As vote counting progressed, Morrisey’s victory became increasingly evident. His strong performance in the northwest and south, combined with his dominance in the eastern panhandle, secured his win. By 8:18 p.m. Eastern Time, with only 32% of votes counted, the average margin prediction error per county had shrunk to just 5.6%. The model gave Patrick Morrisey a greater than 99.9% chance of winning at 8:46 p.m. ET, well ahead of the DDHQ call desk, which called the race at 9:18 p.m. ET, and the AP call desk, which followed at 10:25 p.m. ET.

Figure 1. Comparison of model predictions vs. actual margins by county at various stages of vote reporting.

Figure 2. Error in DDHQ Live Primary Model’s predicted Morrisey vs. Capito margin by county, based on the percentage of vote reporting for the May 14, 2024, West Virginia GOP Gubernatorial Primary.

8. Discussion

8.1. Leveraging Partial Data

While the Decision Desk HQ Live Primary Model primarily uses data from nearly completed and completed counties, partially reported county data remains valuable. Focusing solely on VCU-level results instead of precinct-level data leads to information loss. The flow of election results from county reporting to front-end providers like DDHQ and the AP is complex, and the quality and depth of election night reporting by state government websites vary widely (Pettigrew & Stewart, 2020). For example, counties in Ohio and Kentucky exhibit predictable vote reporting patterns, while secretary of state websites in North Carolina and Georgia provide live, accurate, and detailed data. In states with high-quality reporting, more sophisticated models could potentially be developed using precinct and vote-type data. Implementing such models would necessitate a substantially more comprehensive election night process than currently employed by DDHQ or any other reporting agency, requiring extensive human resources and specialized data extraction tools to collect information from diverse sources such as state and county election websites and PDF reports.

8.2. Adapting to General Elections

The primary model’s methods can be readily applied to general elections, offering the same insights through a similar method. General elections present a simpler analytical challenge than primaries due to the availability of more reliable predictive data, including polling and past election results. These data can inform stronger priors at both the overall and VCU levels. While these priors should be more stable than in primary elections due to the increased available information, the model should still account for spatial and demographic factors. As in primary elections, candidates are likely to overperform their baseline in geographically proximate areas and those with similar demographic profiles.
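One way to form the stronger statewide prior described above is a convex combination of polling and past results. The weighting here is an assumption for illustration, not DDHQ's method.

```python
def general_election_prior(poll_margin, past_margin, poll_weight=0.7):
    """Illustrative statewide prior for a general election.

    Combines the polling-average margin with the previous election's
    margin; poll_weight = 0.7 is an arbitrary illustrative choice.
    """
    return poll_weight * poll_margin + (1 - poll_weight) * past_margin

# Polls show D+4, last cycle was D+10 -> prior of roughly D+5.8.
print(general_election_prior(poll_margin=0.04, past_margin=0.10))
```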


Disclosure Statement

Every author of this article either is employed, or has recently been employed, by Decision Desk HQ, a data science firm specializing in election reporting.


References

Absher, S., & Kavanagh, J. (2023). The impact of state voting processes in the 2020 election: Estimating the effects on voter turnout, voting method, and the spread of COVID-19. RAND Corporation. https://www.rand.org/pubs/research_reports/RRA112-25.html

Baltz, S., Agadjanian, A., & Chin, D. (2022). American election results at the precinct level. Scientific Data, 9, Article 651. https://doi.org/10.1038/s41597-022-01745-0

Beckers, K. (2020). The voice of the people in the news: A content analysis of public opinion displays in routine and election news. Journalism Studies, 21(15), 2078–2095. https://doi.org/10.1080/1461670X.2020.1809498

Cai, M., & Kay, M. (2024). Watching the election sausage get made: How data journalists visualize the vote counting process in U.S. elections. In F. F. Mueller, P. Kyburz, J. R. Williamson, C. Sas, M. L. Wilson, P. T. Dugas, & I. Shklovski (Eds.), CHI ’24: Proceedings of the CHI conference on human factors in computing systems (Article 391). ACM. https://doi.org/10.1145/3613904.3642329

Hardin, J. W., & Hilbe, J. M. (2002). Generalized estimating equations. Chapman & Hall/CRC. https://doi.org/10.1201/9781420035285

Li, Y., Hyun, M., & Alvarez, R. M. (2022). Why do election results change after Election Day? The “blue shift” in California elections. Political Research Quarterly, 75(3), 860–874. https://doi.org/10.1177/10659129211033340

Lin, J. (2016). On the Dirichlet distribution [Unpublished master’s thesis]. Queen’s University.

Nielsen. (2020, November). Media Advisory: 2020 election draws 56.9 million viewers during prime. https://www.nielsen.com/news-center/2020/media-advisory-2020-election-draws-56-9-million-viewers-during-prime/

Our Campaigns. (n.d.). Candidate information and election results. www.ourcampaigns.com

Panagopoulos, C., Leighley, J. E., & Hamel, B. T. (2017). Are voters mobilized by a ‘friend-and-neighbor’ on the ballot? Evidence from a field experiment. Political Behavior, 39, 865–882. https://doi.org/10.1007/s11109-016-9383-3

Pettigrew, S., & Stewart, C., III. (2020). Protecting the perilous path of election returns: From the precinct to the news. The Ohio State Technology Law Journal, 16(2), 587–638.

Sklar, M. (1959). Fonctions de répartition à N dimensions et leurs marges. Annales de l’ISUP, 8, 229–231.

Stauffer, K., & Fisk, C. (2022). Are you my candidate? Gender, undervoting, and vote choice in same-party matchups. Politics & Gender, 18(3), 575–604. https://doi.org/10.1017/S1743923X20000677

Witzenberger, B., & Diakopoulos, N. (2023). Election predictions in the news: How users perceive and respond to visual election forecasts. Information, Communication & Society, 27(5), 951–972. https://doi.org/10.1080/1369118X.2023.2230267

Ziegler, A., & Vens, M. (2010). Generalized estimating equations: Notes on the choice of the working correlation matrix. Methods of Information in Medicine, 49(5), 421–425. https://doi.org/10.3414/ME10-01-0026


©2024 Zachary Donnini, Sydney Louit, Shelby Wilcox, Mukul Ram, Patrick McCaul, Arianwyn Frank, Matt Rigby, Max Gowins, and Scott Tranter. This article is licensed under a Creative Commons Attribution (CC BY 4.0) International license, except where otherwise indicated with respect to particular material included in the article.
