Curating a COVID-19 data repository and forecasting county-level death counts in the United States

As the COVID-19 outbreak continues to evolve, accurate forecasting continues to play an extremely important role in informing policy decisions. In this paper, we collate a large data repository containing COVID-19 information from a range of different sources. We use this data to develop several predictors and prediction intervals for forecasting the short-term (e.g., over the next week) trajectory of COVID-19-related recorded deaths at the county-level in the United States. Specifically, using data from January 22, 2020, to May 10, 2020, we produce several different predictors and combine their forecasts using ensembling techniques, resulting in an ensemble we refer to as Combined Linear and Exponential Predictors (CLEP). Our individual predictors include county-specific exponential and linear predictors, an exponential predictor that pools data together across counties, and a demographics-based exponential predictor. In addition, we use the largest prediction errors in the past five days to assess the uncertainty of our death predictions, resulting in prediction intervals that we refer to as Maximum (absolute) Error Prediction Intervals (MEPI). We show that MEPI is an effective method in practice with a 94.5\% coverage rate when averaged across counties. Our forecasts are already being used by the non-profit organization, Response4Life, to determine the medical supply need for individual hospitals and have directly contributed to the distribution of medical supplies across the country. We hope that our forecasts and data repository can help guide necessary county-specific decision-making and help counties prepare for their continued fight against COVID-19. All collected data, modeling code, forecasts, and visualizations are updated daily and available at \url{https://github.com/Yu-Group/covid19-severity-prediction}.


Introduction
In recent months, COVID-19 has dramatically changed the shape of our global society and economy, to an extent modern civilization has never experienced. Unfortunately, the vast majority of countries, the United States included, were thoroughly unprepared for the situation we now find ourselves in. There are currently many new efforts aimed at understanding and managing this evolving global pandemic. This paper, together with the data we have collated, represents one such effort.
Our goal is to provide access to a large data repository (combining data collected from a range of different sources) and to provide predictors that forecast short-term COVID-19 mortality at the county level in the United States, along with uncertainty assessments of these predictors in the form of prediction intervals.
Predicting the short-term impact of the virus in terms of the number of deaths (e.g., over the next week) is critical for many reasons. Not only can it help elucidate the overall impacts of the virus, but it can also help guide difficult policy decisions, such as when to impose or ease lock-downs. While many other studies focus on predicting the long-term trajectory of COVID-19, these approaches are currently difficult to verify due to a lack of long-term COVID-19 data. On the other hand, predictions for immediate short-term trajectories are much easier to verify and are likely much more accurate than long-term forecasts. Moreover, other predictive efforts have focused on modeling COVID-19 case or death counts at the national or state level, rather than the more fine-grained county level that we consider in this paper.
As a result of researchers across academia and industry collectively refocusing their efforts towards combating this universal viral threat, there are now a large number of papers covering many dimensions of COVID-19 (see Section 6 for related work). However, to the best of the authors' knowledge, there is no related work addressing county-level predictions of COVID-19.
The predictions we produce in this paper focus on confirmed death counts, rather than confirmed cases, since confirmed cases fail to accurately capture the true prevalence of the virus due to limited testing availability. Moreover, comparing different counties based on confirmed cases is difficult since some counties have performed many more tests than others: the number of positive tests does not equal the number of actual cases. We note that the confirmed death count is also likely to be an under-count of the number of true COVID-19 deaths (since it seems as though in many cases only deaths occurring in hospitals are being counted). Nonetheless, the confirmed death count is believed to be more reliable than the confirmed case count.
In Section 2, we introduce our data repository and summarise the data sources contained within. This data includes a wide variety of COVID-19-related information in addition to the county-level case and death counts.
In Section 3, we introduce our predictive approach, wherein we fit a range of different exponential and linear predictor models using the data. Each predictor captures a different aspect of the behaviors exhibited by COVID-19, both spatially and temporally, i.e., across regions and time. The predictions generated by the different methods are combined using an ensembling technique, which we refer to as Combined Linear and Exponential Predictors (CLEP).
In Section 4, we develop uncertainty estimates for our predictors in the form of prediction intervals, which we call Maximum (absolute) Error Prediction Intervals (MEPI). The ideas behind these intervals come from conformal inference [44].
Section 5 details the evaluation of the predictors and the prediction intervals. Overall, we find that our predictions are adaptive to the exponential and sub-exponential nature of the COVID-19 outbreak, and our prediction intervals are reasonably narrow and cover the recorded number of deaths on more than 90% of days from April 11 to May 10 for most counties in the US.
Finally, we describe related work by other authors in Section 6, we discuss the impact of our work in distributing medical supplies across the country in Section 7, and conclude in Section 8. Supporting material is provided in the Appendix.
Making both the data and the methods used in this paper accessible to others is key to ensuring the usefulness of these resources. Thus, the data, code, and predictors we discuss in this paper are available as open-source on GitHub and are updated daily. The results in this paper cover case and death information in the U.S. from January 22, 2020, to May 10, 2020, but the data and forecasts in the GitHub repository are updated daily. We also provide several visualizations of the daily updated data and forecasts. In Figure 1, we provide a high-level summary of the contributions made in this work.

Figure 1: An overview of the paper. We curate an extensive data repository combining data from multiple data sources. We then build several predictors for county-level predictions of cumulative COVID-19 death counts, and develop an ensembling procedure (CLEP) and a prediction interval scheme (MEPI) for these predictions. Both CLEP and MEPI are generic machine learning methods and can be of independent interest (see Sections 3.6 and 4.1, respectively). All the data, predictions, and visualizations are publicly available on GitHub.

COVID-19 data repository
One of our primary contributions is the curation of a COVID-19 data repository that we have made publicly available on GitHub. It is also updated daily with new information. Specifically, we have compiled and cleaned a large corpus of hospital- and county-level data from 20+ public sources to aid data science efforts to combat COVID-19. The repository currently includes data on COVID-19-related cases, deaths, demographics, health resource availability, health risk factors, social vulnerability, and other COVID-19-related information. In Table 1, we provide an overview of the county-level and hospital-level data sources in our repository. The full corpus of data, along with further details and extensive documentation, is available on GitHub. Note that similar but complementary county-level data was recently aggregated and released in another study [33].

Table 1: COVID-19 Data Repository: overview of county- and hospital-level data sets. Data sets marked with an asterisk (*) were used in the predictors discussed in this work.
In Sections 3, 4 and 5, we primarily use the county-level case and death reports provided by USA Facts from January 22, 2020 to May 10, 2020, along with some county-level demographics and health data (the datasets marked with an asterisk (*) in Table 1).

Predictors for forecasting short-term death counts
As a first step, we provide a visualization of the COVID-19 outbreak across the United States in Figure 2. We plot (a) the cumulative recorded death counts due to COVID-19 up to May 10 (top panel), and (b) the new death counts from May 1 to May 10 (bottom panel). Each bubble denotes the count in a county, a darker and larger bubble denotes a higher death count, and the lack of a bubble denotes that the count is zero. The top panel captures the extent of the outbreak in a region, while the bottom panel captures the recent trends in the outbreak. The color scale in the two plots is different to better illustrate the respective counts in each plot, but we keep the scales for the bubble size the same to provide a comparison between the extent and recent trend of COVID-19. Overall, Figure 2 clearly shows that the COVID-19 outbreak in the United States is very dynamic both in time and across different regions. The worst-affected regions include the states of New York, New Jersey, Massachusetts, Michigan, and parts of Illinois, Florida, Louisiana, Georgia, Washington, and California. Moreover, the majority of these areas continue to face a substantial COVID-19 burden in early May.
We fit several different statistical predictor models to capture the dynamic behavior of COVID-19 death counts. Since each method captures slightly different trends in the data, we also compute various weighted combinations of these models. The five predictors we introduce in this paper are:

1. A separate-county exponential predictor (the "separate" predictors): a series of predictors built for each county using only data from that county, used to predict deaths in that county.

2. A shared-county exponential predictor (the "shared" predictor): a single predictor built using data from all counties, used to predict death counts for individual counties.

3. An expanded shared-county exponential predictor (the "expanded shared" predictor): a predictor similar to the shared-county exponential predictor, but which also includes COVID-19 case numbers and neighboring county cases and deaths as predictive features.

4. A demographics shared-county exponential predictor (the "demographics shared" predictor): a predictor also similar to the shared-county exponential predictor, but which also includes various county demographic and health-related predictive features.

5. A separate-county linear predictor (the "linear" predictor): a predictor similar to the separate-county exponential predictors, but which uses a simple linear form rather than the exponential form.
An overview of these predictors is presented in Table 2. In order to combine the different trends captured by each of these predictors, we also fit various combinations of them, which we refer to as Combined Linear and Exponential Predictors (CLEP). CLEP produces a weighted average of the predictions from the individual predictors, where we borrow the weighting scheme from prior work [42]. In this weighting scheme, a higher weight is given to those predictors with more accurate predictions, especially on recent time points. In practice, we find that the CLEP that combines only the expanded shared predictor and the linear predictor consistently has the best predictive performance. For the rest of this section, we expand upon the individual predictor models and the weighting procedure for the CLEP ensembles.

3.1 The separate-county exponential predictors (the "separate" predictors)

The separate-county exponential predictor aims to capture the reported exponential growth of COVID-19 deaths [37]. We approximate an exponential curve for the death count separately for each county using the most recent 5 days of data from that county. These predictors have the following form:

$$\text{deaths}_t = e^{\beta_0 + \beta_1 t}, \qquad (1)$$

where $t$ denotes the day, and we fit a separate predictor for each county. The coefficients $\beta_0$ and $\beta_1$ are fit for each county using maximum likelihood estimation under a Poisson generalized linear model (GLM) with $t$ as the independent variable and $\text{deaths}_t$ as the observed variable. If the first death in a county occurred less than 5 days prior to fitting the predictor, only the days from the first death onward were used for the fit. If there is only one day's worth of data, we simply predict the most recent value for future values. We also fit exponential predictors to the full time-series of available data for each county (as opposed to just the most recent 5 days), but due to the rapidly shifting trends, these performed worse than our 5-day predictors. We also found that predictors fit using 6 days of data yielded similar results to predictors fit using 5 days of data, while using 4 days of data performed slightly worse. To handle possible overdispersion of the data (when the variance is larger than the mean), we also explored estimating $\beta_0, \beta_1$ by fitting a negative binomial regression model (in place of the Poisson GLM) with inverse-scale parameter taking values in $\{0.05, 0.15, 1\}$. However, we found that this approach yields a larger mean absolute error than the Poisson GLM for counties with more than 10 deaths.
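To make the fitting procedure concrete, the following is a minimal pure-Python sketch of fitting $\text{deaths}_t = e^{\beta_0 + \beta_1 t}$ by Newton-Raphson maximization of the Poisson log-likelihood. The function names are our own illustration, not the repository's code; in practice one would use a GLM library.

```python
import math

def fit_poisson_exponential(deaths, n_iter=50):
    """Fit deaths_t ~ Poisson(exp(b0 + b1*t)) by Newton-Raphson MLE.

    `deaths` is the list of death counts for the most recent days of a
    single county (oldest first), with t = 0, 1, ... indexing the days.
    """
    ts = list(range(len(deaths)))
    # Initialize at a flat curve through the mean count.
    b0 = math.log(max(sum(deaths) / len(deaths), 1e-8))
    b1 = 0.0
    for _ in range(n_iter):
        # Score (gradient) and negative Hessian of the Poisson log-likelihood.
        g0 = g1 = h00 = h01 = h11 = 0.0
        for t, y in zip(ts, deaths):
            mu = math.exp(b0 + b1 * t)
            g0 += y - mu
            g1 += (y - mu) * t
            h00 += mu
            h01 += mu * t
            h11 += mu * t * t
        det = h00 * h11 - h01 * h01
        if abs(det) < 1e-12:
            break
        # Newton step: beta <- beta + A^{-1} g, with A the negative Hessian.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

def predict(b0, b1, t):
    """Predicted deaths on day t under the fitted exponential curve."""
    return math.exp(b0 + b1 * t)
```

The two-parameter Newton solver above is only meant to show that the fit is a standard concave maximum-likelihood problem; a library GLM routine (e.g., with a Poisson family and log link) would be the usual choice.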

The shared-county exponential predictor (the "shared" predictor)
To incorporate additional data into our predictions, we fit a predictor that combines data across different counties. Rather than producing a separate predictor model for each county (as in the separate predictor approach above), we instead produce a single shared predictor that pools information from counties across the nation. The shared predictor is then used to predict future deaths in the individual counties.
The data underlying the shared predictor are slightly different from those of the separate county predictors. Instead of only including the most recent 5 days of data from each county, we include all days after the third death in each county. Thus the data from many of the counties extend substantially further back than 5 days. By using data that extend much further back, the early-stage data from counties that are now much further along can inform the predictions for counties currently at an earlier stage. Instead of basing the exponential prediction on time $t$ (as was the case for the separate predictors above), we base the prediction on the (logarithm of the) previous day's death count. This makes the counties comparable, since the outbreaks began at different time points in each county. The shared predictor is given as follows:

$$\text{deaths}_t = e^{\beta_0 + \beta_1 \log(\text{deaths}_{t-1} + 1)},$$

where the coefficients $\beta_0$ and $\beta_1$ are fitted by maximizing the log-likelihood corresponding to the Poisson GLM (as in the separate county predictor (1)).

The expanded shared predictor
Next, we expand the shared county exponential predictor to include other COVID-19 dynamic (time-series) features. In particular, we include the number of confirmed cases in the county, as this may give an additional indication of the severity of an outbreak, as well as the number of confirmed deaths and cases in neighboring counties. Let $\text{cases}_t$, $\text{neigh\_deaths}_t$, $\text{neigh\_cases}_t$ respectively denote the number of cases in the county at time $t$, the total number of deaths across all neighboring counties at time $t$, and the total number of cases across all neighboring counties at time $t$. Then our (expanded) predictor for the number of confirmed deaths $k$ days into the future is given by

$$\text{deaths}_{t+k} = e^{\beta_0 + \beta_1 \log(\text{deaths}_t + 1) + \beta_2 \log(\text{cases}_t + 1) + \beta_3 \log(\text{neigh\_deaths}_t + 1) + \beta_4 \log(\text{neigh\_cases}_t + 1)},$$

where the coefficients $\{\beta_i\}_{i=0}^{4}$ are shared across all counties and are fitted using the Poisson GLM. When fitting the predictor at time $t$, we use the death counts for the county up to $t-1$; however, we only use the new features (cases in the current county, cases in neighboring counties, and deaths in neighboring counties) up to time $t-k$. When predicting the death count for a given county $k$ days into the future (for time $t+k$), we iteratively use the daily sequential predictions for the death counts for that county, and use the information for the other features only up to time $t$ (the time up to which we have data available). It may be possible to jointly predict the new features along with the number of deaths, but we leave this to future work.
For this predictor, we found it beneficial to implement feature scaling and regularization. We scaled all features to have mean 0 and variance 1, and applied an elastic net with an equal penalty on the $\ell_1$ and $\ell_2$ regularization terms. The regularization penalty of 0.01 was chosen through cross-validation on previous days' data.

The demographics shared predictor
The demographics shared county exponential predictor (the "demographics shared" predictor) is again very similar to the shared predictor. However, it includes several static county demographic and healthcare-related features to address the fact that some counties will be affected more severely than others, for instance due to (a) their population makeup, e.g., older populations are likely to experience a higher death rate than younger populations, (b) their hospital preparedness, e.g., if a county has very few ICU beds relative to their population, they might experience a higher death rate since the number of ICU beds is correlated strongly (0.96) with the number of ventilators [41], and (c) their population health, e.g., age, smoking history, diabetes, cardiovascular disease, and respiratory diseases are all considered to be likely risk factors for acute COVID-19 infection [30,40,31,29,46].
For a county $c$, given a set of demographic and healthcare-related features $d_1^c, \ldots, d_m^c$ (such as median age, population density, or number of ICU beds), the demographics shared predictor is given by

$$\text{deaths}_t = e^{\beta_0 + \beta_1 \log(\text{deaths}_{t-1} + 1) + \beta_{d_1} d_1^c + \cdots + \beta_{d_m} d_m^c},$$

where the coefficients $\{\beta_0, \beta_1, \beta_{d_1}, \ldots, \beta_{d_m}\}$ are fitted by maximizing the log-likelihood of the corresponding Poisson generalized linear model. The features we choose fall into three categories:

The separate county linear predictor (the "separate linear" predictor)
We also fit a linear version of the separate county predictors based on the most recent 4 days of data in each county. The motivation for the linear model is that some counties are now exhibiting sub-exponential growth. For these counties, the exponential predictors introduced in the previous sections may not fit the data well. The separate linear predictors are given by

$$\text{deaths}_t = \beta_0 + \beta_1 t,$$

where we fit the coefficients $\beta_0$ and $\beta_1$ using ordinary least squares. In the following section, we introduce the Combined Linear and Exponential Predictor (CLEP), which incorporates the abilities of our exponential predictors (to capture exponential trends) and our linear predictor (to capture sub-exponential trends). In practice, we found that combining the expanded shared predictor and the linear predictor gives the best predictive performance.
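The ordinary-least-squares fit for this predictor reduces to the usual closed form over the most recent 4 days; a minimal sketch (function name ours):

```python
def fit_linear(deaths):
    """Closed-form OLS fit of deaths_t = b0 + b1*t, with t = 0, 1, ...
    indexing the most recent days of one county (oldest first)."""
    n = len(deaths)
    t_mean = (n - 1) / 2
    y_mean = sum(deaths) / n
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(deaths))
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    b1 = sxy / sxx          # slope: new deaths per day
    b0 = y_mean - b1 * t_mean
    return b0, b1
```

Forecasting one day ahead is then just evaluating `b0 + b1 * n` for the next time index.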

The combined predictors: CLEP
Finally, we consider various combinations of the five predictors we have introduced, using an ensemble approach similar to that described in [42]. The Combined Linear and Exponential Predictors (CLEPs) are developed as follows. Let us first consider the procedure for generating a combined predictor for any two of our predictors. Let $\hat{y}_{t+k}^1$ and $\hat{y}_{t+k}^2$ be the predictions of (cumulative) deaths by day $t+k$ made on day $t$ by the two predictors, which we index arbitrarily as predictors 1 and 2. Note that on day $t$ we only have access to complete confirmed case and recorded death data up to day $t-1$, because recorded deaths and confirmed cases are not fully updated until the end of the day. The combined prediction of deaths by day $t+k$ can be written as

$$\hat{y}_{t+k} = w_t^1 \hat{y}_{t+k}^1 + w_t^2 \hat{y}_{t+k}^2,$$

where $w_t^1 \geq 0$ and $w_t^2 \geq 0$ represent the weights of the first and second predictors, respectively, with $w_t^1 + w_t^2 = 1$. We select the weights for the two predictors based on their past predictive performance, using an exponential decay term (a function of $t$). As a result, more recent predictive performance has more influence on the weight term than less recent performance. Let $\hat{y}_i^m$ (where $m = 1, 2$) denote the predicted number of deaths from predictor $m$ for day $i$, $y_i$ denote the recorded deaths for day $i$, and $\ell(\hat{y}_i^m, y_i)$ denote a loss function (used for measuring predictive performance). Then, following [42], the exponential weighting term $w_t^m$ for predictor $m$ applied on day $t$ is (up to normalization) given by

$$w_t^m \propto \exp\Big( -\frac{1}{c} \sum_{i=t_0}^{t-1} \mu^{t-i}\, \ell(\hat{y}_i^m, y_i) \Big),$$

where $\mu \in (0, 1)$ and $c > 0$ are tuning parameters, $t_0$ represents some past time point, and $t$ represents the day on which the prediction is calculated. Since $\mu < 1$, the $\mu^{t-i}$ term gives greater influence to more recent predictive performance.
Note that the loss terms $\ell(\hat{y}_i^m, y_i)$ used in the weights are calculated based on the 3-day-ahead predictions generated over the course of a week, starting with the predictor built 11 days ago (for predicting counts 8 days ago) up to the predictor built 4 days ago (for predicting yesterday's counts). We found that using the 3-day predictive performance of each model across the past week in the weights performed well.
In [42], the authors choose $\ell(\hat{y}_i^m, y_i) = |\hat{y}_i^m - y_i|$ as their loss function, since their errors roughly had a Laplacian distribution. In our case, we found that using this loss function led to vanishing weights due to the heavy-tailed nature of our error distribution. To help address this, we apply a logarithm to the predictions and the true values, and define

$$\ell(\hat{y}_i^m, y_i) = \big| \log(1 + \hat{y}_i^m) - \log(1 + y_i) \big|,$$

where we add a one inside the logarithm to handle potential zero values. We found that this transformation improved performance in practice. To generate our predictions, we use the default value of $c$ in [42], which is 1. However, we change the value of $\mu$ from the default of 0.9 to 0.5 for two reasons: (i) we found $\mu = 0.5$ yielded better empirical performance, and (ii) it ensured that performance more than a week ago had little influence over the predictor. We chose $t_0 = t - 7$ (i.e., we aggregate the predictions of the past week into the weight term), since we found that performance did not improve by extending further back than 7 days. (Moreover, the information from more than a week ago effectively has a vanishing effect due to our choice of $\mu$.) Thus, in practice, we used weights for predictor $m$ of the form

$$w_t^m \propto \exp\Big( -\sum_{i=t-7}^{t-1} (0.5)^{t-i} \big| \log(1 + \hat{y}_i^m) - \log(1 + y_i) \big| \Big),$$

where $\hat{y}_i^m$ is the 3-day-ahead prediction from predictor $m$ trained on data up to time $i - 3$. We compute these weights separately for each county.
Our CLEP ensemble approach can be easily extended to more than two predictors. Given $M$ predictors, we compute each weight $w_t^m$ in the same way, and then normalize them so that $\sum_{m=1}^{M} w_t^m = 1$.
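The weighting scheme (with $c = 1$, $\mu = 0.5$, and the log-scale loss) can be sketched as follows; the function name and data layout are our own illustration, not the repository's code:

```python
import math

def clep_weights(past_preds, past_obs, mu=0.5, c=1.0):
    """Compute normalized CLEP ensemble weights from past predictions.

    past_preds[m] holds predictor m's 3-day-ahead predictions for days
    t-7, ..., t-1 (oldest first); past_obs holds the recorded deaths
    for the same days.  Uses the log-scale absolute-error loss.
    """
    n = len(past_obs)
    scores = []
    for preds in past_preds:
        s = 0.0
        for j, (yhat, y) in enumerate(zip(preds, past_obs)):
            age = n - j  # t - i: 7 for the oldest day, 1 for the most recent
            loss = abs(math.log(1 + yhat) - math.log(1 + y))
            s += mu ** age * loss
        scores.append(math.exp(-s / c))
    total = sum(scores)
    return [s / total for s in scores]
```

A predictor whose recent predictions track the recorded counts more closely accumulates less decayed loss, and so receives a larger share of the (normalized) weight.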

Prediction Intervals via Conformal Inference
Accurate assessment of the uncertainty of forecasts is necessary to help determine how much emphasis to put on the predictions, for instance, when making policy decisions. As such, the next goal of this paper is to quantify the uncertainty of our predictions by creating prediction intervals. A common approach involves constructing (probabilistic) model-based confidence intervals, which rely heavily on the probabilistic assumptions made about the data. However, due to the highly dynamic nature of COVID-19, assumptions on the distributions of death and case rates are challenging to check. Moreover, such model-based prediction intervals are likely to be invalid when the underlying probability model does not hold to the desired extent. For instance, a recent study [35] reported that the 95% uncertainty (credible) intervals for state-level daily mortality predicted by the initial IHME model [22] had a coverage of a mere 27% to 51% of recorded death counts over March 29 to April 2. The authors of the IHME model noted this behavior and have since updated their uncertainty intervals so that they now provide more than 95% coverage (where coverage is defined below in equation (10a)). However, while the previous releases of the intervals were based on asymptotic confidence intervals, the IHME authors have not precisely described the methodology for their more recent intervals.

Maximum-absolute-Error Prediction Interval (MEPI)
We now introduce a generic method to construct prediction intervals for sequential data. In particular, we build on the ideas from conformal inference [44] and make use of the past errors made by a predictor to estimate the uncertainty for its future predictions.
To construct prediction intervals for county-level cumulative death counts caused by COVID-19, we calculate the largest (normalized absolute) error for the death count predictions generated over the past 5 days for the county of interest and use this value (the "maximum absolute error") to create an interval surrounding the future (e.g., tomorrow's) prediction. We call this interval the Maximum absolute Error Prediction Interval (MEPI).
Mathematically, let $y_t$ be the actual recorded cumulative deaths for day $t$, and let $\hat{y}_t$ denote the estimate of $y_t$ made $k$ days earlier, i.e., on day $t-k$, by a prediction algorithm. We call $\hat{y}_t$ the $k$-day-ahead prediction for day $t$. We define the normalized absolute error, $\Delta_t$, of the prediction $\hat{y}_t$ to be

$$\Delta_t = \frac{|y_t - \hat{y}_t|}{\hat{y}_t}.$$

We use this normalization so that $y_t$ is equal to either $\hat{y}_t(1 - \Delta_t)$ or $\hat{y}_t(1 + \Delta_t)$. The normalization addresses the fact that the counts are increasing over time, and thus the un-normalized errors $|y_t - \hat{y}_t|$ also tend to increase over time. The normalization ensures that the errors across time are comparable in magnitude, which is essential for the exchangeability of the errors (see Section 4.3).
To compute the $k$-day-ahead prediction interval for day $t+k$ on day $t$, we first compute the $k$-day-ahead prediction $\hat{y}_{t+k}$ using a CLEP. Next, we compute the normalized errors for the most recent 5 days, $\Delta_t, \Delta_{t-1}, \ldots, \Delta_{t-4}$ (5 days was chosen to balance the trade-off between coverage and length; see Appendix A.2 for more details). The largest of these normalized errors is then used to define the maximum absolute error prediction interval (MEPI) for the $k$-day-ahead prediction as follows:

$$\widehat{\text{PI}}_{t+k} = \Big[ \max\big\{ y_t,\; \hat{y}_{t+k}(1 - \Delta_{\max}) \big\},\; \hat{y}_{t+k}(1 + \Delta_{\max}) \Big], \qquad \text{where } \Delta_{\max} := \max\{\Delta_t, \Delta_{t-1}, \ldots, \Delta_{t-4}\}.$$

We construct these intervals separately for each county. The lower bound includes a maximum calculation to account for the fact that $y_t$ is a cumulative count, and thereby non-decreasing. This calculation ensures that the lower bound of the interval is not smaller than the last observed value.
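A sketch of the MEPI construction (function name ours), given the last observed count, the $k$-day-ahead CLEP prediction, and the $k$-day-ahead predictions and recorded counts for the most recent five days:

```python
def mepi(y_t, yhat_future, past_preds, past_obs):
    """Maximum-absolute-Error Prediction Interval.

    y_t: most recent recorded cumulative death count.
    yhat_future: the k-day-ahead CLEP prediction.
    past_preds, past_obs: k-day-ahead predictions and recorded counts
    for the most recent 5 days (used to compute normalized errors).
    """
    delta_max = max(abs(y - yhat) / yhat
                    for yhat, y in zip(past_preds, past_obs))
    # Cumulative counts are non-decreasing, so the lower bound cannot
    # fall below the last observed value.
    lower = max(y_t, yhat_future * (1 - delta_max))
    upper = yhat_future * (1 + delta_max)
    return lower, upper
```

The interval widens automatically when recent predictions were poor, and tightens when they tracked the recorded counts closely.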

Evaluation metrics
A good prediction interval should both contain the true value most of the time (have good coverage) and have a reasonable width/length. Indeed, one can trivially create very wide prediction intervals that always contain the target of interest. We thus consider two metrics to measure the performance of prediction intervals: coverage and normalized length.
Let $y_t$ denote a positive real-valued time-series of interest, which in this case is the target variable, COVID-19 deaths ($t$ denotes the time index). Let $\{\widehat{\text{PI}}_t = [a_t, b_t]\}$ denote the sequence of prediction intervals produced by an algorithm. The coverage of these prediction intervals over a specified period, $\text{Coverage}(T)$, denotes the fraction of days in the period for which the prediction interval contained the observed cumulative death count for that county. This notion of coverage for streaming data has been used extensively in prior works on conformal inference [44] and can be calculated for a given evaluation period $T$ (which we set to be April 11 to May 10) as follows:

$$\text{Coverage}(T) = \frac{1}{|T|} \sum_{t \in T} \mathbf{1}\{ y_t \in \widehat{\text{PI}}_t \}. \qquad (10a)$$

The average normalized length of the prediction intervals, $\text{NL}(T)$, is calculated as follows:

$$\text{NL}(T) = \frac{1}{|T|} \sum_{t \in T} \frac{b_t - a_t}{y_t}. \qquad (10b)$$

In practice, we replace the denominator $y_t$ in expression (10b) with $\max\{1, y_t\}$ to avoid possible division by 0. Importantly, the definitions of coverage (10a) and average normalized length (10b) are entirely data-driven and do not rely on any probabilistic or generative modeling assumptions.
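Both metrics are simple averages over the evaluation period; a minimal sketch (function name ours), including the $\max\{1, y_t\}$ guard:

```python
def coverage_and_length(intervals, obs):
    """Empirical coverage (10a) and average normalized length (10b) of
    a sequence of prediction intervals [(a_t, b_t)] against recorded
    deaths y_t, guarding against division by zero via max(1, y_t)."""
    n = len(obs)
    covered = sum(1 for (a, b), y in zip(intervals, obs) if a <= y <= b)
    total_nl = sum((b - a) / max(1, y) for (a, b), y in zip(intervals, obs))
    return covered / n, total_nl / n
```

Note that neither quantity references the model that produced the intervals; both are computed purely from the intervals and the recorded counts.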

Exchangeability of the errors
While the ideas behind MEPI are a special case of conformal prediction intervals [44, 43], there are some key differences. Where conformal inference uses the raw errors in predictions, MEPI uses the normalized errors; and where conformal inference uses a percentile (e.g., the 95th percentile) of the errors, MEPI uses the maximum. Furthermore, we only make use of the previous five days instead of the full sequence of errors. The reason behind these alternative choices is that the validity of prediction intervals constructed in this manner relies crucially on the assumption that the sequence of errors is exchangeable; our choices are designed to make this assumption more reasonable. Due to the dynamic nature of COVID-19, considering a longer period (e.g., substantially longer than five days) would make it less likely that the errors across the different days are exchangeable. Meanwhile, the normalization of the errors eliminates a potential source of non-exchangeability by removing the sequential growth of the errors resulting from the increasing nature of the counts themselves. Since we only use five time points to construct the interval, the 95th percentile can be rounded up to the maximum. Figure 3 provides empirical evidence that the exchangeability of the normalized residuals/errors under our 5-day window is indeed reasonable. We rank the errors $\{\Delta_{t+5}, \Delta_t, \Delta_{t-1}, \ldots, \Delta_{t-4}\}$ in increasing order, so that the largest error has a rank of 6. If the errors were exchangeable, then the rank of each error would have a uniform distribution on $\{1, 2, 3, 4, 5, 6\}$, and in particular a mean of 3.5. To check this numerically, we measure the ranks of the errors $\Delta_{t+k}$ and $\Delta_{t-j}$, $j = 0, \ldots, 4$, for each day $t$ between March 27 and April 10, and take an average. Figure 3 plots the results for each of the 6 worst-hit counties as well as 6 randomly selected counties (see Section 5.2 for more discussion of these counties).
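The rank diagnostic can be sketched as follows (function name ours): for each day we rank the six errors in increasing order and average each position's rank across days, expecting averages near 3.5 under exchangeability.

```python
def average_ranks(error_windows):
    """For each window of errors, rank them in increasing order
    (1 = smallest) and average each position's rank across windows.
    Exchangeable errors give average ranks near (k + 1) / 2, i.e.
    3.5 for windows of six errors."""
    k = len(error_windows[0])
    totals = [0.0] * k
    for window in error_windows:
        order = sorted(range(k), key=lambda i: window[i])
        for rank, idx in enumerate(order, start=1):
            totals[idx] += rank
    return [t / len(error_windows) for t in totals]
```

A position whose average rank drifts well above or below 3.5 (e.g., the newest error being systematically the largest) would signal a violation of exchangeability.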
Figure 3: We rank the errors $\{\Delta_{t+5}, \Delta_t, \Delta_{t-1}, \ldots, \Delta_{t-4}\}$ in increasing order so that the largest error has a rank of 6. If $\{\Delta_{t+5}, \Delta_t, \Delta_{t-1}, \ldots, \Delta_{t-4}\}$ are exchangeable for any day $t$, then the expected average rank for each of the six errors would be 3.5 (dashed black line).
See Appendix A.2 for more details on these choices and some theoretical guarantees for coverage.

Results
In this paper, we focus on short-term (up to 7 days) predictive accuracy. In this section, we first present and compare the results of our various predictors, and then further examine the best-performing predictor: the CLEP ensemble combining the expanded shared predictor and the separate linear predictors. Finally, we report the coverage and length of the MEPIs for this CLEP. Table 3 summarizes the mean absolute errors (MAEs) of our predictions for cumulative recorded deaths on both the raw and log scales. Each row in Table 3 corresponds to a single predictor, and we report different statistics of the errors made for $k$-day-ahead predictions for $k \in \{3, 5, 7\}$ over the period $t \in T = \{\text{March 22}, \ldots, \text{May 10}\}$. For a given day $t \in T$, let $C_t$ denote the collection of counties in the US that have at least 10 cumulative deaths, and let $\hat{y}_t^c$ and $y_t^c$ denote the predicted and recorded cumulative death counts of county $c \in C_t$ on day $t$. Then the raw- and log-scale MAEs on day $t$ are

$$\text{MAE}_{\text{raw}}(t) = \frac{1}{|C_t|} \sum_{c \in C_t} \big| y_t^c - \hat{y}_t^c \big|, \qquad \text{MAE}_{\log}(t) = \frac{1}{|C_t|} \sum_{c \in C_t} \big| \log(1 + y_t^c) - \log(1 + \hat{y}_t^c) \big|.$$
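As a sketch (names ours), the two daily error summaries over the counties with at least 10 recorded deaths can be computed as follows; here we reuse the same $\log(1 + \cdot)$ transform as the CLEP loss for the log-scale error:

```python
import math

def daily_mae(preds, obs, min_deaths=10):
    """Raw- and log-scale mean absolute errors for one day, over the
    counties with at least `min_deaths` recorded cumulative deaths.

    preds, obs: dicts mapping county -> predicted / recorded counts.
    """
    counties = [c for c in obs if obs[c] >= min_deaths]
    raw = sum(abs(obs[c] - preds[c]) for c in counties) / len(counties)
    log = sum(abs(math.log(1 + obs[c]) - math.log(1 + preds[c]))
              for c in counties) / len(counties)
    return raw, log
```

Repeating this for each day in the evaluation period yields the daily MAE series from which the reported percentiles (p10, median, p90) are taken.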

Empirical performance of various predictors
For any given predictor, we compute these errors for each day and report the 10th percentile (p10), 50th percentile (median), and 90th percentile (p90) values over the period $T$. From the table, we find that the CLEP ensemble that combines the expanded shared exponential predictor and the separate county linear predictors has the best overall performance.

Table 3: "p10", "median", and "p90" denote the 10th-percentile, median, and 90th-percentile of the 50 mean absolute errors computed daily from March 22, 2020, to May 10, 2020. "$k$-day-ahead prediction" ($k$ = 3, 5, 7) indicates predictions made $k$ days ahead of each day from March 22, 2020, to May 10, 2020. The smallest error in each column is displayed in bold.
In Figure 4, we compare the raw- and log-scale MAEs as a function of time over the past 50 days for the expanded shared exponential predictor, the separate county linear predictor, and the CLEP that combines the two. We find that the MAEs of the CLEP are often similar to, and usually slightly smaller than, the smaller of the two single predictors' MAEs. Putting together the results from Table 3 and Figure 4, we find that the adaptive combination used to build our ensemble predictor CLEP leverages the advantages of both the linear and exponential predictors and improves upon the MAE of either single predictor.
Figure 4 caption: Results are shown for the expanded shared exponential predictor (orange line), the separate county linear predictor (red line), and the CLEP that combines the two predictors (blue line). The exponential predictor performed well early in the outbreak (March through early April), while the linear predictor has done better since mid-April. CLEP adaptively combines these two predictors and obtains the best overall performance.

County-wise visualizations
We now provide several visualizations of our 5-day-ahead CLEP predictions (based on the best-performing CLEP combining the expanded shared and separate linear predictors) and the corresponding MEPIs for selected counties, for the period April 11 to May 10. Since there are over 3,000 counties in the United States, we present results for two sets of counties: in Figure 5, we show the worst-affected counties in the top panel (a) and a set of randomly selected counties in the bottom panel (b). The solid black line denotes the recorded death counts, the dashed blue line denotes the 5-day-ahead CLEP predictions, and the shaded blue region denotes the corresponding MEPI (prediction interval). The predictions and prediction intervals for a given day $t$ ($t$ = April 11, \ldots, May 10) are based on data up to day $t - 5$. Since the predictions are updated daily to reflect the latest trends in cumulative deaths for each county, we do not expect the predictions (the dashed blue lines) to be monotonically increasing over time. From Figure 5(a), we observe that, among the worst-affected counties, the CLEP appears to fit the data very well for Cook County, IL, and Wayne County, MI. However, our predictor greatly over-predicts the number of deaths in four NY counties on April 19, 20, and 21 (these predictions were made on April 14, 15, and 16, respectively). On April 14, New York City reported 3,778 additional deaths that had occurred since March 11 and had been classified as "probable". This change led to a sharp uptick in recorded cumulative deaths in Kings, Queens, Bronx, and New York County on April 14. Due to this reporting backlog, our CLEP predicted that the four counties were experiencing exponential growth when they were actually experiencing linear growth, and as a result, it greatly over-predicted the number of recorded deaths in the following days.
From Figure 5(b), we find that our predictors and MEPIs perform well for five of the six randomly selected counties (Broward County, FL; Dougherty County, GA; Monmouth County, NJ; Bergen County, NJ; and Oakland County, MI). However, the performance is slightly worse for Suffolk County, NY, due to a sudden uptick in cumulative deaths on May 5.

Empirical performance of MEPI
We now present the performance of our MEPIs at the county level for cumulative death counts, with respect to both coverage (10a) and average normalized length (10b) (see Section 4.2). Since the average performance may change with time, we report results for two time periods: the past month, $T_1$ = {April 11, April 12, \ldots, May 10}, and the past 15 days, $T_2$ = {April 26, \ldots, May 10}. We evaluate the 5-day-ahead MEPIs, i.e., $k = 5$ in equation (9a), constructed with the CLEP that combines the expanded shared and separate linear predictors. We summarize the results in Figure 6 and discuss details below.
Coverage: In panels (a) and (c) of Figure 6, we plot the histogram of observed coverage (10a) across all US counties for the recent month ($T_1$) and the recent 15 days ($T_2$), respectively. Panel (e) of Figure 6 shows the observed coverage for the 447 counties that had at least 10 deaths by May 1 (since our time period ends on May 10, we want at least 10 days with 10+ deaths); it is based on a county-specific period starting from either April 11 or the day the county reached 10 deaths (if that day falls after April 11) and ending on May 10. The median number of days since reaching 10 deaths is 27. From these plots, we observe that we achieve excellent coverage for the majority of counties.
The average observed coverage across all counties over the past month (April 11 to May 10) is 94.5%. There is not a large difference in coverage between the longer and shorter time periods (Figure 6(a) and (c)), but the coverage decreases slightly when restricting to counties with at least 10 deaths (the median observed coverage for these counties is 87%). This is expected: for counties with zero or very few deaths, the prediction intervals almost trivially achieve very good coverage.
For the counties with poor coverage, we show in Appendix A.1 that there is usually a sharp uptick in the number of recorded deaths on some day, possibly due to recording errors or backlogs of counts. Modeling these upticks and obtaining coverage for such events is beyond the scope of this paper.
Normalized length: In panels (b) and (d) of Figure 6, we plot the histogram of the observed average normalized length (10b) across US counties for the recent month ($T_1$) and the recent 15 days ($T_2$), respectively. Panel (f) covers the same counties as panel (e): those with at least 10 deaths in the relevant time period.
Recall that the normalized length is defined as the length of the MEPI divided by the recorded number of deaths (10b). More than 70% of US counties had recorded two or fewer COVID-19 deaths by May 1; for these counties, an average normalized length of 2 corresponds to an actual interval length of at most 4. It is thus not surprising that the average normalized length of the MEPI exceeds 2 for a non-trivial fraction of counties in panels (b) and (d).
When considering counties with at least 10 deaths in panel (f), the average normalized length over these (county-specific) periods is much smaller. The median of the average normalized length of these counties is 0.58. Overall, Figure 6 shows that our MEPIs provide a reasonable balance between coverage and length, especially when the cumulative counts are not too small.
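The two evaluation metrics above can be sketched as follows. This is a minimal illustration with hypothetical intervals and death counts, not the repository's actual evaluation code:

```python
def coverage(intervals, recorded):
    """Fraction of days on which the recorded count lies inside the MEPI
    (observed coverage, in the spirit of equation (10a))."""
    hits = sum(lo <= y <= hi for (lo, hi), y in zip(intervals, recorded))
    return hits / len(recorded)

def avg_normalized_length(intervals, recorded):
    """Average of (interval length / recorded count) over days
    (average normalized length, in the spirit of equation (10b))."""
    return sum((hi - lo) / y
               for (lo, hi), y in zip(intervals, recorded)) / len(recorded)

# Hypothetical 5-day-ahead intervals and recorded cumulative deaths
intervals = [(90, 130), (100, 140), (115, 150), (120, 160)]
recorded = [100, 120, 152, 140]
print(coverage(intervals, recorded))  # 3 of 4 days covered -> 0.75
```

Note that the normalization by the recorded count means a wide interval is penalized less in a county with many deaths, which matches the discussion of small-count counties above.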

Related Work
Several recent works have tried to predict the number of cases and deaths related to COVID-19. Even more recently, the Centers for Disease Control and Prevention (CDC) has started aggregating results from several models. But to the best of our knowledge, all of these predictions are made either at the country level or at the state level; none of these works predict deaths and cases at the county level.
Figure 6 caption (continued): For the bottom two panels (e and f), we only include counties that had at least 10 deaths by May 1, and compute the histograms over county-specific periods that include only those days on which a county's cumulative death count is at least 10. Overall, these plots suggest that the MEPI intervals provide high coverage while remaining reasonably narrow. Refer to the text for further discussion.
In addition, directly comparing other models' forecasting results to our own is difficult for several reasons: (1) the predictors mostly make strong assumptions and typically do not involve data fitting, (2) we do not have access to a direct implementation of their predictors (or results), and (3) their predictors focus on substantially longer time horizons. Keeping these points in mind, we now provide a brief summary of recent work on predictive modeling for COVID-19. Two recent works, [36] and [28], model death counts at the state level in the US. Earlier versions of the Institute for Health Metrics and Evaluation (IHME) model [36] were based on Farr's Law with feedback from Wuhan data; the Imperial College predictor [28] uses an individual-based simulation with parameters chosen based on prior knowledge. On the topic of Farr's Law, we note that a 1990 paper [24] used it to predict that the AIDS pandemic would diminish by the mid-1990s, with around 200,000 total cases over the entire lifetime of the pandemic. It is now estimated that 32 million people have died from the disease so far. While the AIDS pandemic is very different from the COVID-19 pandemic, this historical performance is still worth keeping in mind.
Another approach uses exponential smoothing from time-series predictors to estimate day-level COVID-19 cases [26]. In addition, several works use compartment epidemiological predictors such as SIR, SEIR, and SIRD [27,39,23] to provide simulations at the national level. Other works [38,32] simulate the effect of social distancing policies either in the future for the US, or in a retrospective manner for China. Finally, several papers estimate epidemiological parameters retrospectively based on data from China [45,34].
Impact: a hospital-level severity index for distributing medical supplies
Together with the non-profit Response4Life, our models have been used to determine which hospitals are most urgently in need of medical supplies, and have subsequently been directly involved in the distribution of medical supplies across the country. To do this, we translate our forecasts into the COVID pandemic severity index, a simple measure of COVID-19 outbreak severity for each hospital. To generate this hospital-level severity index, we divide the total county-level deaths among all of the hospitals in the county in proportion to their numbers of employees. Next, for each hospital, we compute its percentile among all US hospitals with respect to total deaths so far, and also with respect to predicted new deaths over the next seven days. These two percentiles are then averaged to obtain a single score for each hospital. Finally, this score is quantized evenly into three categories: low, medium, and high severity. Evaluation and refinement of this index are ongoing as more hospital-level data become available.
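A minimal sketch of this pipeline is below. The function names, the nearest-rank percentile convention, and the even tertile cut-offs are our assumptions for illustration, not the exact implementation:

```python
from bisect import bisect_right

def allocate_county_deaths(county_deaths, employees_by_hospital):
    """Split county-level deaths among hospitals in proportion to employees."""
    total = sum(employees_by_hospital.values())
    return {h: county_deaths * e / total
            for h, e in employees_by_hospital.items()}

def percentile_of(value, population):
    """Percentile rank (0-100) of `value` within `population`."""
    ranked = sorted(population)
    return 100.0 * bisect_right(ranked, value) / len(ranked)

def severity_index(total_deaths, predicted_new_deaths,
                   all_total_deaths, all_predicted_new_deaths):
    """Average the two percentile ranks, then quantize evenly into three bands."""
    score = 0.5 * (percentile_of(total_deaths, all_total_deaths)
                   + percentile_of(predicted_new_deaths,
                                   all_predicted_new_deaths))
    if score < 100 / 3:
        return "low"
    if score < 200 / 3:
        return "medium"
    return "high"
```

For example, a county with 30 recorded deaths and two hospitals with 200 and 100 employees would be assigned 20 and 10 deaths, respectively, before the percentile step.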

Conclusion
In this paper, we made three key contributions. We (1) introduced a data repository containing COVID-19-related information from a variety of public sources, (2) used these data to develop predictors for short-term forecasting at the county level, and (3) introduced a novel yet simple method for producing prediction intervals for these predictors. By focusing on county-level predictions, our forecasts are at a finer geographic resolution than those from other relevant studies. By comparing our predictions to real observed data, we found that our predictions are accurate and that our prediction intervals are reasonably narrow while providing good coverage. We hope that these results will be useful for individuals, businesses, and policymakers in planning for and coping with the COVID-19 pandemic, and that our data repository and forecasting and interval methodology will be useful for academic purposes. Indeed, our results are already being used to determine hospital-level need for medical supplies and have directly influenced the distribution of these supplies.
Our data repository will be useful both for educational purposes, as well as for other teams interested in analyzing the data underlying COVID-19 pandemic. Our MEPI methodology can be applied to other models for COVID-19 forecasting, as well as to time series analysis more broadly.

A Further discussion on MEPI
We now provide a more detailed discussion on a couple of aspects of MEPI.

A.1 Counties with poor coverage
While Figure 6 shows that the MEPI intervals achieve higher than 83% coverage for the vast majority of counties over the April 11-May 10 period, there are also counties with coverage below the targeted level. We provide a brief investigation of the counties for which the coverage of the MEPIs for cumulative death counts is below 0.8 in Figure 6(a). Among these counties, Figure 7 plots the cumulative deaths from April 11 to May 10 for the 24 worst-affected counties. For most of the counties shown there, there is a sharp uptick in the number of recorded deaths on some day.

A.2 MEPI vs conformal inference
Recall that the MEPI (9a) can be viewed as a special case of a conformal prediction interval [44,43]. We now provide further discussion of this connection and discuss under what assumptions the MEPI would achieve good coverage.
A general recipe in conformal inference with streaming data is to compute the past several errors of the prediction model and use an $s$-th percentile value for some suitable $s$ (e.g., $s = 95$) to construct the prediction interval for the future observation. At a high level, conformal prediction intervals rely simply on the assumption that the sequence of errors is exchangeable. Roughly speaking, the proof proceeds as follows: the exchangeability of the residuals ensures that the rank of the future residual is uniform in the pool of residuals. Hence the probability that the future residual lies above the $s$-th percentile is no larger than $(100 - s)\%$, thereby yielding the promised $s\%$ coverage. For more details, we refer the reader to the excellent tutorial [43] and the book [44].
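This general recipe can be sketched as follows, using a nearest-rank percentile convention (the exact convention varies across implementations, so this choice is an assumption):

```python
import math

def conformal_interval(point_pred, past_errors, s=95):
    """Symmetric prediction interval whose radius is the s-th percentile
    of past absolute errors (nearest-rank convention)."""
    ranked = sorted(past_errors)
    # nearest-rank index of the s-th percentile, clamped to valid range
    idx = max(0, min(len(ranked) - 1, math.ceil(s / 100 * len(ranked)) - 1))
    radius = ranked[idx]
    return point_pred - radius, point_pred + radius

# With 100 past absolute errors equal to 1, ..., 100, the 95th percentile is 95
lo, hi = conformal_interval(500.0, list(range(1, 101)), s=95)
print(lo, hi)  # 405.0 595.0
```

Under exchangeability, the future error exceeds this radius with probability at most about 5%, which is what delivers the 95% coverage described above.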
The MEPI scheme deviates from the general conformal recipe in two ways: we compute a maximum over the past 5 days of the normalized errors (instead of a 95th-percentile threshold over the entire past of the unnormalized errors). Loosely speaking, both of our choices, normalized errors and looking only at the past 5 errors, aim to make the errors more exchangeable. Given the dynamic nature of COVID-19 and our adaptive predictors, the prediction errors do not remain exchangeable over a long period. Moreover, given that we use only 5 data points to bound the future error, computing a maximum over them is a more conservative choice (e.g., compared to taking a median or a percentile-based cut-off).
Normalized vs unnormalized errors: We now provide some numerical evidence to support our choice of normalized errors over $\ell_1$ errors when defining the MEPI. Figure 8(a) shows the (unnormalized) $\ell_1$ errors $|\hat{y}_t - y_t|$ of our CLEP for the six worst-affected counties over the same period as Figure 3 (March 27-April 10). We find that in all of these counties, the $\ell_1$ errors on days $t-4, t-3, t-2, t-1, t$, and $t+5$ do not appear to be exchangeable. Recall that under exchangeability, the expected average rank of each of these six $\ell_1$ errors would be 3.5. However, for all six counties, the average rank of the absolute error on day $t+5$ is larger than 4. This indicates that the future absolute error tends to be larger than the past errors, and that using the $\ell_1$ error $|\hat{y}_t - y_t|$ in place of the normalized error $\Delta_t$ can lead to substantial underestimation of future prediction uncertainty.
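The average-rank diagnostic used here can be illustrated with a small simulation: when the six errors are exchangeable (here, i.i.d. continuous draws), the average rank of the future error concentrates around 3.5. This is an illustrative sketch, not the paper's code:

```python
import random

def rank_of_future(errors):
    """Rank (1 = smallest) of the last element among all elements."""
    future = errors[-1]
    return 1 + sum(e < future for e in errors)

# Under exchangeability, each of the six ranks is equally likely, so the
# expected rank of the future error is (6 + 1) / 2 = 3.5.
random.seed(0)
reps = 20000
avg = sum(rank_of_future([random.random() for _ in range(6)])
          for _ in range(reps)) / reps
print(avg)  # close to 3.5
```

An average rank of the day-(t+5) error persistently above 4, as observed for the unnormalized errors, is therefore evidence against exchangeability.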
Longer time window: In Figure 8(b), we show the rank distribution of normalized errors over a longer window of 10 days. We found that due to the highly dynamic nature of COVID-19, these errors appear to be less exchangeable. Under exchangeability conditions, the expected average rank of each of these 11 errors would be 6. However, we found that the average rank substantially deviates from this expected value for many days in this longer window for all displayed counties.
Overall, we believe that putting together the observations from Figures 3 and 8 yield reasonable justification for the two choices we made to define MEPI (9a), namely, the 5-day window (versus the entire past), and the choice of normalized errors (versus the unnormalized absolute errors).
Theoretical guarantees for coverage: In order to obtain a rough baseline coverage for the MEPIs, we now mimic some of the theoretical computations from the conformal literature. For a given county, a fixed time $t$, and parameter $k$, if the six errors $E_t = \{\Delta_{t+k}, \Delta_t, \Delta_{t-1}, \Delta_{t-2}, \Delta_{t-3}, \Delta_{t-4}\}$ are exchangeable, then we have
\[
\mathbb{P}\left(y_{t+k} \in \mathrm{PI}_{t+k}\right) = \mathbb{P}\left(\Delta_{t+k} < \Delta_{\max}\right) = 1 - \mathbb{P}\left(\Delta_{t+k} = \Delta_{\max}\right) = \frac{5}{6} \approx 0.83. \tag{12}
\]
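The 5/6 baseline can be checked with a quick simulation under exchangeability (here, i.i.d. continuous errors, so ties have probability zero); again, this is an illustrative sketch:

```python
import random

def mepi_covers(errors):
    """True if the future (last) error is at most the max of the past five,
    i.e., the observation falls inside the MEPI."""
    *past, future = errors
    return future <= max(past)

random.seed(1)
trials = 50000
hits = sum(mepi_covers([random.random() for _ in range(6)])
           for _ in range(trials))
print(hits / trials)  # close to 5/6, about 0.833
```

Coverage fails exactly when the future error is the strict maximum of the six, which under exchangeability happens with probability 1/6.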
Thus, we may expect that Coverage(T) ≈ 83% holds for large |T|, where the coverage was defined in equation (10a). However, we now elaborate on why theoretically establishing that the MEPI achieves this coverage remains a challenging task. On the one hand, the probability in equation (12) is taken over the randomness in the errors, with the time index $t + k$ held fixed. This observation, in conjunction with the law of large numbers, implies the following: over multiple independent runs of the time series, for a given county and a given time $t + k$, the fraction of runs for which the MEPI $\mathrm{PI}_{t+k}$ contains the observed value $y_{t+k}$ converges to 5/6 as the number of runs goes to infinity. However, analyzing such a fraction over several independent runs of the COVID-19 outbreak is not relevant for our work.
On the other hand, the evaluation metric we consider is the average coverage of the MEPI over a single run of the time series; cf. the definition (10a) of Coverage(T). Thus, we require an online version of the law of large numbers in order to guarantee that Coverage(T) → 83% as |T| → ∞. Such a law of large numbers, established in prior work [43], has been crucial for establishing theoretical guarantees in conformal inference. In our case, this law (Proposition 1, Section 3.4 [43]) guarantees that, when the entire sequence of errors {Δ_t, t ∈ T} for a given county is exchangeable, the corresponding Coverage(T) ≈ 83% when the period T is large. Unfortunately, such an assumption is both hard to check and unlikely to hold for the prediction errors obtained from the CLEP for COVID-19 cumulative death counts.
Despite the challenges listed above, our results in Section 5.3 showed that MEPIs with CLEP achieved good coverage with narrow widths.