Estimating the number of SARS-CoV-2 infections and the impact of mitigation policies in the United States

johndrow@wharton.upenn.edu
Abstract. Knowledge of the number of individuals who have been infected with the novel coronavirus SARS-CoV-2, and of the extent to which mitigation by executive order has been effective at limiting its spread, is critical for effective policy going forward. Directly assessing prevalence and policy effects is complicated by the fact that case counts are unreliable. In this paper, we present a model for using death-only data (in our opinion, the most stable and reliable source of COVID-19 information) to estimate the underlying epidemic curves. Our model links observed deaths to an SIR model of disease spread via a likelihood that accounts for the lag in time from infection to death and the infection fatality rate. We present estimates of the extent to which confirmed cases in the United States undercount the true number of infections, and analyze how effective social distancing orders have been at mitigating or suppressing the virus. We provide analysis for four states with significant epidemics: California, Florida, New York, and Washington.


Introduction
The coronavirus SARS-CoV-2, which causes the disease COVID-19, has already changed the lives of billions of people globally. Many people have been ordered to stay in their homes, and economies worldwide have largely come to a halt. Thousands have died and at least several million have been infected. In the United States, social distancing policies began being implemented in mid-March, as states such as Washington, California, and New York saw a sharp rise in the number of hospitalizations attributable to the virus. Government response has been hindered by an insufficient supply of materials needed for testing, and by the large proportion of infected individuals who are asymptomatic and therefore are unlikely to seek testing even if it is available. These factors make it difficult to know the true size of the infected population. Because effective surveillance has not been possible, policymakers have instead turned to social distancing policies as the best available tool to slow the spread of the virus.
Here, we seek to address two key questions: (1) How many people are actually infected or have ever been infected with SARS-CoV-2? and (2) Are the social distancing policies currently in place effective at suppressing the virus? That is, can they be expected to lead to a decrease in the number of infections? Effectively addressing these questions requires innovative modeling due to severe limitations in commonly used sources of data for tracking the spread of the virus (Angelopoulos et al., 2020). Ideally, to address these questions we would use data on the number of confirmed cases to understand the prevalence of the disease and assess policy measures. However, we view confirmed case counts for COVID-19 as unreliable and ill-suited to this type of analysis for a number of reasons. Media reports have made clear that testing is more available in some regions than others, and so case counts are primarily an indication of where testing is most comprehensive. In fact, there may be as much as a 20-fold difference in case detection between countries, leading to incomparable numbers (Golding et al., 2020).
Just Accepted
Evidence for the insufficiency of case counts within the United States can also be seen in the large observed difference across states in test-positive rates, that is, the proportion of positives among all tests performed. For example, as of May 17, 2020, California has conducted 1.24 million tests and enumerated about 80,000 confirmed cases, for a test-positive rate of about 6.5%. By contrast, New York has conducted 1.41 million tests and enumerated about 350,000 confirmed cases, for a test-positive rate of close to 25%. For case counts to represent an accurate count of the number of infections in both locations, one would need to believe that testing policies and test-seeking behavior in New York are much more likely to identify individuals who have the disease while simultaneously not missing infected people at a higher rate than in California, a highly dubious prospect. One way such large discrepancies in test-positive rates could arise is through random testing. If people were tested at random in both states, we would expect to see higher test-positive rates in locations with higher prevalence, such as New York. While the prevalence likely does vary between the states, testing is not administered to a random sample of people, apart from limited designed experiments accounting for a tiny fraction of the total number of tests. Instead, tests are administered mainly to symptomatic individuals and people who seek testing. The most likely explanation for this disparity in test-positive rates is therefore that in locations like New York, where the epidemic is large, the number of people infected has persistently outstripped the capacity to test, while in California, where the epidemic is much smaller, this is less true. Thus, the case counts likely do not accurately reflect even the relative size of the infected population across states. It is even more questionable that they could accurately reflect the absolute size of the infected population across states.
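The test-positive rates quoted above follow directly from the reported counts. The short sketch below reproduces that arithmetic; the dictionary and variable names are ours and purely illustrative, and the inputs are the rounded figures given in the text rather than an authoritative data source.

```python
# Rounded figures quoted in the text (as of May 17, 2020).
tests_performed = {"California": 1_240_000, "New York": 1_410_000}
confirmed_cases = {"California": 80_000, "New York": 350_000}

# Test-positive rate = confirmed positives / tests performed.
positive_rate = {
    state: confirmed_cases[state] / tests_performed[state]
    for state in tests_performed
}

for state, rate in positive_rate.items():
    print(f"{state}: {rate:.1%}")

# Ratio of the two rates, to show how far apart the states are.
rate_ratio = positive_rate["New York"] / positive_rate["California"]
print(f"New York / California ratio: {rate_ratio:.1f}")
```

The roughly fourfold gap in positivity is what the argument above turns on: under non-random testing it says more about testing capacity relative to epidemic size than about prevalence itself.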
Furthermore, even within a single administrative boundary, case counts from earlier in the epidemic cannot meaningfully be compared with more recent case counts. Early in the epidemic, tests were administered only to people meeting strict criteria that likely excluded many sick individuals (Johnson & McGinley, 2020). In some cases, these criteria were chosen due to shortages of testing resources, such as reagents (Cavitt, 2020). Over time, shortages of key components of testing have slowly resolved, and the administrative rules and guidelines have themselves changed. For example, according to the Centers for Disease Control's website, several revisions to the official guidance on which patients should be tested have been made, including a March 4, 2020 revision which modified the criteria for testing to expand the pool of eligible people ("Overview of Testing for SARS-CoV-2", 2020). Thus, some of the trends in cases we observe over time within a given location are likely attributable to changes in test availability and criteria for test eligibility rather than purely to changes in prevalence of the disease. This makes case data of questionable value for modeling, even if model results are not compared across different locations.
Another, potentially more reliable data source is hospitalizations. Unfortunately, these data have become universally available only fairly recently. For example, the Johns Hopkins COVID-19 database provides hospitalization data by state beginning only on April 14, 2020, more than two months after the epidemic began in New York, Washington, and California. High quality data on the size of the epidemic in the early days are critical to fitting epidemiological models, which tend to be sensitive to initial conditions, and thus a partial time series lacks information that is critical for obtaining the best estimates. Moreover, even if the data as they currently exist had been available since the beginning of the outbreak, they measure total hospitalizations by day. Without supplementary information on the length of stay for each patient, it is not possible to calculate the number of new hospital admissions over time, which is much more useful from the perspective of modeling the total number of infected individuals. Thus, while these data could be useful as an additional data source, particularly in the future as the length of the available time series grows, at the moment they are not adequate for our purposes.
This brings us to data on deaths attributable to COVID-19. Death data have been recorded since the earliest days of the epidemic and are universally available across states. Because gravely ill patients who die from severe disease are more likely than the average infected person to be hospitalized and tested, we believe death data are the most complete and representative data source available on a timely basis that could be used to estimate the number of SARS-CoV-2 infections. Even so, death data are imperfect. Excess mortality calculations suggest that the death-only data miss some people who died of COVID-19 but whose cause of death was not listed as such on their death certificates (Weinberger et al., 2020). Why, then, not use excess mortality data? Excess mortality estimates offer an invaluable approach to evaluating the disease's overall impact. However, excess mortality almost certainly over-counts actual deaths directly attributable to COVID-19, since some of the excess mortality is a result of people with other health conditions avoiding hospitals and clinics, cancellation of procedures to reduce hospital census, and so forth. Such second-order effects of the disease do not fit into traditional models of the spread of infectious disease, which are built around the assumption that people who die from the disease were at some point themselves infectious. Thus, while approaches based on excess mortality are incredibly useful for revealing the total effect of the epidemic on mortality, we find them less suitable for fitting models of the spread of the virus. As a result, we fit our model to the data that we believe are best suited for modeling the spread of the virus and have a reasonable chance of being accurate or, more precisely, are likely the least inaccurate of the available measures of the extent of COVID-19 infections: death data.
Notably, several other influential COVID-19 modeling efforts have come to the same conclusion and have turned to death data to fit or calibrate their models (Altieri et al., 2020; Ferguson et al., 2020; Flaxman et al., 2020; Golding et al., 2020).
1.2. Our modeling approach. A common approach in the statistical epidemiological literature focuses on fitting or calibrating ordinary differential equation (ODE) models to observed data, and this underpins our approach as well (Hethcote, 2000). However, because we build our model using death-only data, we propose modifications to the standard approaches that rely on case data. In order to do this, the observed deaths need to be linked to the underlying state variables of the ODE model via a sensible, scientifically motivated likelihood. We do this by means of a distribution for the time from infection to death, and an assumption about the infection fatality rate (IFR), that is, the proportion of infected individuals who will eventually die of COVID-19. Using death-only data in the early stage of the epidemic, these parameters and the parameters of the underlying ODE model are not separately identifiable. For this reason, the time to death distribution and IFR are assumptions of our model.
There exist external estimates of the time to death distribution based on high quality data. For the IFR, precisely because of the difficulty outlined above in establishing the true number of infections, there remains considerable uncertainty about its true value. Because the IFR is such an important assumption of our model, we perform analysis for five scenarios with different values of the IFR that span the (relatively small) range of plausible values. Our model assumes that the IFR is constant over the time period considered. While we believe this is a useful approximation to the truth, in reality there may be drift over time. For example, as physicians and researchers gain more insight into the most efficacious treatments for the disease, outcomes for seriously ill patients may improve over time, leading to a lower rate of death given infection later in the epidemic. It has also been speculated that the demographics of who becomes infected may change the IFR. However, because our focus is on the first months of the pandemic, during which we believe the IFR was likely close to constant, we do not incorporate a time-varying IFR in our model.
Conditional on these assumptions and a sampling model for deaths given infections, we fit the parameters of an underlying ODE model of epidemic dynamics to the observed death data. We emphasize that our analysis should be read as 'if the infection fatality rate is X, then the following would be true,' rather than primarily as advocating for a particular value of the IFR over others.
As a side effect of fitting epidemic curves and evaluating the impact of interventions, our model allows us to make projections for the likely trajectory of infections and deaths. However, exact projection is not the primary goal of this work. A model focused more on precise predictions would likely include location-specific measures of containment and treatment efficacy, as well as age- and comorbidity-specific infection fatality rates. In order to minimize the prediction error, it would also account for reporting artifacts, such as delays in recording of deaths in quickly updated records over weekends or holidays, and for increased mortality if the sick overwhelm hospital systems. It would also emphasize finding external data sources that provide strong leading indicators of future deaths. This kind of information could be included in a more prediction-optimized version of the model. Our focus here is to outline a modeling approach that uses minimal but relatively reliable data, is described by a likelihood and priors that incorporate our understanding of the data generating process, is fitted to data, and is underpinned by a widely used epidemiological model designed to approximate the real dynamics of disease spread. It also turns out that this model predicts reasonably well over a two-week time horizon, the only prediction that we assess here.
The remainder of this paper is organized as follows. In section 2 we review related work. In section 3 we introduce our model and explain how the model can be used to infer the number of infections and the effect of executive orders for social distancing. In section 4 we describe an MCMC algorithm for fitting our model. In section 5 we give results. In section 6 we conduct a sensitivity analysis. Section 7 concludes.

Related Work
Several previous studies have attempted to model the dynamics of the pandemic in various geographic locations and with varying goals. Some of these studies have provided an analysis or estimate of the amount by which confirmed cases undercount the true number of infections. For example, R.  propose that in the first month of the epidemic in China, 82-90% of infections were undocumented. Riou et al. (2020) use an SEIR model, an epidemiological ODE model, and calibrate it to the time series of reported deaths and reported infections. By modeling the underreporting of symptomatic cases, and by assuming that approximately half of infections lead to symptomatic cases, they estimate the infected population in Hubei, finding that approximately 30% of infections were documented. Using Chinese data and initial reports in the US, Perkins et al. (2020) estimate directly that more than 90% of infections in the US have gone undocumented by tests. The first wave of a random sampling design in Indiana, conducted by researchers at Indiana University and the Indiana State Department of Health, concluded in a preliminary report that the number of positive tests undercounted the number of people ever infected by a factor of 9.6 (Menachemi et al., 2020). Ferguson et al. (2020) model the effect of transmission between susceptible and infectious individuals using a microsimulation model built on synthetic populations designed to mimic the populations of the United Kingdom and United States. They assume a fixed time-to-onset and a range of $R_0$ values from 2.0 to 2.6, and they assume symptomatic cases to be 50% more infectious than asymptomatic cases. Having calibrated their model to the cumulative number of deaths, they estimate deaths and hospital loads under different non-pharmaceutical interventions involving social distancing and isolation.
This analysis was arguably the most influential in triggering the adoption of social distancing policies in the United States, as it indicated that without social distancing measures in place, there would likely be around 2.2 million deaths, and hospitals nationwide would be completely overwhelmed. They also predicted that social distancing measures could reduce the number of deaths substantially and prevent exceedance of hospital resources, but only if they were dynamically turned 'on and off' by triggering mechanisms based on the current number of COVID-19 patients. The report envisioned social distancing remaining in place for roughly 18 months, the amount of time they think it will take for a vaccine to become available.
The CHIME app (Weissman et al., 2020) is an online tool created by researchers at the University of Pennsylvania to help hospitals anticipate the number of incoming COVID-19 patients and their needs. The CHIME model uses the current number of COVID-19 hospitalizations to 'back out' the total number of cases based on a user provided hospitalization rate conditional on infection. Similar to our work, their model does not rely on the case counts. They make forward projections for the number of hospital admissions, ICU admissions, and ventilators needed over the coming weeks. They allow the user to specify the parameters of their underlying epidemiological model as inputs in terms of the doubling time for the infected population.
The model given in Murray (2020) has a goal similar to CHIME (hospital use planning). This model takes a different approach by fitting parametric curves to observed cumulative death rates. They use a hierarchical model on the parameters of the parametric curve. The model essentially projects that the future course of the cumulative death rate curve in the United States will follow a path similar to that observed in other locations that are farther along in the course of their epidemics. There is no underlying model of epidemic dynamics, but there is a sampling model for death rates, which allows them to give confidence intervals. The research team developing this model has updated the details of their estimation procedure several times since their model was made public. Instead of an arbitrary curve-fitting procedure, the current version as of writing fits an underlying SEIR model ("Main updates on IHME COVID-19 predictions since April 29, 2020", 2020). The New York Times online tool allows the user to specify inputs to understand how those inputs affect likely infections, hospital loads, and deaths; infections are a side-effect of the rest of the model. Flaxman et al. (2020) take a similar approach to ours to evaluate the effect of social distancing orders across countries in Europe. They define a likelihood for death-only data and incorporate mitigation efforts in the model. They build a hierarchical model to estimate the number of infections and effects of mitigation across countries in Europe. The overall approach is similar, but their analysis is done for Europe, whereas here we consider the United States. Lewnard et al. (2020) similarly evaluate the effect of mitigation efforts. They observe reductions in estimates of the effective reproduction number for patients in three hospital systems in Northern California, Southern California, and Washington State as a consequence of the implementation of non-pharmaceutical interventions, like social distancing. 
Song et al. (2020) describe a statistical software package that can be used to estimate the impact of quarantine protocols on the spread of COVID-19, and they apply their method to data from China.


Model
Let $\nu_t$ be the number of new infections on day $t$ of the epidemic, and let $p$ denote the infection fatality rate, i.e., the probability of death given infection. We denote the day of the first infection by $T_0$. Let $\theta = \{\theta_s : s = 0, 1, \ldots, m\}$ be the set of probabilities defining the discrete time-to-death distribution, where $\theta_s$ denotes the probability that, for those who die, death from COVID-19 occurs $s$ days after the initial infection. Let $X(t, t')$ denote the number of individuals newly infected on day $t$ who die on day $t'$. Our death model is
\[ X(t, t') \sim \mathrm{Poisson}(p\,\theta_{t'-t}\,\nu_t), \qquad D(r) = \sum_{t=r-m}^{r} X(t, r). \]
The observed deaths on day $r$ are thus given by the sum over all previous days of the number of individuals infected on that day who went on to die on day $r$. The distribution of $D(r)$ marginal of $X$ is given by
\[ D(r) \sim \mathrm{Poisson}\Big(p \sum_{t=r-m}^{r} \theta_{r-t}\,\nu_t\Big), \]
and for $r \neq s$, $D(r) \perp\!\!\!\perp D(s) \mid p, \theta, \nu$. This defines our likelihood.
The use of a Poisson distribution in specifying our model may seem unnatural compared to the specification $X(t, t') \sim \mathrm{Binomial}(\nu_t, p\,\theta_{t'-t})$. The Poisson specification allows $\nu_t$ to take real values as opposed to integer values, as would be required for a Binomial distribution. This allows us to use simpler, deterministic models for the underlying epidemiological curves defining the $\nu_t$, simplifying computation. Fortunately, in cases where $p\,\theta_s$ is small and $\nu$ is large (precisely the situation in which we find ourselves after the very early days of the epidemic), $\mathrm{Poisson}(p\,\theta_s\,\nu)$ is a good approximation to $\mathrm{Binomial}(\nu, p\,\theta_s)$.
The observed numbers of deaths $D$ are linked to a compartmental epidemiological model via the total number of newly infected individuals on day $t$, $\nu_t$. To generate realistic $\nu_t$ curves, we use a Susceptible-Infected-Removed (SIR) model of epidemic dynamics, with state evolution given by the following ODE:
\[ \frac{ds_t}{dt} = \begin{cases} -\beta s_t i_t, & t \le T_1 \\ -\phi\beta s_t i_t, & t > T_1 \end{cases} \qquad \frac{di_t}{dt} = \begin{cases} \beta s_t i_t - \gamma i_t, & t \le T_1 \\ \phi\beta s_t i_t - \eta\gamma i_t, & t > T_1 \end{cases} \qquad \frac{dr_t}{dt} = \begin{cases} \gamma i_t, & t \le T_1 \\ \eta\gamma i_t, & t > T_1 \end{cases} \]
where $s_t$, $i_t$, and $r_t$ are the proportions of the population of size $N$ that are susceptible, infected, and removed at time $t$, respectively, and $T_1$ is the time at which social distancing orders were issued.
The $\nu_t$ variables on which our likelihood is conditioned can be extracted directly from this model. The number of new infections that occurred during day $t$ is simply the difference in the size of the S compartment at times $t$ and $t + 1$ multiplied by $N$, that is, $\nu_t = N (s_t - s_{t+1})$. This is because, under this model, individuals may leave the susceptible compartment only by becoming infected, so the number of new infections that occurred in any time period is precisely the reduction in the number susceptible in that same period.
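As an illustration of the death likelihood described above, the following minimal sketch computes the Poisson mean of daily deaths as a discrete convolution of new infections with the time-to-death probabilities, scaled by the infection fatality rate, and then evaluates the log-likelihood. The function names, the toy inputs, and the numerical floor on the mean are our own assumptions for illustration; this is not the authors' implementation.

```python
import numpy as np
from math import lgamma

def expected_deaths(nu, theta, p):
    """Poisson mean of deaths per day: mu[r] = p * sum_s theta[s] * nu[r - s].

    nu is indexed from the first day of the epidemic, so infections before
    index 0 are taken to be zero.
    """
    nu = np.asarray(nu, dtype=float)
    theta = np.asarray(theta, dtype=float)
    return p * np.convolve(nu, theta)[: len(nu)]

def poisson_loglik(deaths, nu, theta, p):
    """Log-likelihood of observed daily deaths, independent across days."""
    d = np.asarray(deaths, dtype=float)
    # Floor the mean to avoid log(0) on days where no death is yet possible
    # (an illustrative numerical guard, not part of the model).
    mu = np.clip(expected_deaths(nu, theta, p), 1e-300, None)
    log_factorial = np.array([lgamma(k + 1.0) for k in d])
    return float(np.sum(d * np.log(mu) - mu - log_factorial))

# Toy inputs, chosen only to make the arithmetic easy to follow.
nu_toy = [100.0, 200.0, 400.0]   # new infections per day
theta_toy = [0.0, 0.5, 0.5]      # death 1 or 2 days after infection
print(expected_deaths(nu_toy, theta_toy, p=0.01))
print(poisson_loglik([0, 1, 1], nu_toy, theta_toy, p=0.01))
```

Because the mean depends on the infection fatality rate and the infection curve only through their product with the time-to-death probabilities, scaling one up and the other down leaves the likelihood unchanged, which is one way to see why these quantities are not separately identifiable from deaths alone.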

Up to time $T_1$, these equations represent a standard SIR model with parameters $\{\beta, \gamma\}$; post-mitigation, it is an SIR model with parameters $\{\phi\beta, \eta\gamma\}$. In an SIR model with parameters $\{\beta, \gamma\}$, infected individuals are all considered contagious, and the mean time between infection and removal (the end of the contagious period) is given by $\gamma^{-1}$. $R_0$ is given by $\beta\gamma^{-1}$; this is the number of people who are infected by a single infected individual in a population in which everyone is susceptible. The quantity $(1 - \phi) \in (0, 1)$ represents the proportional reduction in the rate of infection that occurred following implementation of mitigation policies; the quantity $\eta > 1$ represents the increase in the rate at which infected individuals cease to be contagious (whether by recovery, death, self-isolation, or medical quarantine) following the implementation of mitigation policies.
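To make the role of the mitigation parameters concrete, the sketch below integrates a proportion-scale SIR model whose transmission and removal rates switch at the mitigation time, and then recovers daily new infections as the population size times the daily drop in the susceptible proportion. The function names, the fixed-step Runge-Kutta integrator, the single initial infection, and the parameter values in the example are all assumptions made for illustration rather than the authors' settings.

```python
import numpy as np

def sir_rhs(state, beta, gamma):
    """Right-hand side of the SIR ODE on the proportion scale."""
    s, i, _ = state
    return np.array([-beta * s * i, beta * s * i - gamma * i, gamma * i])

def daily_new_infections(beta, gamma, phi, eta, t1, n_days, pop, steps_per_day=20):
    """Return nu[t] = pop * (s_t - s_{t+1}) for t = 0, ..., n_days - 1.

    Days are counted from the first infection; rates are (beta, gamma)
    before day t1 and (phi * beta, eta * gamma) from day t1 onward.
    """
    state = np.array([1.0 - 1.0 / pop, 1.0 / pop, 0.0])  # one initial infection
    h = 1.0 / steps_per_day
    s_by_day = [state[0]]
    for day in range(n_days):
        b, g = (beta, gamma) if day < t1 else (phi * beta, eta * gamma)
        for _ in range(steps_per_day):
            # Classical fourth-order Runge-Kutta step.
            k1 = sir_rhs(state, b, g)
            k2 = sir_rhs(state + 0.5 * h * k1, b, g)
            k3 = sir_rhs(state + 0.5 * h * k2, b, g)
            k4 = sir_rhs(state + h * k3, b, g)
            state = state + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
        s_by_day.append(state[0])
    s_by_day = np.array(s_by_day)
    return pop * (s_by_day[:-1] - s_by_day[1:])

# Hypothetical values: R_0 = 0.5 / 0.2 = 2.5 before mitigation.
nu_mitigated = daily_new_infections(0.5, 0.2, phi=0.3, eta=1.5, t1=40, n_days=120, pop=1e7)
nu_unmitigated = daily_new_infections(0.5, 0.2, phi=1.0, eta=1.0, t1=40, n_days=120, pop=1e7)
print(nu_mitigated.sum(), nu_unmitigated.sum())
```

Setting phi and eta to 1 recovers the standard SIR model, which gives a convenient baseline against which to compare a mitigated trajectory.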
The SIR model is a stylized mathematical model of the spread of infectious disease, and-like any model of a complex process-is an approximation of a more complicated reality. In particular, there are several ways in which its compartments only approximately represent distinct subpopulations in the real world. For example, although the R compartment is typically conceptualized as containing people who have been 'removed' from the population of infectious people due to recovery or death, reality is not so simple. Quarantine or hospitalization may largely 'remove' an individual from the population of infectious people because the rate of transmission for isolated people is substantially reduced; indeed, that is the point of quarantine. Although some models explicitly include a quarantine compartment (e.g., Song et al. (2020)), we allow the post-infection quarantines to be subsumed by the removed compartment. This more expansive definition of the R compartment justifies the increase in the rate of removal from γ to ηγ following the promulgation of mitigation policies. As awareness and fear of the virus grows, individuals become more likely to self-isolate or seek care when they begin to show symptoms. One further consequence of this simplification is that the number of individuals who die on the tth day is not directly comparable with the number of people who enter the R compartment during that period because many of the individuals who died at that time could have been effectively removed at an earlier time.
These limitations notwithstanding, we elect to use the SIR model to underpin our likelihood because it is among the simplest options that generate epidemic simulations incorporating the key facets of an epidemic. Specifically, the SIR model manifests herd immunity and rapid growth in the early phase, and is constrained to produce a finite number of infections (Hethcote, 2000). The main purpose of incorporating a compartmental model into our method is to have a realistic, low-dimensional generative model of the new infection curves, $\nu_t$, on which our likelihood is based. Because of its simplicity, ease of interpretability, and well-understood properties, we find the SIR model to be an appealing choice. However, the framework we have defined for creating a sensible likelihood-based model of death-only data could easily be extended to embed more elaborate compartmental models or any other generative model of new infections.
Figure 1 shows one realization of our SIR model. The top panel shows one draw of the daily number of deaths. Notice that this is not a smooth curve. When fitting our model, these $D$ values will be our data and will be the only thing we observe directly. The other panels are what we will infer from them. The middle panel of the figure shows the $\nu$ values, the number of daily new infections. The series of deaths has roughly the same shape as the series of new infections, though it lags it by about 25 days. This lag is due to the shape of the time from infection to death distribution $\theta$, which we describe below. The bottom panel gives the standard view of the SIR model, showing the total number of susceptible, infected, and removed at any given time point.

Figure 1. One realization of our modified SIR model. The top two panels are displayed on the scale of counts. The bottom panel, which shows our underlying SIR model, is given on the scale of proportion of the population.
3.1. Fixed model parameters. Due to identifiability constraints, we do not estimate all of the parameters of the model. Rather, we fix those parameters for which there exists relatively high quality information on reasonable values that is applicable in the context studied here. By and large, this includes those parameters pertaining more to the biology of the disease than to the social dynamics of its spread, which we expect to vary more widely across cultural and geographic contexts. These include $\theta$ and $p$.
For $\theta$, we draw from two studies. Zhou et al. (2020) report that in Wuhan, China, the time from symptom onset to death had a median of 18.5 days with an interquartile range of 15 to 22 days. A Gamma mixture of Poisson distributions (henceforth referred to as the 'Poisson-Gamma' distribution) with parameters $\{a, b\} = \{27.75, 1.5\}$ matches the reported quantiles well. Lauer et al. (2020) estimate the incubation period of the disease, also using data from China. They report a median incubation time of 5.1 days with an estimated 97.5th percentile of 11.5 days and a 99th percentile of 14 days. The quantiles of a Poisson-Gamma distribution with parameters $\{a, b\} = \{5.5, 1.1\}$ match these reported quantiles well. We compute numerically the mass function of the distribution of the total time from infection to death by generating 100,000 samples from the distribution of the incubation period and from the distribution of the time from symptom onset to death that were just described. The time to death is the sum of these two numbers. We truncate the maximum time from infection to death at the 99th percentile of the generated samples; we call this maximum time to death $m$. The parameter $\theta$ is a vector of length $m + 1$ whose entry $\theta_s$ gives the probability that death occurs $s$ days after infection, given that death occurs. The probability of death $s$ days after infection (not conditioning on death occurring) is given by $p\,\theta_s$. This time to death distribution is shown in Figure 2.
Because of its importance in the undercount analysis and the considerable uncertainty surrounding the value of the infection fatality rate $p$, most of our results are given for five different values of $p$. To set this range of values for $p$, we rely on several external data points. Russell et al. (2020) use data from individuals on the Diamond Princess cruise ship to estimate an infection fatality rate of 1.2% (95% CI: 0.38%-2.7%) after adjusting for delays between infection confirmation and death.
The ship was a closed population: we know who was on the ship and therefore who to test, so we have confidence in the denominator (all individuals infected with COVID-19 were identified because all people on the ship could be tested). We base the upper end of the range of values we consider for p on the upper end of the confidence interval given by Russell et al. (2020).
We base the lower bound of values considered on data from the United States. As of May 18, 2020, the New York City Health website reported 20,806 deaths for which the deceased tested positive for COVID-19 or COVID-19 was listed as the cause of death on their death certificate. Compared with an estimated population of about 8.4 million people, this gives an overall mortality rate of about 0.25%. Given that each day there are more reported COVID-19 deaths in New York City, and that it is unlikely that every person in New York City has already been infected, this offers a compelling lower bound on the infection fatality rate in the United States.
This range of values is consistent with other external sources of information. One high quality source is reported case fatality rates in countries that have done aggressive testing and contact tracing, including testing asymptomatic individuals. In these places, the case numbers may approach the true number of infections, so case fatality information in these locations provides a reasonably tight upper bound on infection fatality rates. Here, we look to South Korea, which has done the most testing per capita. As of a May 17, 2020 press release from the Korea Centers for Disease Control and Prevention, there were 11,050 confirmed positive cases and 262 deaths in South Korea. This gives a naive estimate of the case fatality rate of 2.4%.
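The two bounding figures discussed above, the New York City population mortality rate and the South Korean case fatality rate, are simple ratios of the quoted counts. The sketch below reproduces them; variable names are ours, and the inputs are the rounded numbers given in the text.

```python
# New York City, as quoted for May 18, 2020.
nyc_deaths = 20_806
nyc_population = 8_400_000  # approximate population used in the text

# South Korea, as quoted from the May 17, 2020 press release.
korea_cases = 11_050
korea_deaths = 262

# Deaths over total population: a lower bound on the IFR, since not
# everyone in the city has been infected.
nyc_mortality_rate = nyc_deaths / nyc_population

# Deaths over confirmed cases: a naive case fatality rate, which the text
# treats as an approximate upper bound on the IFR under intensive testing.
korea_case_fatality_rate = korea_deaths / korea_cases

print(f"NYC population mortality rate: {nyc_mortality_rate:.2%}")
print(f"South Korea naive CFR: {korea_case_fatality_rate:.1%}")
```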
For each of these data sources, differences in underlying age and income structure, comorbidities, and other risk factors could limit the generalizability of these estimates to the United States at large. For example, the age distribution on the Diamond Princess cruise ship skews considerably older than the US population in general, with about 58% of its passengers 60 years of age or older, while only about 16% of the US population is 62 years of age or older. The Diamond Princess IFR might be higher than the IFR for the US population in general, as older age is a known risk factor for COVID-19. However, the fact that the people on the Diamond Princess were able to travel may indicate that these people suffer fewer other comorbidities on average than their similarly aged US cohort. IFR estimates from other geographic regions, such as South Korea, may be limited by similar demographic and risk factor differences relative to the US. Additionally, the IFR is influenced by the quality and availability of medical care. Data from locations where medical care is unavailable or inadequate, or where the healthcare system has been overwhelmed by a surge of COVID-19 patients, may over-estimate the IFR relative to situations where quality care is available to anyone who needs it. Such factors are difficult to adjust for, since data on risk factors are preliminary and limited, and health care availability can change dramatically if the number of infections in an area explodes.
Notwithstanding the differences between the contexts from which we draw our estimates of p and the population to which they are applied, we believe that the comparison is still useful in providing a range of rough estimates. Indeed, there are many similarities as well. The Diamond Princess cruise ship had a large American population: about 400 of the 3,700 total passengers and crew on board were US citizens. South Korea's healthcare system has not been overwhelmed by COVID-19 cases. This is similar to the United States where, although hospitals in New York City were very full, health care providers have not reached a situation where they have had to ration ventilators.
With all of these qualifications, we cannot be certain that any given value of p is correct when applied in our context. We also do not think that it is possible to precisely estimate the IFR for this virus with the currently available data. However, we believe these external estimates are useful in providing a range of possible values. Thus, because the infection fatality rate is arguably the most important parameter in our model that cannot be learned from the data, we consider five different scenarios that cover the spectrum of plausible values put forth above: 0.2%, 0.5%, 1.0%, 1.5%, and 2.5%. We focus on the 1.0% case in some figures because we consider values of p in this range to be most likely.
3.2. Estimated model parameters. The parameters we estimate in our model are γ, β, φ, η, and $T_0$. We base our prior for the infectious period on previously published studies. Ferguson et al. (2020) give a mean generation time ($\gamma^{-1}$ in the SIR model) of 6.4 days; Prem et al. (2020) assume an infectious period of 3 or 7 days in their simulations, with an incubation period of about 6 days. While these are not directly comparable due to differences in the model (as described above, people can move to the R compartment while still infected with the virus), we use them as rough guidance. We choose the prior $\gamma^{-1} \sim \mathrm{Uniform}(1, 15)$, corresponding to an infectious period that ranges from about 1 to 15 days. The value of 1 day would be consistent with a situation in which most symptomatic people rapidly self-isolate and asymptomatic people are uncommon. Rather than specifying a prior on β, we choose a prior on $R_0 = \beta\gamma^{-1}$, which induces a prior on β. Using the reported confidence interval in Q.  to establish rough upper and lower bounds on $R_0$, we specify the prior $R_0 \sim \mathrm{Uniform}(1, 4)$. We place a uniform prior on $T_0$ between January 1, 2020 and February 20, 2020.
We place a Uniform(0.2, 0.9) prior on φ, allowing it to encompass both the possibility that mitigation policies had a significant effect (φ small) and the possibility that they had a very limited effect (φ large). This prior rules out the possibility that the mitigation policies in fact increased transmissibility. It is consistent with information from the University of Maryland COVID-19 Impact Analysis Platform, which calculates several social distancing-relevant metrics as a function of mobility data. In each of the states considered, the percentage of people not staying home according to these data was, after the executive orders, roughly 60-80% of what it was before them. We do not expect decreases in mobility to map perfectly to decreases in transmission, as other behaviors that likely changed in tandem with social distancing policies, such as mask wearing or not shaking hands, also have an effect. However, these data make clear that we should expect some effect, as there was a significant increase in the proportion of the population staying home. Furthermore, on the same basis, we believe that transmission was not completely eliminated, as a significant proportion of the population was still interacting outside of their homes.
Finally, we use a Uniform(1, 4) prior on η, allowing the rate at which people are removed from the infectious population to increase after the implementation of social distancing policies. A value of η substantially larger than 1 would indicate that after social distancing measures and other policies were promulgated, symptomatic people more rapidly self-isolated or otherwise "removed" themselves from human contact after becoming symptomatic. This would be consistent with increasing awareness of the virus due to persistent messaging from media and the authorities resulting in greater individual efforts to self-isolate. On the other hand, values near 1 would indicate that symptomatic people did not change their behavior much with regard to self-isolation after social distancing policies went into effect.
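As a rough illustration of how the priors above fit together, the following sketch draws parameter sets from them and recovers the induced prior on β from the identity R_0 = β/γ. The function name `sample_prior`, the dictionary layout, and the choice to draw T_0 as a whole calendar day are assumptions made for this example only; they are not taken from the authors' code.

```python
import numpy as np
from datetime import date, timedelta

T0_START = date(2020, 1, 1)
T0_END = date(2020, 2, 20)


def sample_prior(rng):
    """Draw one parameter set from the priors described in the text."""
    infectious_period = rng.uniform(1.0, 15.0)  # gamma^{-1} ~ Uniform(1, 15) days
    gamma = 1.0 / infectious_period
    r0 = rng.uniform(1.0, 4.0)                  # R_0 ~ Uniform(1, 4)
    beta = r0 * gamma                           # induced by R_0 = beta / gamma
    phi = rng.uniform(0.2, 0.9)                 # multiplier on transmissibility
    eta = rng.uniform(1.0, 4.0)                 # multiplier on the removal rate
    # Assumption for this sketch: T_0 is drawn as a whole calendar day.
    n_days = (T0_END - T0_START).days
    t0 = T0_START + timedelta(days=int(rng.integers(0, n_days + 1)))
    return {"gamma": gamma, "beta": beta, "r0": r0,
            "phi": phi, "eta": eta, "t0": t0}


rng = np.random.default_rng(0)
draws = [sample_prior(rng) for _ in range(1000)]
betas = np.array([d["beta"] for d in draws])
print("induced prior on beta spans roughly", betas.min(), "to", betas.max())
```

Because β is never sampled directly, its implied range (between 1/15 and 4 per day) follows entirely from the two uniform priors on R_0 and the infectious period.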

Computation
We carry out computation by MCMC using the adaptive Metropolis algorithm (Haario et al., 2001). The algorithm produces samples from the Bayesian posterior distribution of the parameters of our model, $\xi = \{\beta, \gamma, T_0, \phi, \eta\}$. Specifically, we update ξ by proposing a new set of parameters $\xi^*$ from $N(\xi, \Sigma)$, with the time-inhomogeneous covariance Σ computed using the method of Haario et al. (2001). In brief, during a pre-adaptation phase, proposals are made from the usual multivariate Gaussian random walk, i.e.,

$$\xi^* \sim N(\xi, c_0 \Sigma_0), \tag{4.1}$$

where $\Sigma_0$ is an initial estimate of the posterior covariance and $c_0$ is a scalar tuning parameter. After adaptation begins, the proposal becomes

$$\xi^*_t \sim N(\xi, c_1 \Sigma_t), \tag{4.2}$$

where $\Sigma_t$ is the usual time-averaging estimate of the posterior covariance based on the Markov path up until step $t - 1$. For details, see Haario et al. (2001).

Once a new state $\xi^*$ is proposed, we solve the SIR model with these parameters using the Runge-Kutta (4,5) method. We then calculate the sequence $\nu^*_t$ from the $s_t$ sequence. We accept or reject the proposed $\xi^*$ using the usual Metropolis acceptance probability, with target density proportional to

$$\Big[\prod_t p(D(t) \mid \nu)\Big]\, \pi(\xi), \tag{4.3}$$

where $p(D(t) \mid \nu)$ is the Poisson probability mass function implied by (3.3), and $\pi(\cdot)$ represents the prior density. Notice that, because ν is a deterministic function of ξ, both terms in (4.3) depend on the values of these parameters.

To obtain the initial value of the proposal covariance $\Sigma_0$, we compute an estimate of the sampling covariance of the maximum likelihood estimator using a parametric bootstrap. We tune the constants $c_0$ and $c_1$ to obtain acceptance probabilities of approximately 0.2 to 0.4, consistent with the guidance of Roberts and Rosenthal (2001). We run for 100,000 iterations, begin adaptation after 20,000 iterations, and use 50,000 iterations of burn-in.
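The two-phase proposal scheme described above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function name `adaptive_metropolis`, the default c_1 = 2.4²/d (a commonly cited scaling for this algorithm), the small diagonal jitter, and the toy two-dimensional Gaussian target standing in for the SIR posterior are all assumptions of this example.

```python
import numpy as np


def adaptive_metropolis(log_target, x0, sigma0, n_iter, adapt_start,
                        c0=1.0, c1=None, jitter=1e-8, seed=0):
    """Random-walk Metropolis whose proposal covariance adapts to the path."""
    if adapt_start < 2:
        raise ValueError("adapt_start must be at least 2")
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    d = x.size
    if c1 is None:
        c1 = 2.4 ** 2 / d              # assumed default scaling
    sigma0 = np.asarray(sigma0, dtype=float)
    lp = log_target(x)
    chain = np.empty((n_iter, d))
    mean = np.zeros(d)                 # running mean of the path
    m2 = np.zeros((d, d))              # running sum of outer products
    n_accept = 0
    for t in range(n_iter):
        if t < adapt_start:
            prop_cov = c0 * sigma0     # pre-adaptation phase, fixed covariance
        else:
            # empirical covariance of the t states visited up to step t - 1
            prop_cov = c1 * (m2 / (t - 1)) + jitter * np.eye(d)
        proposal = rng.multivariate_normal(x, prop_cov)
        lp_prop = log_target(proposal)
        # symmetric proposal, so the acceptance ratio is the target ratio
        if np.log(rng.uniform()) < lp_prop - lp:
            x, lp = proposal, lp_prop
            n_accept += 1
        chain[t] = x
        # Welford update of the running mean and covariance
        delta = x - mean
        mean = mean + delta / (t + 1)
        m2 = m2 + np.outer(delta, x - mean)
    return chain, n_accept / n_iter


# Toy target: a correlated 2-D Gaussian standing in for the model posterior.
PRECISION = np.linalg.inv(np.array([[1.0, 0.8], [0.8, 1.0]]))


def toy_log_target(x):
    return -0.5 * x @ PRECISION @ x


chain, acc_rate = adaptive_metropolis(toy_log_target, [3.0, -3.0], np.eye(2),
                                      n_iter=6000, adapt_start=1000, c0=0.5)
print("acceptance rate:", acc_rate)
```

In the setting of the paper, `log_target` would solve the SIR system for the proposed parameters and return the log of the Poisson likelihood times the prior; here it is replaced by a closed-form density so the sketch runs on its own.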
Representative trace plots are shown in appendix A. We also compute Gelman-Rubin statistics based on the last 50,000 iterations and five replicate chains initiated by sampling from a Gaussian centered at the MLE with the bootstrap estimate of the covariance. This is done for every state for p = 0.01. All of the statistics are less than 1.03, which does not suggest problems with non-stationarity. 5
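For readers unfamiliar with the diagnostic, the sketch below computes the classic Gelman-Rubin potential scale reduction factor for one scalar parameter from several replicate chains. The text does not say which variant of the statistic was used, so the formula here (between-chain versus within-chain variance, without split chains or rank normalization) is an assumption, as is the function name `gelman_rubin`.

```python
import numpy as np


def gelman_rubin(chains):
    """Potential scale reduction factor for an (m chains, n draws) array."""
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    between = n * chain_means.var(ddof=1)          # B: between-chain variance
    within = chains.var(axis=1, ddof=1).mean()     # W: mean within-chain variance
    var_hat = (n - 1) / n * within + between / n   # pooled variance estimate
    return float(np.sqrt(var_hat / within))


rng = np.random.default_rng(1)
# Five hypothetical chains that agree with each other ...
well_mixed = rng.normal(0.0, 1.0, size=(5, 2000))
# ... and five that are stuck around different values.
stuck = well_mixed + np.arange(5)[:, None]

rhat_good = gelman_rubin(well_mixed)
rhat_bad = gelman_rubin(stuck)
print(rhat_good, rhat_bad)
```

Values close to 1, as reported above, indicate that the replicate chains are indistinguishable from one another; values well above 1 indicate that they have not yet converged to a common distribution.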

Results
We fit our model to the daily number of deaths for several states in the United States: California, Florida, New York, and Washington. We use the state-level data compiled by the New York Times. The last date in our training data is April 30, 2020. We base the state populations on 2018 US state population estimates from World Population Review, which pulls data from the US Census. We take the date of implementation of social distancing to be the first day on which restaurants and schools were both closed statewide by executive order, as recorded in a GitHub repository maintained by researchers at the University of Washington. This definition is of course somewhat arbitrary, but given limitations of the available data, it is difficult to conceive of a richer model that would allow the effects of social distancing to phase in gradually without making similarly strong assumptions about how much each type of measure is 'worth' compared to the eventual statewide lockdowns that were implemented everywhere. Figure 3 shows model fit for each of the states for p = 0.01. The gray bands show point-wise 95% posterior predictive intervals; the black line shows the posterior mean. The red lines show a seven day moving average of the daily deaths, while the gray lines show the raw daily reported deaths. We include the moving average for reference because we believe some of the variability in the daily raw data is likely due to delays in reporting rather than inherent variability in the timing of deaths. Despite the low-dimensionality of the free parameters in our model, it fits the data reasonably well. On several occasions, the true values exhibit large spikes or valleys that stray outside of the intervals. This suggests that perhaps a negative binomial or other over-dispersed count distribution be considered for future models, though of course one expects to occasionally see values outside of a pointwise 95% interval. 
We also believe that some of the extremes of day to day variation in the raw data are recording artifacts. For example, there is a persistent trend of reported deaths temporarily falling on weekends and spiking early in the week. There is no obvious way to re-allocate the data to weekend days, and since our model is not very sensitive to variation on the scale of a few days, we do not attempt to do so. The comparison of the posterior mean to the seven day moving average provides another way to visualize the performance of the model that is not as sensitive to recording artifacts. Figure 3 includes data through May 13, 2020, the last 13 days of which were not used for training. The values shown after April 30, 2020 (marked with a vertical line in Figure 3) are projections from the model and associated 95% pointwise posterior credible intervals. For Florida, New York, and Washington, the projections are fairly accurate over this two week timespan and the moving average largely falls within the intervals. For California, the model predicts a slowly moving increase in deaths whereas the data in that time period appears to have reached a plateau. This discrepancy could occur if the true time to death distribution were slightly longer than that indicated by our chosen θ. If that were true, the model would 'expect' changes to the trajectory of deaths to show up in the data before they actually do. Because the time elapsed between the promulgation of social distancing orders and the end of our training data is only about 45 days, whereas the time from infection to death can be as long as 40 days, this might cause the model to erroneously infer that the trajectory had not slowed. Another possibility is that behavioral changes did not coincide with the executive orders but slowly increased since then, causing additional slowing in the death rate. 
Because the parameters are constant during the full time period after the executive orders have been issued, this would cause the estimated parameters to over-estimate the average effect immediately after the orders were issued and under-estimate it later.
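The seven day moving average used as a reference in Figure 3 can be illustrated as follows. The sketch uses a trailing window (whether the authors used a trailing or centered window is not stated) and made-up counts with a weekend dip and an early-week catch-up, mimicking the recording artifact described above; the function name `moving_average` is likewise an assumption.

```python
import numpy as np


def moving_average(daily, window=7):
    """Trailing moving average; the first window - 1 entries are NaN."""
    daily = np.asarray(daily, dtype=float)
    out = np.full(daily.shape, np.nan)
    if daily.size >= window:
        kernel = np.ones(window) / window
        out[window - 1:] = np.convolve(daily, kernel, mode="valid")
    return out


# Hypothetical reported deaths, Monday to Sunday: the weekend shortfall
# is made up early in the following week, so every week totals 700.
weekly_pattern = np.array([130, 130, 100, 100, 100, 70, 70])
reported = np.tile(weekly_pattern, 4)
smoothed = moving_average(reported)
print(smoothed[6:])
```

Because every seven day window contains each weekday exactly once, the purely weekly reporting pattern cancels and the smoothed series is flat, which is why the moving average is a less artifact-sensitive point of comparison for the posterior mean.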

Figure 3. Posterior mean (black dashed line), posterior 95% pointwise credible interval (gray region), raw data (gray lines), and seven day moving average of raw data (red lines). The vertical black line indicates the last day of data used to fit the model. Note that the scale of the vertical axis differs by state.

Table 1 shows posterior means and 95% posterior credible intervals of pre-intervention $R_0$ for each state for all five values of p. With the exception of Washington, where a larger value of p leads to considerably smaller estimated values of $R_0$, these estimates are not especially sensitive to p (good news since this is the parameter about which we are most uncertain), and all fall within the range between about 2.5 and 4. That the estimates are not very sensitive to p may seem surprising, since for smaller values of p, the number of infected individuals at any time must be larger to produce the observed deaths. However, because our model also estimates $T_0$ (the time of the initial infection) and γ, it is possible to produce quite different time series of new infections with similar $R_0$ values. The data thus indicate that changing the assumption about p would imply changes in the other parameters that keep $R_0$ nearly constant.

5.2. Undercount estimates. One quantity estimated by our model is the cumulative number of SARS-CoV-2 infections over time. Comparing these estimates with the reported number of confirmed cases by state allows us to estimate the extent to which confirmed cases of COVID-19 undercount the number of infections. We define the undercount at each time point to be our estimated number of cumulative infections as of that time point divided by the cumulative number of confirmed cases at that time. That is, the undercount is the multiplicative factor by which the recorded number of confirmed cases under-estimates the true number. Figure 4 shows the undercount for each state across time for each value of p, based on the posterior mean of the total number of people ever infected by day. Naturally, when p is smaller, and therefore the number of deaths is a smaller proportion of the true number of infections, we estimate the undercount to be greater. Under the range of values for p considered, we find that the undercount was somewhere between 5-fold (for Washington with p = 0.025) and 175-fold (for New York with p = 0.002) across states as of April 1. As testing availability has increased, the undercount factor has fallen to between about 2.5-fold (for Washington and Florida with p = 0.025) and about 45-fold (for both New York and California with p = 0.002) as of April 30. Table 2 shows the undercount and associated 95% posterior credible intervals as of the last day that appears in our data, April 30, 2020.
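The undercount defined above is a simple ratio of two cumulative series. A minimal sketch follows, with an assumed function name `undercount_factor` and purely hypothetical input numbers; days with no confirmed cases are returned as NaN because the ratio is undefined there.

```python
import numpy as np


def undercount_factor(est_cum_infections, cum_confirmed_cases):
    """Estimated cumulative infections divided by cumulative confirmed cases."""
    est = np.asarray(est_cum_infections, dtype=float)
    confirmed = np.asarray(cum_confirmed_cases, dtype=float)
    out = np.full(est.shape, np.nan)
    # leave NaN wherever no cases have been confirmed yet
    np.divide(est, confirmed, out=out, where=confirmed > 0)
    return out


# Hypothetical series for three dates (not taken from the paper's results).
est_infections = [50_000, 400_000, 1_500_000]
confirmed_cases = [0, 10_000, 100_000]
factors = undercount_factor(est_infections, confirmed_cases)
print(factors)
```

In this made-up example the factor falls from 40 to 15 as confirmed cases catch up with infections, the same qualitative pattern the text describes as testing availability increased.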

p       CA                    FL                    NY                    WA
0.002   43.30 (39.29,47.72)   31.92 (28.04,35.88)   44.90 (44.30,45.46)   33.77 (30.87,36.92)
0.005   17.60 (15.95,19.45)   12.89 (11.16,14.49)   19.25 (18.89,19.59)   13.59 (12.29,14.88)

These estimated undercounts, particularly those for p = 0.01, p = 0.015, and p = 0.025, are consistent with other estimates of the extent of undercounting. The first wave of a random-sample testing study in Indiana, conducted by researchers at Indiana University and the Indiana State Department of Health, concluded in a preliminary report that the number of positive tests undercounted the number of people ever infected by a factor of 9.6 (Menachemi et al., 2020). They also estimate the infection fatality rate at 0.006. We arrive at undercount factors of between 6.5 and 9.8 for the four states analyzed with p = 0.01 and between 12.9 and 19.25 with p = 0.005. This places our estimates within a similar range of values as those found using a combination of serological and viral RNA testing in Indiana.
New York State also recently conducted a seroprevalence study, which found that about 21.2% of New York City residents tested positive for having had the disease. Applied to New York City's population of 8.4 million people, this results in an estimate of about 1.8 million infections in New York City, compared with about 153,000 reported cases at the time. This implies an undercount factor of a little over 11. Compared with around 11,460 reported deaths at the time of the release of this study, this also implies a raw IFR of about p = 0.006. 6 Given that some people who are currently infected will, unfortunately, go on to die, this likely provides a slight under-estimate of p.
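The arithmetic behind the New York City comparison can be reproduced directly from the figures quoted in the paragraph above; only the variable names are introduced here.

```python
# Figures quoted in the text for New York City.
nyc_population = 8_400_000
seroprevalence = 0.212
reported_cases = 153_000
reported_deaths = 11_460

est_infections = seroprevalence * nyc_population   # about 1.78 million
undercount = est_infections / reported_cases       # a little over 11
raw_ifr = reported_deaths / est_infections         # about 0.006

print(f"estimated infections: {est_infections:,.0f}")
print(f"undercount factor:    {undercount:.1f}")
print(f"raw IFR:              {raw_ifr:.4f}")
```

The raw IFR computed this way ignores deaths that had not yet occurred among people already infected, which is why the text treats it as a slight under-estimate.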
These kinds of estimates can also help inform discussions about herd immunity. For example, in New York as of April 30, there had been about 311,000 cases. In our p = 0.01 scenario, we estimate the undercount factor in New York at about ten as of this date. This would mean that about 3.1 million people had been infected in New York between the start of the epidemic and April 30, which is between one sixth and one seventh of the population. This is far too small a proportion for herd immunity to be a major factor at that point. However, an analyst who assigns high probability to the p = 0.002 case would conclude that close to 14 million people (the vast majority of the population) had been infected by April 30, and therefore that herd immunity is sufficient that almost all mitigation efforts could now be lifted and the number of infections would continue to decline. We find this latter scenario to be highly unlikely.

5.3. Effects of executive orders for social distancing. Executive orders were issued and social distancing policies began taking effect in parts of the United States around March 15. As of the writing of this paper, enough time has passed since the orders in at least some states that their effects will be visible in the deaths data. Recalling the time from infection to death distribution from Figure 2, changes in the dynamics of new infections should begin to be visible in death data around two weeks following the onset of the policy change, and should be mostly visible by around four weeks. It is therefore now appropriate to attempt to estimate the effect of such social distancing orders on infection dynamics. In the SIR model, the number of infected individuals will grow when the rate of new infections is higher than the rate of removal and will decrease otherwise. In other words, the rates are equal when the time derivative of $i_t$ is zero, i.e., when

$$\rho_t \equiv \frac{\phi\beta}{\eta\gamma}\, s_t = 1.$$

We thus focus on estimating the quantity $\rho_T$ (where T is the last day that appears in our training data). When $\rho_T > 1$, the number of infected individuals is still growing, while if $\rho_T < 1$, it is declining and the virus is being suppressed. This quantity is sometimes called $R_t$, but to avoid confusion with the compartments of the SIR model, we have used unconventional notation. It is important to note that, because this quantity depends on the current state of $s_t$, which is decreasing in time, it is possible for the current measures to fail to suppress the virus today yet be sufficient to suppress it at some later time, because $s_t$ will have declined. Table 3 shows estimates of the posterior mean of $\rho_T$ along with 95% posterior credible intervals. Results are reported for four states and for all five values of p. It is clear that this quantity is not very sensitive to the assumption about p, which is encouraging since this is the major assumption of our model. Based on these results, the current measures appear sufficient to suppress the virus in New York and Washington, and just at the boundary of being sufficient in California and Florida. It will be interesting to see how these conclusions change in the coming weeks as social distancing and other mitigation policies are relaxed.

Table 3. Estimated posterior mean of $\rho_T$, with 95% posterior credible interval shown in parentheses.
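To make the growth threshold concrete, the sketch below evaluates ρ under the assumption that, after the intervention, the SIR model has transmission rate φβ and removal rate ηγ, so that infections grow exactly when φβ·s > ηγ. That functional form is inferred from the parameter descriptions in the text, and the function names and numeric values are hypothetical.

```python
def rho(beta, gamma, phi, eta, s):
    """Ratio of the new-infection rate to the removal rate at susceptible share s.

    Assumes post-intervention transmission rate phi * beta and
    removal rate eta * gamma (an assumption of this sketch).
    """
    return (phi * beta * s) / (eta * gamma)


def suppression_threshold(beta, gamma, phi, eta):
    """Susceptible share below which infections decline (rho < 1)."""
    return (eta * gamma) / (phi * beta)


# Hypothetical parameter values, chosen only to illustrate the threshold.
params = dict(beta=0.5, gamma=0.2, phi=0.5, eta=1.0)

rho_early = rho(s=0.95, **params)   # most of the population still susceptible
rho_late = rho(s=0.75, **params)    # susceptible share has declined
s_star = suppression_threshold(**params)
print(rho_early, rho_late, s_star)
```

With these made-up values the same measures fail to suppress the virus while 95% of the population is susceptible but succeed once the susceptible share drops below 80%, which is the dependence on the susceptible share that the text points out.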
Our model could be sensitive to various types of misspecification, including incorrect specification of θ or p, incorrect specification of the time point $T_1$ at which the parameters of the model change, and incorrect specification of the SIR model itself. Because we give results for multiple values of p, misspecification of p is less of a concern. In appendix B, we conduct a simulation study to assess the sensitivity of estimates of the quantities of interest, such as the undercount factor and $R_0$, to the other types of misspecification. The results suggest that the model is fairly robust to these sources of misspecification, except in the case of estimation of $R_0$ when $T_1$ is misspecified. However, we conclude that badly misspecifying $T_1$ is often detectable through poor model fit. Details are provided in appendix B.

5.4. A prior on p. We have so far presented results for different values of p. The parameter p is not identifiable when the parameters of the SIR model are estimated from the data. When performing subjective Bayesian analysis for an application in which there is widespread disagreement about a non-identifiable parameter, such as p in this case, it makes sense to present multiple cases and allow readers to weight the different cases according to their own priors. 'Group' Bayesian decision making is an especially fraught problem with no real solution, and thus when a parameter cannot be updated with data and there clearly exists widespread disagreement about its value, it is especially important to allow analysts to impose their own prior on it. In this section, we present results using our own prior on p to demonstrate how one could obtain a single posterior estimate for parameters of the SIR model, averaging over their prior on p.
We assign a prior on p based on our own reading of the literature, media accounts, statements of public health officials, and interactions with epidemiologists and other experts. Despite this, our prior is certainly subjective, and the results we present here should not be taken to supplant the results earlier in the paper. Our prior is a Gamma(30, 3000) distribution, truncated to the interval [0.002, 0.025], and discretized to support points 0.002, 0.003, . . . , 0.024, 0.025. This prior has mean of approximately 0.01 and places most of its mass between 0.005 and 0.015.
To obtain results, we run the MCMC algorithm described in section 4 for p taking each value in the support, then average the samples with weights given by the prior. Results are shown in Table 4. As expected, the posterior mean for the undercount is similar to what we obtained with p fixed at 0.01, but credible intervals are considerably wider. This is indeed the main difference between putting a prior on p and fixing p: because the undercount is sensitive to p, putting a prior on p results in much wider credible intervals for the undercount than with any fixed value of p. In most cases, the results for $R_0$ and $\rho_T$ do not change considerably compared to any of the results with fixed p, since these quantities were not particularly sensitive to the assumed value of p.

      $R_0$               $\rho_T$            Undercount
CA    3.09 (1.89,3.87)    1.09 (1.03,1.19)    9.00 (6.33,13.29)
FL    3.20 (2.46,3.68)    1.02 (0.88,1.17)    7.00 (4.54,9.75)
NY    3.89 (3.64,4.00)    0.68 (0.65,0.70)    10.00 (7.02,14.00)
WA    2.74 (2.08,3.09)    0.75 (0.66,0.83)    7.00 (4.89,10.20)

Table 4. $R_0$, $\rho_T$, and the estimated undercount (95% posterior credible intervals) calculated by averaging samples over a prior on p.
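The averaging step can be sketched as follows. The example builds the discretized, truncated prior by evaluating a Gamma density with shape 30 and rate 3000 on the stated support and normalizing, then pools per-p sample sets with those weights. Reading Gamma(30, 3000) as shape and rate, discretizing by point evaluation, the function name `pooled_summary`, and the synthetic "undercount" samples are all assumptions of this illustration.

```python
import numpy as np

# Support 0.002, 0.003, ..., 0.025 and unnormalised Gamma(shape, rate) density.
support = np.arange(2, 26) / 1000.0
shape, rate = 30.0, 3000.0
log_dens = (shape - 1.0) * np.log(support) - rate * support
weights = np.exp(log_dens - log_dens.max())
weights /= weights.sum()
prior_mean = float(np.sum(weights * support))


def pooled_summary(samples_by_p, weights, probs=(0.025, 0.975)):
    """Weighted mean and quantiles of samples pooled across values of p."""
    values = np.concatenate(samples_by_p)
    # each sample from the k-th run carries weight w_k / (size of that run)
    w = np.concatenate([np.full(len(s), wk / len(s))
                        for s, wk in zip(samples_by_p, weights)])
    order = np.argsort(values)
    values, w = values[order], w[order]
    cdf = np.cumsum(w) / w.sum()
    mean = float(np.sum(values * w) / w.sum())
    idx = [min(int(np.searchsorted(cdf, q)), len(values) - 1) for q in probs]
    return mean, float(values[idx[0]]), float(values[idx[1]])


# Synthetic stand-in for per-p MCMC output: a quantity that scales like 1/p.
rng = np.random.default_rng(2)
samples_by_p = [rng.normal(0.09 / p, 0.3, size=500) for p in support]
mean, lower, upper = pooled_summary(samples_by_p, weights)
print(prior_mean, mean, lower, upper)
```

Because the pooled interval mixes runs with very different centers, it is much wider than the interval from any single run, which mirrors the widening of the undercount intervals reported in Table 4.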

Sensitivity Analysis
In this section we address sensitivity of our findings to the value we have chosen for θ and to the assumption that the death count data is an accurate representation of the number of COVID-19 attributable deaths.
6.1. Alternative θ. One of the first major interventions outside of China to stem the spread of COVID-19 was the lockdown in Italy on March 9, 2020. Around 20 days later, the number of daily deaths in Italy showed a sustained downward trend. We adjust the θ used in the main text to an average time to death of 20 days consistent with Italy. We then see what effects using this θ might have on our results.
To create a θ based on an average time to death of about 20 days, we induce a correlation between the time to symptom onset and the time to death for those who died. Specifically, we simulate the time from infection to symptom onset as before from a Poisson-Gamma distribution with parameters {a, b} = {5.5, 1.1} to be consistent with Lauer et al. (2020). We then simulate the time from symptom onset to death from a Poisson-Gamma distribution with parameters {a, b} = {t 1 −t 1 + 5.5, 1.1}, where $t_1$ is the simulated time from infection to symptom onset. This induces a positive correlation such that those who had faster symptom onset also had faster time to death from symptom onset on average. This results in an alternative θ with average time to death of about 20.5 days and a correlation between incubation time and time from symptom onset to death of about 0.35. This alternative θ is shown in Figure 5.

Figure 5. Alternative θ.
6.2. Excess mortality-inflated death counts. Using the raw death count data assumes that deaths due to COVID-19 have been accurately recorded. Recent studies on excess mortality suggest that, in some states, there has been under-reporting of COVID-19 deaths. For example, Weinberger et al. (2020) calculate excess mortality due to pneumonia and influenza (P&I) from February 9, 2020 to March 28, 2020. They note that in California, while only 101 COVID-19 deaths were reported during this time period, there were 399 excess deaths due to P&I. If we interpret all of those deaths to be COVID-related, this implies that only about one quarter of COVID-19 deaths were reported as such. Similar analysis for Florida and Washington showed reporting rates of about 30% and 100%, respectively. Though New York did not show under-reporting when comparing only to pneumonia and influenza deaths, due to the large volume of cases there, the authors noted a reporting rate of 60% in New York City and 40% in New York State (excluding New York City) based on all-cause excess mortality. We assume that a rough estimate of the reporting rate for New York State in its entirety is around 50%.
Based on these findings, we inflate the number of observed deaths on each day by a factor of four, three, two, and one for California, Florida, New York, and Washington, respectively. Inflating the data in this way implicitly assumes that excess mortality during this time period is all due to COVID-19 infections. While many of the excess deaths during this time period may be the result of the environment caused by COVID-19 (e.g., not going to the hospital for other illnesses out of fear of contracting COVID-19, reduction in medical services available to non-COVID-19 patients, and saturation of emergency medical services), it is unlikely that all are directly attributable to infection itself. This calculation also assumes that the rate at which COVID-related deaths were not attributed to COVID-19 in official records is constant across time. As testing capacity has expanded over time and the estimates we use are based on data only through March 28, this is unlikely to be true. Because not all excess mortality is due to COVID-19 disease, these two cases (raw death counts presented as our main results and the excess mortality-inflated death counts presented here) likely bracket the true number of deaths. However, more updated estimates of P&I excess mortality do not seem to be available. Even all-cause excess mortality estimates from the CDC are lagged by a few weeks.
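The inflation step amounts to multiplying each state's daily death series by a state-specific constant. The sketch below derives the multipliers by rounding the reciprocal of the approximate reporting rates quoted above; the rounding rule is an assumption that happens to reproduce the stated factors, and the daily counts and function name `inflate_deaths` are hypothetical.

```python
import numpy as np

# Approximate reporting rates quoted in the text.
reporting_rate = {"CA": 0.25, "FL": 0.30, "NY": 0.50, "WA": 1.00}

# Assumption: the multiplier is the reciprocal rate rounded to an integer.
multiplier = {state: round(1.0 / r) for state, r in reporting_rate.items()}


def inflate_deaths(daily_deaths, state):
    """Scale a state's daily death counts by its constant multiplier."""
    return np.asarray(daily_deaths) * multiplier[state]


# Hypothetical daily counts for illustration only.
example = inflate_deaths([3, 5, 8, 13], "CA")
print(multiplier, example)
```

A single constant per state encodes the assumption, noted above as unlikely to hold exactly, that the reporting rate did not change over time.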

6.3. Alternative results. Results from applying our model with the alternative θ and for excess mortality (labeled as 'multiplier') are given below. Each of these was run for p = 0.01. Table 5 gives the estimated $R_0$ under each of these alternative scenarios. In neither case do the estimates of $R_0$ substantively change. Point estimates in the alternative scenarios fall within the posterior credible intervals of the model fit in the primary scenarios given in the main text.

Table 5. Estimated posterior mean of $R_0$, with 95% posterior credible interval shown in parentheses. Rows correspond to two alternative scenarios considered: death counts adjusted using an excess mortality multiplier and alternative value of θ used. In both cases, p = 0.01.

Table 6. Estimated posterior mean of $\rho_T$, with 95% posterior credible interval shown in parentheses. Rows correspond to two alternative scenarios considered: death counts adjusted using an excess mortality multiplier and alternative value of θ used. In both cases, p = 0.01.

Table 7 shows the estimated factor by which current testing undercounts the actual number of infections as estimated by our model. This is analogous to Table 2 in the main text. Unsurprisingly, we found that inflating the number of COVID-19 deaths results in higher estimates of the total infected and, consequently, a larger undercount factor (relative to results for p = 0.01 in the main text). Using the alternative value of θ has minimal impact on estimates of the undercount, with three of the four point estimates calculated under the alternative θ falling within the posterior credible intervals obtained in the main analysis. Notably, under the alternative θ, New York sees a statistically significant though not substantively meaningful decrease in the undercount factor.
In all but one of the calculations under alternative scenarios, we find that the alternative parameters do not substantively change the results. The only exception is the undercount factors, which, unsurprisingly, increase when we add a multiplier to the time series to account for possible undercounting. However, early evidence suggests that accounting for under-reporting would change the death counts by at most a factor of 4, and this is likely a significant over-estimate for the reasons outlined above. Since increasing the death counts with p fixed has a similar effect on our analysis to decreasing p with the death counts fixed, and the values of p we consider in the main analysis range from 0.2% up to 2.5% (a factor of 12.5), the variation in results that is possible due to undercounting of deaths is already well represented by variation in p in the scenarios we consider.

Conclusion
We have built a model for the transmission of SARS-CoV-2 using only information which has a reasonable chance of being measured correctly: the observed number of daily deaths, timing of containment measures, and information on the clinical progression of the disease. In contrast to models that use a wider range of information that may be less precisely measured, we believe the main strengths of our approach are that it prioritizes simplicity, interpretability, and identifiability. Because not all parameters one might want to estimate in this setting are separately identifiable, we have carefully specified our model so that all estimated parameters are identifiable conditional on quantities for which we have high quality auxiliary information. Our model also has a proper likelihood and is fit to data in the Bayesian paradigm, rather than relying on ad hoc calibration of model parameters to produce trajectories resembling the observed data. This allows us to formally account for uncertainty in all of our estimates by posterior credible intervals. Our model is underpinned by an SIR model of infection dynamics to link observed deaths to the underlying unobserved infections. It would not be difficult with our approach to substitute some other model in place of the SIR. In particular, any other compartmental model, such as a Susceptible-Exposed-Infected-Removed (SEIR) model, could be used, or the model could be elaborated to incorporate more change points in the parameters of the compartmental model to allow fine-grained policy analysis. 7

We estimate that official case counts substantially undercount the number of infections. This is not a surprise. Despite recent increases in testing capabilities, our analysis suggests that testing continues to undercount the number of infections by a factor between 6 and 20 for p = 0.01 and p = 0.005, the values of p we consider most likely. Future work could incorporate IFRs that vary by time or other covariates.
Seroprevalence studies done in two states in the United States reach similar conclusions with respect to undercount. Such studies have only recently become available and do not give snapshots of undercount over time. We believe that the correspondence between our estimates and serological studies that have since been done points to the utility of such a model for estimating key quantities of interest (such as the extent of undercounting), especially early in the epidemic when testing capabilities are not yet built up.
While the seroprevalence studies using random design are just now reporting results, our initial estimates were first posted in a preprint in late March. This suggests that approaches similar to the one described here could be useful in providing indications of the undercount and the total number of infections of an emerging disease in the early days of its spread. During this time, when the number of cases is relatively low and the rate of spread is nearly exponential, accounting for the time lag between infection and death as we do here can lead to substantial differences in estimates of the total number of cases relative to naive methods, such as simply dividing the total number of deaths by the IFR.
Footnote 7. A GitHub repository with extensions to the current model and applications to other countries' data can be found at https://github.com/paulo-o/covid19. This repository is still in development and contains ongoing work by two of this paper's authors, James Johndrow and Kristian Lum, as well as Paulo Orenstein (Instituto de Matemática Pura e Aplicada (IMPA)).
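The gap between the naive estimate (total deaths divided by the IFR) and the true number of infections during exponential growth can be illustrated with a minimal sketch. It is not the model of this paper: it assumes a single fixed lag between infection and death in place of the full infection-to-death distribution, and the growth rate, lag, and IFR are hypothetical values chosen only for illustration.

```python
import math


def cumulative_infections(day, nu0, rate):
    """Total infections through `day` when daily new infections are nu0 * exp(rate * t)."""
    return sum(nu0 * math.exp(rate * t) for t in range(day + 1))


def naive_estimate(day, nu0, rate, ifr, lag):
    """Expected deaths to date divided by the IFR.

    ASSUMPTION: every death occurs exactly `lag` days after infection, so deaths
    observed by `day` reflect only infections that occurred by `day - lag`.
    """
    if day < lag:
        return 0.0
    expected_deaths = ifr * cumulative_infections(day - lag, nu0, rate)
    return expected_deaths / ifr


# Hypothetical inputs: growth of 0.2 per day, 1% IFR, 20-day lag, evaluated on day 60.
nu0, rate, ifr, lag, day = 1.0, 0.2, 0.01, 20, 60

true_total = cumulative_infections(day, nu0, rate)
naive_total = naive_estimate(day, nu0, rate, ifr, lag)
ratio = true_total / naive_total

print(f"true / naive = {ratio:.1f}; exp(rate * lag) = {math.exp(rate * lag):.1f}")
```

Under these assumptions the naive estimate recovers the infections that existed one lag period earlier, so it falls short of the current total by roughly a factor of exp(rate × lag).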

Our model also suggests wide variability in the initial transmissibility of the disease prior to intervention policies, as well as wide variability in the effects of those policies. For example, we estimate that New York had the highest R_0 at the outset of the outbreak. This is unsurprising considering New York City has so far been the epicenter of the epidemic in the United States. However, estimates of ρ_T suggest that the spread of the virus in New York has slowed to the lowest rate among the states considered in this analysis. Whereas in New York estimates of ρ_T indicate a continued, fairly rapid decrease in deaths, estimates of ρ_T for California and Florida are very near to one, indicating that sharp declines in deaths are unlikely.
All of these projections are conditioned on the current policies staying in place and their effects on transmission remaining constant. Tightening or relaxing mitigation policies, or 'quarantine fatigue' leading to a decrease in compliance by the public, would impact the trajectory of the spread. Estimates of ρ_T near one indicate that there is little room for relaxation of mitigation policies without causing an increase in the number of infections and deaths. As mitigation measures are tightened and relaxed over time, an elaboration of our model incorporating multiple change points in the parameters may be useful for future projections and evaluation of the continued efficacy of the measures.
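To make concrete why values of ρ_T near one imply slow declines, the following sketch computes how long infections would take to halve if the effective reproduction number stayed fixed. It rests on a simplifying assumption not made in the paper's projections: the susceptible fraction is treated as constant, so that in an SIR model infections change in proportion to exp((ρ − 1) × removal rate × t). The function name and the removal rate of 1/9 per day are illustrative choices.

```python
import math


def halving_time(rho, removal_rate):
    """Days for infections to halve if the effective reproduction number stays at rho.

    ASSUMPTION: the susceptible fraction is roughly constant, so that
    I(t) is proportional to exp((rho - 1) * removal_rate * t).
    """
    if rho >= 1.0:
        return math.inf  # infections are flat or growing; they never halve
    return math.log(2.0) / ((1.0 - rho) * removal_rate)


removal_rate = 1.0 / 9.0  # illustrative removal rate per day

for rho in (0.72, 0.90, 0.98):
    print(f"rho = {rho:.2f}: halving time of about {halving_time(rho, removal_rate):.0f} days")
```

Because the halving time scales with 1 / (1 − ρ), moving ρ from 0.72 to 0.98 lengthens it by more than an order of magnitude, and any relaxation that pushes ρ above one reverses the decline. Deaths lag infections, so their decline would follow with a delay.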
Disclosure Statement. The authors have no conflicts of interest to declare.
Figure A1. MCMC samples of γ^{-1}, η, φ, R_0, and T_0 for New York in the case where p = 0.01.
Table A2. Multivariate Gelman-Rubin PSRF.

Appendix B. Simulation Study
We conduct a simulation study to assess sensitivity of the method to various assumptions. We analyze five simulation cases. In the first four cases, the daily new infections are obtained from the two-period SIR model in (3.4), but we explore several ways in which other assumptions can be violated. The parameters of the SIR model are: T_0 = 35 (corresponding to February 4, since we index January 1 as time 1), γ = 1/9, R_0 = 3.8, φ = 0.3, η = 1.4, and T_1 = 78 (corresponding to March 18, 2020). These parameters were selected to give a death series looking roughly like that observed in New York with p = 0.01. In every case, the Poisson model in (3.3) is used to generate deaths given new infections with p = 0.01 and θ shown in Figure 2. We then perform inference using our method. Since we already provide results for a range of values of p, we do not include variation in p here.
(1) The model is properly specified.
(2) We use the alternative (incorrect) value of θ displayed in Figure 5 when performing inference.
(3) We assume that T_1 is one week too early (corresponding to March 11 instead of the true value of March 18) when performing inference.
(4) We assume that T_1 is one week too late (corresponding to March 25 instead of the true value of March 18) when performing inference.
(5) The true new infection series ν is generated by a susceptible-exposed-infected-removed (SEIR) model with two time periods rather than an SIR model with two time periods, but otherwise the model is correctly specified. The SEIR model is a popular alternative to the SIR model for modeling SARS-CoV-2. We choose the parameters of the SEIR model to give a death series similar to that for the other four cases.
Figure A2 shows the simulated data used in Cases 1-4 (left column) and Case 5 (right column). The top panels show deaths, the middle row of panels shows new infections, and the bottom row shows the state variables of the ODE model (SIR in the left column and SEIR in the right column) through a period of 121 days (corresponding to the dates January 1, 2020 through April 30, 2020).
Figure A2. Simulated data for examples. Cases 1-4 (shown in the left column) use an SIR model. Case 5 (shown in the right column) uses an SEIR model. Values for the bottom panel displaying the SEIR model are displayed as a proportion of the total population and the S compartment has been omitted so that detail for the E, I, and R compartments would be visible.
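The data-generating process for Cases 1-4 can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' code. The roles given to φ and η (scaling the transmission and removal rates from T_1 onward) are an assumption; it is consistent with the stated true value of ρ_T, since 3.8 × 0.3 / 1.4 × (1 − 0.117) ≈ 0.72, but equation (3.4) should be consulted for the actual definition. The population size, the single initial infection, the Euler integration, and the gamma-shaped delay weights standing in for θ of Figure 2 are likewise hypothetical, so the printed numbers are not expected to reproduce the values reported in this appendix.

```python
import math
import random


def poisson_draw(mean, rng):
    """Poisson sample: Knuth's method for small means, rounded normal approximation otherwise."""
    if mean <= 0.0:
        return 0
    if mean > 30.0:
        return max(0, round(rng.gauss(mean, math.sqrt(mean))))
    limit, count, prod = math.exp(-mean), 0, rng.random()
    while prod > limit:
        count += 1
        prod *= rng.random()
    return count


def simulate_new_infections(n_days=121, t0=35, t1=78, r0=3.8, gamma=1 / 9, phi=0.3,
                            eta=1.4, population=19_450_000, steps_per_day=10):
    """Euler-integrate a two-period SIR model; return (daily new infections, final S)."""
    beta, dt = r0 * gamma, 1.0 / steps_per_day
    s, i = float(population), 0.0
    daily = []
    for day in range(1, n_days + 1):
        new_today = 0.0
        if day == t0:  # ASSUMPTION: the epidemic starts from a single infection at T_0
            s, i, new_today = s - 1.0, i + 1.0, 1.0
        # ASSUMPTION: from T_1 onward, transmission is scaled by phi and removal by eta
        b, g = (phi * beta, eta * gamma) if day >= t1 else (beta, gamma)
        for _ in range(steps_per_day):
            infections = b * s * i / population * dt
            s -= infections
            i += infections - g * i * dt
            new_today += infections
        daily.append(new_today)
    return daily, s


def delay_weights(max_lag=60, shape=5.0, scale=4.6):
    """HYPOTHETICAL stand-in for theta: gamma-shaped weights over lags 1..max_lag."""
    raw = [k ** (shape - 1.0) * math.exp(-k / scale) for k in range(1, max_lag + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def expected_deaths(new_infections, theta, ifr=0.01):
    """Mean deaths per day: the IFR times lagged new infections weighted by theta."""
    means = []
    for t in range(len(new_infections)):
        lagged = sum(theta[k - 1] * new_infections[t - k]
                     for k in range(1, min(t, len(theta)) + 1))
        means.append(ifr * lagged)
    return means


rng = random.Random(0)
population = 19_450_000  # ASSUMPTION: approximate population of New York State
nu, s_final = simulate_new_infections(population=population)
mu = expected_deaths(nu, delay_weights())
deaths = [poisson_draw(m, rng) for m in mu]

share_infected = 1.0 - s_final / population
rho_T = 3.8 * 0.3 / 1.4 * s_final / population  # under the same assumption about phi and eta
print(f"share ever infected: {share_infected:.3f}, rho_T: {rho_T:.2f}, deaths: {sum(deaths)}")
```

Replacing `simulate_new_infections` with an SEIR integrator while leaving the delay and Poisson steps unchanged would give the analogue of Case 5.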
We report estimates for R_0, ρ_T, and the proportion of the total population that has ever been infected as of the last day of the simulation in Table A3. In Cases 1-4, the true value of ρ_T is 0.72 and the true value of the total proportion of the population that has ever been infected is 11.7%, while in Case 5 they are 0.72 and 12.1%, respectively. Case 1 demonstrates that we recover the true parameters using our method (all values lie within the posterior credible intervals). Case 2 shows that the only result that has noticeable sensitivity to incorrect specification of θ is the total proportion infected. Since the true θ has a longer expected time to death than the θ used for inference in this case, it makes sense that we slightly underestimate the true proportion of the population that has been infected. Cases 3 and 4 show that estimates of R_0 and, to a lesser extent, ρ_T are sensitive to incorrect specification of T_1, the time point at which the model parameters change, but the total proportion of the population that has ever been infected is not very sensitive to misspecification of this parameter. Note, however, that being wrong by one week about T_1 typically has a noticeable effect on the model fit, to the extent that one would suspect a problem simply by looking at posterior summaries. This can be seen in Figure A3, which shows the observed death data and the posterior predictive point estimates and intervals for deaths for Case 4. The model fits poorly and is unable to find any set of parameters that reproduces the observed deaths. Thus, while there is considerable sensitivity to misspecification of T_1, in the cases we considered this misspecification can be easily detected by a posterior predictive check. Finally, Case 5 shows that, even when the SIR model itself is misspecified and the true generating process includes an exposed period in which people are not infectious, we are able to recover the truth with reasonable accuracy.
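A posterior predictive check of the kind described above can be sketched as follows. This is a generic illustration rather than the procedure used to produce Figure A3: the function name is hypothetical, deaths are assumed to be Poisson given their expected value, and the "posterior draws" are faked by repeating a known curve, whereas in practice they would come from the MCMC output.

```python
import math
import random
import statistics


def poisson_draw(mean, rng):
    """Poisson sample: Knuth's method for small means, rounded normal approximation otherwise."""
    if mean <= 0.0:
        return 0
    if mean > 30.0:
        return max(0, round(rng.gauss(mean, math.sqrt(mean))))
    limit, count, prod = math.exp(-mean), 0, rng.random()
    while prod > limit:
        count += 1
        prod *= rng.random()
    return count


def predictive_coverage(observed, mean_draws, rng):
    """Share of days whose observed deaths fall inside the 95% posterior predictive interval.

    `mean_draws` holds one expected-death curve per posterior draw.
    """
    inside = 0
    for t, obs in enumerate(observed):
        # One replicated death count per posterior draw of the expected deaths on day t.
        reps = [poisson_draw(draw[t], rng) for draw in mean_draws]
        cuts = statistics.quantiles(reps, n=40)  # cuts[0] ~ 2.5%, cuts[-1] ~ 97.5%
        inside += cuts[0] <= obs <= cuts[-1]
    return inside / len(observed)


# HYPOTHETICAL inputs: a known expected-death curve and two sets of fake posterior draws.
rng = random.Random(1)
truth = [20.0 * math.exp(0.05 * t) for t in range(40)]
observed = [poisson_draw(m, rng) for m in truth]
good_fit = [truth] * 200                        # draws centred on the true curve
poor_fit = [[3.0 * m for m in truth]] * 200     # draws far from the observed data

print("coverage, good fit:", predictive_coverage(observed, good_fit, rng))
print("coverage, poor fit:", predictive_coverage(observed, poor_fit, rng))
```

Coverage far below the nominal 95% signals that no parameter values supported by the posterior reproduce the observed series, which is the symptom the text describes for a misspecified T_1.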

Table A3. Estimated posterior mean and 95% posterior credible intervals for each simulation scenario (surviving column headers: R_0, ρ_T).
Figure A3. Observed death series (red), posterior mean of deaths (dashed black line), and posterior 95% credible interval for deaths (gray region) for Case 4.
Overall, the results suggest that, although some types of misspecification can lead to poor estimates of R_0, these types of misspecification may be detectable by posterior predictive checks. Furthermore, the total proportion of the population that has ever been infected, while highly sensitive to the p parameter, is reasonably robust to the types of misspecification we explore in this section. This suggests that estimates of the undercount, which will be good if and only if the estimate of the total population ever infected is good, are fairly robust to misspecification when p is known.