On Identifying and Mitigating Bias in the Estimation of the COVID-19 Case Fatality Rate

The relative case fatality rates (CFRs) between groups and countries are key measures of relative risk that guide policy decisions regarding scarce medical resource allocation during the ongoing COVID-19 pandemic. In the middle of an active outbreak, when surveillance data is the primary source of information, estimating these quantities involves compensating for competing biases in time series of deaths, cases, and recoveries. These include time- and severity-dependent reporting of cases as well as time lags in observed patient outcomes. In the context of COVID-19 CFR estimation, we survey such biases and their potential significance. Further, we analyze theoretically the effect of certain biases, like preferential reporting of fatal cases, on naive estimators of CFR. We provide a partially corrected version of these naive estimators that accounts for time lag and imperfect reporting of deaths and recoveries. We show that collection of randomized data by testing the contacts of infectious individuals regardless of the presence of symptoms would mitigate bias by limiting the covariance between diagnosis and death. Our analysis is supplemented by theoretical and numerical results and a simple and fast open-source codebase at https://github.com/aangelopoulos/cfr-covid-19 .


Introduction
As of May 18, 2020, the 2019 novel Coronavirus (SARS-CoV-2) outbreak has claimed at least 317,000 lives out of 4.8 million confirmed cases worldwide, of which 1.8 million recovered (Dong et al., 2020). Because the basic reproduction number $R_0$ of the virus is high (estimated to fall between 2 and 3 by Liu et al. 2020), public health organizations and local, state, and national governments must allocate scarce resources to populations especially susceptible to death during this pandemic. Therefore it is critical to have good estimates of the proportion of fatal infections of COVID-19: this quantity is referred to as the absolute case fatality rate (CFR). It is additionally important to understand the relative CFRs between different subpopulations (i.e., the ratio of their absolute CFRs). We view the relative CFR as a useful target for data-informed resource-allocation protocols because it is a key measure of relative risk. Indeed, the absolute CFR is a measure of absolute severity only for a particular population, since it averages out effects of medical care, age, geography, genetics, and more. Practical decisions will ultimately be made based on coarse stratifications of these covariates; for example, a relative CFR that is specific to a geographical region may be needed for resource planning and allocation. Similarly, a relative CFR that is specific to sex or race is often sought to monitor for and ensure equitable treatment of patients within hospitals across demographics. To facilitate such planning, we target the relative number of deaths among total cases between groups of people (e.g., senior citizens in Italy) as a critical measure of relative risk that informs decisions affecting human lives. Other such measures include prevalence and risk of hospitalization.
It is widely believed that the naive estimator of CFR, $E_{\text{naive}}$, obtained from a simple ratio of reported deaths to reported cases (and which has a value of 6.6% when applied to the data of May 18, 2020), is biased (Battegay et al., 2020; Fauci et al., 2020). Indeed, an extensive epidemiological literature has asserted this bias and presented methods that attempt to mitigate it (Donnelly et al., 2003). Bias-mitigation methods are also present in a large literature on survey sampling and weighting (Gelman, 2007). Despite this academic background, naive estimates continue to be used, reported, and cited in major publications (Lipsitch, 2020; Novel Coronavirus Pneumonia Emergency Response Epidemiology Team, 2020; Wu and McGoogan, 2020).
Since publicly available health surveillance data for COVID-19 are heterogeneous and partially observed, it is problematic to assert that any estimator uniformly outperforms the naive estimator. A variety of competing (and unknown) biases, both negative and positive, could conceivably cancel, causing the naive estimator to be closer to the true CFR despite its theoretical inadequacy. Statistical wisdom would suggest that the conundrum of conflicting biases can be remedied by studying the multi-stage process that relates data obtained by surveillance sampling to the populations that are the target of inferential assertions. It is the goal of this article to present such a statistical perspective and explore some of its consequences for COVID-19.
Clarity on the potential biases underlying the use of data from surveillance sampling can help to determine what additional datasets may be needed to mitigate bias. Examples that will inform our discussion include the New York seroprevalence study reported in Goodman and Rothfeld (2020), which helped correct significant under-ascertainment of mild cases, and Verity et al. (2020), who made use of individualized case data, polymerase-chain reaction (PCR) prevalence data, and Bayesian inferential methods, resulting in a CFR estimate of 1.38%. These studies can improve public-health response to COVID-19 insofar as they are accompanied by an understanding of their implicit assumptions, including putative control of possible biases.
The remainder of this article is organized as follows. In Section 2, we provide a statistical perspective on the many potential biases affecting absolute and relative CFR estimation. In Section 3, we employ the general perspective to isolate some restricted contexts in which two naive estimators are unbiased, with implications on the need for contact tracing. In Section 4, we consider how model-based inference can expand the contexts in which unbiased estimation is possible. We provide an illustrative example, showing how an (approximate) maximum likelihood estimator from Reich et al. (2012) can be applied to correct bias from relative reporting rates of fatal and resolved cases using only surveillance data and an approximate horizon distribution of deaths. We discuss how the general principle of coping with incomplete data via Poisson approximation and a log-linear likelihood model can be more widely applied. In Section 5, we present results of this method on COVID-19 data. Finally, in Section 6, we give a mathematical justification for contact tracing as a data-collection methodology (Chowell et al., 2009;Eames and Keeling, 2003), and discuss how it would mitigate many of the problematic biases at their source.

Sources of Bias in COVID-19 Surveillance Data
In Figure 1, we present a graphical model that captures aspects of the data-generating process for COVID-19 surveillance data. Graphical models provide a general formal language for reasoning about the probabilistic and causal structure of collections of random variables (see, e.g., Jordan 2004). In this article, it suffices to think of the model in Figure 1 informally as a depiction of dependencies among population-level and sampled quantities in surveillance data. Our objective is to consider biases that may arise along each edge of the graph. Prior work on SARS, H1N1, H7N9, H5N1, MERS, and HIV has identified or even quantified many of these biases (Atkins et al. 2015;Woodruff et al. 2014;see Lipsitch et al. 2015 for a review). The diagram also depicts the population eligible to be sampled (the 'sampling frame') through collection of surveillance data from standard hospital reports and death certificates. Asymptomatic carriers are excluded from the sampling frame because testing is currently not recommended or available for asymptomatic individuals. We roughly categorize biases as: under-ascertainment of mild cases, time lags, interventions, group characteristics, and imperfect reporting and attribution. Extensive (but not comprehensive) discussion of the magnitude and direction of these biases corroborated by COVID-19-specific evidence is included in the following subsections.
These competing biases can often be expressed in terms of ratios of edge weights (conditional probabilities) from Figure 1. For example, with A → B denoting the value of the edge between nodes A and B, if the probability of reporting a fatal case is greater than reporting an infection (DF → RF > D → RC), then ignoring other biases, $E_{\text{naive}}$ will be upwardly biased by a factor $b > 1$. However, because biases can compete with one another, this does not mean $(1/b)E_{\text{naive}}$ is always a better estimator than $E_{\text{naive}}$. Speaking loosely, the bias incurred by under-ascertainment of asymptomatic cases could be $1/b$, canceling out the former bias. Accordingly, the total error of any estimator is indeterminate. Therefore, the use of estimators based on surveillance data requires being clear on underlying assumptions. Statisticians and public health officials should proceed with multiple estimation strategies, armed with an understanding of the accompanying biases, and they should endeavor to collect additional data that mitigates the biases (as we describe in Section 6).
Figure 1 conveys two important takeaways regarding the estimation of CFR. First and foremost, information about edges outside the sampling frame cannot be inferred from data within the sampling frame alone. Even within the frame, data is compromised by the biases listed in the subsections below. Each incoming edge to D can bias CFR estimates up or down, depending on its ratio to other incoming edges. This cannot be disentangled only by looking at the value of D. Secondly, even within the sampling frame, the relative values and time lags of D → DF and D → DR affect the estimation of CFR. However, these edges may be the only ones subject to correction using population-level surveillance data alone. This motivates our choice of an illustrative estimator of relative CFR, adapted from Reich et al. (2012). In particular, the estimator is based on assumptions under which estimation of relative CFR is possible while correcting for the relative values and time lags of D → DF and D → DR.
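The cancellation argument made earlier in this section, with the factor b and a competing bias of size 1/b, can be made concrete with a toy calculation. The sketch below assumes that biases act multiplicatively on the true CFR; this is an illustrative simplification, and the function name and all numerical values are hypothetical rather than taken from the data.

```python
def apply_biases(true_cfr, bias_factors):
    """Multiply the true CFR by each bias factor in turn (toy multiplicative model)."""
    estimate = true_cfr
    for factor in bias_factors:
        estimate *= factor
    return estimate


true_cfr = 0.02      # hypothetical true CFR
b = 2.0              # hypothetical upward bias from preferential reporting of deaths
competing = 1.0 / b  # hypothetical competing bias of size 1/b

# Both biases are present in the surveillance data, so they cancel.
e_naive = apply_biases(true_cfr, [b, competing])

# Correcting only the first bias leaves the second one uncompensated.
partially_corrected = e_naive / b

naive_error = abs(e_naive - true_cfr)
corrected_error = abs(partially_corrected - true_cfr)
print(f"naive error: {naive_error:.4f}, partially corrected error: {corrected_error:.4f}")
```

In this contrived setting the partially corrected estimate is further from the truth than the naive one, which is why the total error, and not any single bias, is the quantity to reason about.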
In the following subsections, we will cite evidence for the existence, magnitudes, and directions of certain biases from Figure 1. Our analysis was done in mid-April 2020. We categorize biases as resulting from one of five phenomena: under-ascertainment of mild cases, time lags, interventions, group characteristics, and imperfect reporting and attribution.

Figure 1. For simplicity, nodes in this diagram can be thought of as representing the number of people in the labeled state. Each edge represents a conditional probability of transitioning between states and has an associated time lag. S1 represents COVID-19 negative people with flu symptoms. S2 represents people with symptoms caused by COVID-19. S3 represents COVID-19 positive people with symptoms caused by another underlying health condition. The sampling frame of population surveillance data based on standard hospital reporting is in light blue; values of edges outside the sampling frame cannot be inferred only with data from within the sampling frame. Relative time lag introduces bias across all edges.

Under-ascertainment of Mild Cases
Diagnosing severe cases more often than mild cases will spuriously increase the CFR. In Figure 1, this bias corresponds most directly to spuriously increasing AC → U and S2 → U and/or decreasing AC → D and S2 → D. This bias may have high magnitude, since the true number of infections is likely to be several times as high as the reported number of cases in countries where testing is limited (Fauci et al., 2020). The significantly lower CFR in South Korea, a country with widespread testing, corroborates this explanation (Dong et al., 2020). A recent serology study from New York City (reported in Goodman and Rothfeld 2020) suggests the prevalence of COVID-19 is 21.2%, much higher than the confirmed case count. This indicates that the number of infections may be much larger than surveillance data implies globally.

Time Lags
Deaths and recoveries are reported after cases are confirmed, which artificially decreases the naive CFR (Wilson et al., 2020). More specifically, when a number of new cases is reported without delay, $E_{\text{naive}}$ becomes biased downward since the deaths yet to occur from the new cases will be missing from the numerator. In Figure 1, this means a time lag is incurred across edges D → DF and D → DR. The value of this time lag depends on how early on in the disease's course a patient is diagnosed. The median value of the lag across D → DF was estimated by Linton et al. (2020) in China in early February to be 6.7 days (Lognormal 95% CI: 5.3−8.3). Tracking individual cases and including them only after death or recovery would solve this problem. However, data on individuals is rare, as hospitals and/or governments generally report population-level data only. Fortunately, these edges are within the sampling frame, so we can have some hope of correcting the bias incurred by this time lag. In Section 4, we implement an estimator that handles this correction directly, and discuss the many further assumptions that must be made to assure its validity even within the sampling frame.
In reality, however, time lags affect every edge in Figure 1. The 'incubation period,' for example, creates lag across edges E → AC, S1, S2, and S3, and was estimated at median 4.3 days based on Linton et al. (2020). The time between onset and hospital admission, which is not perfectly represented by the graphical model, creates lag across edges S1, S2, S3 → D and had an estimated median of 1.5 days among recovered cases and 5.1 days among fatal ones (Linton et al., 2020). The large discrepancy between hospitalization lags of patients with different outcomes suggests the presence of an unknown bias factor. For example, earlier hospitalization may result in more effective treatment, canceling out the propensity of severe cases to seek care more quickly in the data collected by Linton et al.
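The downward bias from the lag across D → DF can be illustrated with a deterministic toy calculation. The sketch below assumes that every death is recorded a fixed number of days after the case is confirmed and that new cases grow geometrically; the fixed 7-day lag (loosely motivated by the 6.7-day median quoted above), the 10% daily growth rate, and the 2% true CFR are all hypothetical values chosen for illustration.

```python
def naive_cfr_with_lag(daily_cases, true_cfr, lag_days):
    """Expected naive CFR when every death is recorded lag_days after its case."""
    n_days = len(daily_cases)
    cumulative_cases = sum(daily_cases)
    # Only cases confirmed at least lag_days ago can have contributed deaths.
    resolved_cases = sum(daily_cases[: max(n_days - lag_days, 0)])
    return true_cfr * resolved_cases / cumulative_cases


growth = 1.1  # hypothetical 10% daily growth in new cases
daily_cases = [100 * growth ** t for t in range(30)]

growing = naive_cfr_with_lag(daily_cases, true_cfr=0.02, lag_days=7)
flat = naive_cfr_with_lag([100.0] * 30, true_cfr=0.02, lag_days=7)
no_lag = naive_cfr_with_lag(daily_cases, true_cfr=0.02, lag_days=0)

print(f"growing epidemic: {growing:.4f}, flat epidemic: {flat:.4f}, no lag: {no_lag:.4f}")
```

Under these assumptions the underestimate is more severe while the epidemic is growing, because a larger share of cumulative cases is too recent to have produced a recorded death.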
A complete discussion of the effects of all time lags across edges in Figure 1 is outside the scope of this article.

Interventions
Data collected after a recent government intervention targeted to lower transmission of COVID-19 could produce a spuriously increased CFR. One primary tool of governments is the imposition of social-distancing measures, which decrease the amount and initial dose of exposures, thereby lowering S → E. One incubation period after such a measure is enacted, the number of new cases will decrease, but the number of new deaths will not, since these deaths will be from cases diagnosed before the government intervention. This will upwardly bias the CFR for a few weeks after the intervention.
As in other pandemic influenzas, there may be a direct biological effect of increasing the infectious dose, leading to higher fatality rates (Paulo et al., 2010). Given an effective government intervention like social distancing, the infectious dose would decrease, directly lowering U → UF and D → DF and increasing U → UR and D → DR. This will upwardly bias current CFR estimates for some weeks after the initial intervention, since new deaths will still occur from cases whose onset time was before the intervention. The Centre for Evidence Based Medicine has a helpful page dedicated to COVID-19 viral dynamics like these (Heneghan et al., 2020).
Interventions to improve the quality of medical care can cause a drastic decrease in CFR, particularly when treatment options (e.g., drugs, blood transfusions, ventilators) become available or if training of health care workers (HCWs) improves. The effect can be highly pronounced in developing countries (Siddique, 1994) and is the subject of active study today (Hsiang et al., 2020;Warne et al., 2020). By the same logic as above, these interventions can lead to a spuriously increased CFR estimate. However, interventions that improve accessibility of medical care, such as the new health facilities being constructed around the world (Lardieri, 2020;Wang et al., 2020a), can also increase testing and reporting. This increase will likely result in better data in the long term due to higher ascertainment of mild cases, although we have no data to support this conjecture.

Group Characteristics
It is already well understood that certain groups have a higher CFR than other groups. In other words, the edges D → DF, D → DR, U → UF, and U → UR will have different values based on the characteristics of the sampled population, which could cause bias in either direction when estimating CFR. For example, the risk of death may be 34 to 73 times lower in people under 65 years old compared to those over 65. Furthermore, the incidences of comorbidities such as obesity, heart disease, smoking, genetics, and diabetes correlate with nation, socioeconomic status, race, sex, and more (Cai, 2020; Lee et al., 2014; Sliwa et al., 2008). In the context of surveillance data, without knowing the proportion of these groups in the sampling frame, which may not be uniform in time, the CFR can be biased in either direction. Chin et al. (2020) argue that incorporating county-level data about these covariates can result in a more equitable public-health response.

Imperfect Reporting and Attribution
Both the definition of a 'case' and the criteria under which an individual is eligible for testing can bias CFR estimates. Case definition, even within one nation, can change case counts dramatically. On February 12, for example, the Chinese government changed the definition of 'confirmed case' to include symptom-based diagnoses, resulting in a 600% increase in cases that day (Worldometer, 2020). Without information on how deaths were attributed beforehand, we do not know the magnitude of this bias. Serious problems have been introduced by poor reporting on behalf of governments. For example, the Johns Hopkins GitHub stopped providing surveillance data on recovered cases within the United States, because the quality of the data was too low (CSSEGISandData, 2020). Furthermore, because testing is often reserved for the most severe cases, S2 → D is inflated while AC → D is deflated (Mostahari and Emanuel, 2020). This will spuriously increase $E_{\text{naive}}$. Evidently, detailed knowledge of how cases, deaths, and recoveries are defined and reported is a prerequisite to understanding these biases, even if it will be impossible to correct for them without finer-grained data.
Sensitivity and specificity of COVID-19 tests certainly affect all of AC, S1, S2, S3 → D and AC, S1, S2, S3 → U. A diagnostic test with a high false discovery rate will increase S → S1, incorrectly inflating the denominator of $E_{\text{naive}}$ and spuriously decreasing the CFR. Nonetheless, assays have improved with time; the initial test developed by the U.S. Centers for Disease Control was ineffective (Sharfstein et al., 2020). Still, the serology assay used by Bendavid et al. (2020) had a putative sensitivity of 80% and specificity of 99.5%, which may still be too low to provide estimates of a small prevalence.
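To see why a sensitivity of 80% and a specificity of 99.5% can be inadequate at low prevalence, it helps to work through Bayes' rule. The sketch below computes the positive predictive value and the apparent (test-positive) prevalence; the three true prevalence values are hypothetical, and the function names are ours.

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """Probability that a positive test result comes from a truly infected person."""
    true_positives = prevalence * sensitivity
    false_positives = (1.0 - prevalence) * (1.0 - specificity)
    return true_positives / (true_positives + false_positives)


def apparent_prevalence(prevalence, sensitivity, specificity):
    """Expected fraction of tests that come back positive."""
    return prevalence * sensitivity + (1.0 - prevalence) * (1.0 - specificity)


SENSITIVITY = 0.80    # putative value quoted in the text
SPECIFICITY = 0.995   # putative value quoted in the text

for prevalence in (0.001, 0.01, 0.1):  # hypothetical true prevalences
    ppv = positive_predictive_value(prevalence, SENSITIVITY, SPECIFICITY)
    apparent = apparent_prevalence(prevalence, SENSITIVITY, SPECIFICITY)
    print(f"true prevalence {prevalence:.3f}: PPV {ppv:.3f}, apparent prevalence {apparent:.4f}")
```

When the true prevalence is comparable to or smaller than the false positive rate (0.5% here), most positive results are false positives, so the raw positive fraction says little about the true prevalence.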
Distinctly from under-ascertainment, reporting of infectious disease by health care providers in the United States is often incomplete and normally has a mean time delay of 12 to 40 days depending on the pathogen (Jajosky and Groseclose, 2004). This means edges D → RC, DF → RF, and DR → RR are not 1.0. Because deaths are more likely to be reported by health care providers than confirmed cases or recoveries, ignoring time delay, D → RC is less than DF → RF, biasing $E_{\text{naive}}$ upward. Depending on the relative time delays across these edges, estimators may be biased in either direction. For example, if D → RC is lagged more than D → RD, it would bias $E_{\text{naive}}$ downward during the growth phase of an epidemic. To our knowledge, these time delays, which occur on a hospital-by-hospital basis, have not been quantified, and it is not obvious in what direction they will skew. The magnitude of this bias could be quite large for COVID-19. On Friday, April 17, the Wuhan government reported 1,290 new fatalities, increasing their cumulative death toll by 50% in one day. They claimed the revision was because "medical workers . . . might have been preoccupied with saving lives, and there existed delayed reporting, underreporting, or misreporting" (Kuo, 2020). This is a salient example of a high-magnitude bias from reporting errors that we cannot correct, since we do not know at what time those deaths truly occurred.
Evidence from past epidemics also indicates this bias may be significant for COVID-19. Even for severe illnesses such as Hepatitis C, health care providers can underreport cases by up to 12x (Klevens et al., 2014). Historically, the magnitude of underreporting depends heavily on ease of reporting for HCWs (e.g., electronic vs. paper systems) and also mandatory reporting laws (Chorba et al., 1989;Panackal et al., 2002).
Finally, although survivorship bias may be small, misattribution of deaths (i.e., increased weight of S3 → D) may be significant. A recent JAMA article argued that COVID-19 positive patients with cardiac injury have a relative risk of death of 4.26 compared to those with no cardiac injury. Most of those patients also had abnormal electrocardiograms (Shi et al., 2020). Another case study described a healthy 53-year-old woman who tested positive for COVID-19, did not show any respiratory involvement, but developed acute myopericarditis with systolic dysfunction (Inciardi et al., 2020). Kidney involvement has also been found (Ronco and Reis, 2020). It is unclear how deaths in the presence of multiple diagnoses are being counted, and indeed, to which disease they should be attributed. Disentangling these relationships may be possible with regression on high-resolution clinical data. However, we have not seen this level of detail reflected in surveillance data. Comparisons with historical mortality data suggest tens of thousands of deaths are misattributed or unreported.

Naive Estimators
We access publicly available data courtesy of Johns Hopkins University, consisting of time-series data of recoveries, deaths, and confirmed cases stratified across several dozen groups (in this case, primarily geographic locations) (Dong et al., 2020). Our computations were performed on April 18, 2020. We denote cohorts or groups of cases by indices $g$, belonging to a set $G$. For example, $g$ could be 'people under 60 years of age,' or 'people in Wuhan.' For time points $t = 1, 2, \ldots, T = 41$, we collect daily data as follows: for each group $g \in G$ we collect $R^g_t$, $D^g_t$, and $C^g_t$, which correspond to the number of new recoveries, new deaths, and new cases reported on day $t$ within group $g$. We drop the group superscript $g$ for population quantities:

$$R_t = \sum_{g \in G} R^g_t, \qquad D_t = \sum_{g \in G} D^g_t, \qquad C_t = \sum_{g \in G} C^g_t.$$

An Estimator Based on Dividing Deaths by Cases
In early March 2020, the WHO estimate of the CFR, 3.4%, was widely reported (Ghebreyesus, 2020; Stelter, 2020). This estimate is obtained from a naive estimator; specifically, the raw proportion of deaths among confirmed cases. Formally, with sums taken over all days up to the date of estimation, as of March 6, 2020,

$$E_{\text{naive}} = \frac{\sum_{t} D_t}{\sum_{t} C_t} = 3.4\%.$$

As of April 18, 2020, $E_{\text{naive}}$ was 6.9%. However, as we establish in Appendix A, in a setting without time delays, the naive CFR is asymptotically unbiased for the true CFR if and only if the probability of reporting is the same for fatal and nonfatal cases. Moreover, it is unbiased in finite samples if and only if reporting is perfect. As discussed in Section 2, this is not true in the case of COVID-19. We also derive the finite-sample expectation of the estimator. Even asymptotically, the expectation of this estimator can become unboundedly far away from the true CFR as reporting goes to zero or the CFR goes to zero. The naive estimator requires no complex modeling or tuning parameters and is easy to interpret. As we argued in Section 2, there is no uniformly best method of measuring the CFR, and the naive estimator should be viewed as one in a constellation of estimators giving a heuristic idea of the causal CFR. Nonetheless, the naive estimator can be improved at little cost. Indeed, in this work, we suggest a simple correction that alleviates two problems with the naive estimator: the time lag between death and recovery, and time dependence in the reporting rates of fatal and nonfatal cases.
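The role of differential reporting can be sketched numerically. The code below implements the naive estimator as cumulative reported deaths over cumulative reported cases, together with its large-sample value under a simplified model that we assume for illustration: no time delays, each fatal case reported with probability report_fatal, each nonfatal case with probability report_nonfatal, and every reported fatal case also counted as a reported death. The function names and numerical values are hypothetical; the formal statement is the one in Appendix A.

```python
def naive_cfr(daily_deaths, daily_cases):
    """Cumulative reported deaths divided by cumulative reported cases."""
    return sum(daily_deaths) / sum(daily_cases)


def naive_cfr_limit(true_cfr, report_fatal, report_nonfatal):
    """Large-sample value of the naive CFR under the simplified reporting model."""
    reported_deaths = report_fatal * true_cfr
    reported_cases = reported_deaths + report_nonfatal * (1.0 - true_cfr)
    return reported_deaths / reported_cases


# Equal reporting probabilities recover the true CFR in the limit.
equal_reporting = naive_cfr_limit(true_cfr=0.01, report_fatal=0.5, report_nonfatal=0.5)

# Preferential reporting of fatal cases inflates the limit.
preferential_reporting = naive_cfr_limit(true_cfr=0.01, report_fatal=0.9, report_nonfatal=0.3)

print(f"equal reporting: {equal_reporting:.4f}")
print(f"preferential reporting of deaths: {preferential_reporting:.4f}")
```

In this sketch the limit depends on the reporting probabilities only through their ratio, which is consistent with the condition that reporting must be equal for fatal and nonfatal cases.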

An Estimator Based on Observed Outcomes
One can view the time lag in the numerator above (across the D → DF link) as a consequence of 'censoring' the data: a case has been identified, but the outcome is hidden. Methods for handling censored data have been studied for several decades in the statistical literature; in particular, in the context of the bootstrap (Efron, 1981). Although it is not the focus of our work, several others have already applied the bootstrap to COVID-19 data to find confidence intervals for other epidemiological parameters such as $R_0$ (Linton et al., 2020; Read et al., 2020). This should also be done for the CFR for COVID-19, as Jewell et al. (2007) did for SARS, although the structure of the data used in that work differs from the current setting.
There is also a very simple estimator that avoids censoring by using only observed data, namely:

$$E_{\text{obs}} = \frac{\sum_{t} D_t}{\sum_{t} (D_t + R_t)}.$$

The CFR calculated by this estimator is upwardly biased, and we will briefly discuss why. This estimator accounts for the inflation of the denominator in the naive estimator via the relative time lag between D → RC and D → DF. However, it assumes we observe the same fraction of recovered cases and fatal cases at the time of estimation. Thus, it has introduced a new bias: the relative reporting rate and time lag between D → DF and D → DR. We formalize the asymptotic inferential target of this estimator in Appendix B. Note that in all cases, $E_{\text{obs}} \geq E_{\text{naive}}$. In fact, $E_{\text{obs}}$ is exactly $3E_{\text{naive}}$ on April 18th. This large discrepancy is due to under-reporting of recoveries, specifically within the United States, as we note in Section 2. The United States has, as of April 19, roughly 40,000 deaths and 70,000 recoveries (Dong et al., 2020). Meanwhile, Spain has 20,000 deaths and 80,000 recoveries. Clearly, the reporting of recoveries in both nations is infrequent, and in the United States, it may be more than doubly so. The estimator $E_{\text{obs}}$ illustrates the dangers of correcting one of many biases without considering total error. The estimator $E_{\text{naive}}$ and the estimator we present in the next section do not use this recovery data.

Table 1.
Parameter: Definition
$\psi_{t,g}$: Probability of diagnosis, given death from COVID-19, onset time $t$, and group $g$.
$\phi_{t,g}$: Probability of diagnosis, given recovery from COVID-19, onset time $t$, and group $g$.
$p_{t,g}$: Probability of death, given onset time $t$, within group $g$.
$\eta_t$: Probability of death $t$ days after onset, given death occurs.
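As a rough numerical check on the recovery figures quoted above, the sketch below evaluates an observed-outcome estimator, which we take to be reported deaths divided by reported deaths plus reported recoveries (our reading of the estimator described in this subsection). The inputs are the rounded April 19 totals from the text; the function name is ours.

```python
def observed_outcome_cfr(total_deaths, total_recoveries):
    """Reported deaths as a share of all reported resolved cases."""
    return total_deaths / (total_deaths + total_recoveries)


# Rounded cumulative totals quoted in the text for April 19, 2020.
e_obs_us = observed_outcome_cfr(total_deaths=40_000, total_recoveries=70_000)
e_obs_spain = observed_outcome_cfr(total_deaths=20_000, total_recoveries=80_000)

print(f"United States: {e_obs_us:.3f}")
print(f"Spain: {e_obs_spain:.3f}")
```

Both values are far above any plausible CFR, which illustrates how strongly this estimator depends on how completely recoveries are reported.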

Likelihood Models
In this section, we describe a parametric model that, under several strong modeling assumptions, accounts for two biases: time-varying reporting and disease-delayed mortality. For definitions and discussion of our model parameters, see Table 1. With reference to Figure 1, the model accounts for the time-dependence of D → DF and D → DR (i.e., it models how the values of these conditional probabilities change as a function of time), and also for the time delay across those same edges. This model was previously used by Reich et al. (2012) for CFR estimation of influenza. It is a covariate-independent reporting model that assumes all nonfatal cases eventually recover, so it does not utilize the time series of recoveries. Similar parametric models have been used for CFR estimation during other pandemics (Ejima et al., 2012; Frome and Checkoway, 1985). When none of the biases in Section 2 other than the time dependence and time delay across D → DF and D → DR are large and the mathematical assumptions in the remainder of Section 4 are satisfied, this estimator has a smaller total error than $E_{\text{naive}}$ and $E_{\text{obs}}$, as evidenced by empirical evaluations in Reich et al. (2012).
Suppose that an individual is in group $g$ and has infection onset at time $t_{\text{on}}$. Such a case has three possible outcomes, whose probabilities we define in Equation 4. For further information, see also Table 2. First, the individual may eventually recover and be diagnosed. This occurs with probability $\rho^{(1)}_{t_{\text{on}},g}$, see Equation 4a. Secondly, they may eventually die, having been diagnosed. This occurs with probability $\rho^{(2)}_{t_{\text{on}},g}$, see Equation 4b. Thirdly, they may go undiagnosed, whatever the outcome. This occurs with probability $\rho^{(3)}_{t_{\text{on}},g}$, see Equation 4c.

Table 2.
Outcome: Diagnosed; Undiagnosed
Death: Included in scenario 2; Included in scenario 3
Recovery: Included in scenario 1; Included in scenario 3

$$\rho^{(1)}_{t_{\text{on}},g} = \phi_{t_{\text{on}},g}\,(1 - p_{t_{\text{on}},g}), \qquad \text{(4a)}$$
$$\rho^{(2)}_{t_{\text{on}},g} = \psi_{t_{\text{on}},g}\, p_{t_{\text{on}},g}, \qquad \text{(4b)}$$
$$\rho^{(3)}_{t_{\text{on}},g} = 1 - \rho^{(1)}_{t_{\text{on}},g} - \rho^{(2)}_{t_{\text{on}},g}. \qquad \text{(4c)}$$

Accordingly, at each onset time $t_{\text{on}}$ and for each group $g$, there are $N^{(1)}_{t_{\text{on}},g}$, $N^{(2)}_{t_{\text{on}},g}$, and $N^{(3)}_{t_{\text{on}},g}$ individuals who eventually recover, die, or go undiagnosed, respectively. Given a total number of cases within group $g$ with onset at time $t_{\text{on}}$, denoted $N^*_{t_{\text{on}},g}$, we model the outcomes via a multinomial model:

$$\left(N^{(1)}_{t_{\text{on}},g}, N^{(2)}_{t_{\text{on}},g}, N^{(3)}_{t_{\text{on}},g}\right) \sim \text{Multinomial}\left(N^*_{t_{\text{on}},g};\ \rho^{(1)}_{t_{\text{on}},g}, \rho^{(2)}_{t_{\text{on}},g}, \rho^{(3)}_{t_{\text{on}},g}\right), \quad \text{for all } t_{\text{on}} \text{ and } g. \qquad \text{(5)}$$
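A small simulation can make the three-outcome model concrete. The sketch below assumes the outcome probabilities take the form suggested by the parameter definitions in Table 1: diagnosed recovery with probability ϕ(1 − p), diagnosed death with probability ψp, and undiagnosed otherwise. It then draws one multinomial sample for a single onset time and group. The parameter values, function names, and random seed are hypothetical.

```python
import random


def outcome_probabilities(p_death, psi, phi):
    """Probabilities of (diagnosed recovery, diagnosed death, undiagnosed)."""
    rho_recover_diagnosed = phi * (1.0 - p_death)
    rho_die_diagnosed = psi * p_death
    rho_undiagnosed = 1.0 - rho_recover_diagnosed - rho_die_diagnosed
    return rho_recover_diagnosed, rho_die_diagnosed, rho_undiagnosed


def sample_outcomes(n_true_cases, probabilities, rng):
    """Draw one multinomial sample by classifying each true case independently."""
    counts = [0, 0, 0]
    for outcome in rng.choices(range(3), weights=probabilities, k=n_true_cases):
        counts[outcome] += 1
    return counts


# Hypothetical parameters for one onset time and one group.
rho = outcome_probabilities(p_death=0.02, psi=0.9, phi=0.5)
counts = sample_outcomes(10_000, rho, random.Random(0))

print("outcome probabilities:", rho)
print("simulated counts (recovered, died, undiagnosed):", counts)
```

Only the first two counts are visible in surveillance data, which is why the total number of true cases has to be approximated from the reported ones.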
We assume these are independent across onset times and groups. Furthermore, we assume knowledge of certain horizon probabilities. In order to define an estimator, we also need probabilities $\eta_{t,t_{\text{on}},g}$, for $t \geq 0$. These are probabilities that, given an individual is in group $g$ and has onset of infection at time $t_{\text{on}}$, they die $t$ days later. We make the assumption that $\eta_{t,t_{\text{on}},g} \equiv \eta_t$ for all $t_{\text{on}}$, $g$. That is, these probabilities are time- and group-invariant. See Reich et al. (2012) for further analysis and evaluation of this model.
Having stated the model, we now turn to the estimator. Let $N_{t_{\text{on}},g}$ denote the reported total number of cases of COVID-19 with onset at time $t_{\text{on}}$ in group $g$. Unfortunately, this is not the quantity of true interest. Instead, as mentioned above, we need $N^*_{t_{\text{on}},g}$, which is the number of both reported and unreported cases. Let $\mathbb{E}$ denote the expectation operator. In particular, $\mathbb{E}_{N^*_{t_{\text{on}},g}}$ will be an expectation with respect to the multinomial model in Equation 5 indexed by $N^*_{t_{\text{on}},g}$. As a simplifying assumption, we assume $p_{t_{\text{on}},g} \equiv p_g$; that is, the group-specific death probability or CFR is time-invariant. If the $p_g$ are small, then $N^*_{t,g} \approx N_{t,g}/\phi_{t,g}$, in which case from our multinomial (Equation 5), it is easy to check that:

$$\mathbb{E}_{N^*_{t_{\text{on}},g}}\left[N^{(2)}_{t_{\text{on}},g}\right] = N^*_{t_{\text{on}},g}\,\rho^{(2)}_{t_{\text{on}},g} \approx \frac{N_{t_{\text{on}},g}}{\phi_{t_{\text{on}},g}}\,\rho^{(2)}_{t_{\text{on}},g}.$$

In particular, if we assume that death is a rare event, then a Poisson approximation will be accurate:

$$N^{(2)}_{t_{\text{on}},g} \sim \text{Poisson}\left(\frac{N_{t_{\text{on}},g}\,\rho^{(2)}_{t_{\text{on}},g}}{\phi_{t_{\text{on}},g}}\right),$$

where $N_{t_{\text{on}},g}$ denotes the number of cases with onset at time $t_{\text{on}}$ within group $g$. In view of Equation 4b, this may be rewritten as:

$$N^{(2)}_{t_{\text{on}},g} \sim \text{Poisson}\left(\frac{N_{t_{\text{on}},g}\,\psi_{t_{\text{on}},g}\,p_g}{\phi_{t_{\text{on}},g}}\right).$$
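The claim that a Poisson approximation is accurate when death is rare can be checked numerically. The sketch below compares a binomial count of deaths with a Poisson distribution of the same mean using the total variation distance; the cohort size and the two death probabilities are hypothetical, and the binomial stands in for the death component of the multinomial model.

```python
import math


def binomial_pmf(k, n, p):
    """Probability of exactly k deaths among n cases with death probability p."""
    return math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)


def poisson_pmf(k, mean):
    """Poisson probability of k events, computed on the log scale for stability."""
    return math.exp(-mean + k * math.log(mean) - math.lgamma(k + 1))


def total_variation(n, p):
    """Total variation distance between Binomial(n, p) and Poisson(n * p)."""
    mean = n * p
    diff = sum(abs(binomial_pmf(k, n, p) - poisson_pmf(k, mean)) for k in range(n + 1))
    # Poisson mass above n has no binomial counterpart and counts fully as error.
    tail = 1.0 - sum(poisson_pmf(k, mean) for k in range(n + 1))
    return 0.5 * (diff + max(tail, 0.0))


tv_rare = total_variation(n=1000, p=0.01)    # death is a rare event
tv_common = total_variation(n=1000, p=0.3)   # death is not rare

print(f"rare deaths: {tv_rare:.4f}, common deaths: {tv_common:.4f}")
```

The distance grows with the death probability, which matches the requirement that the group-specific CFR be small for the approximation to hold.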
If we make either an assumption that the reporting rates are group-invariant, or that there is perfect fatal-case reporting, $\psi_{t,g} \propto \phi_{t,g}$, then it is possible to rewrite the model in the form:

$$\log \mathbb{E}\left[N^{(2)}_{t_{\text{on}},g}\right] = \log N_{t_{\text{on}},g} + \beta_0 + \alpha_{t_{\text{on}}} + \gamma_g,$$

where $\beta_0$ is a proportionality constant, $\gamma_g$ is a group-specific parameter (the relative CFR, on the log scale), and $\alpha_{t_{\text{on}}}$ is a time-specific parameter. Finally, given these values, along with the death probabilities $\eta_t$, an expectation-maximization scheme can be carried out to compute a maximum-likelihood estimator. For further details, see sections 3.2 and 3.3 of Reich et al. (2012). They show empirically that as long as $p_g$ stays below 0.05, and their assumptions are approximately satisfied, the estimated CFR has relative error < 0.1 as compared to the ground truth. Their results also indicate that this model is insensitive to various misspecifications, including the distribution of deaths, $\eta_t$. We confirm this in Figure 2 by sampling parameters of $\eta_t$ from their estimated confidence intervals (Linton et al., 2020).

Results
We report the results of our analysis of open-sourced COVID-19 data from Johns Hopkins, under the assumption that the reporting rates $\psi_t$ and $\phi_t$ are group-invariant. We contribute an open-source multithreaded implementation of Reich et al. (2012) and a plotting utility that will allow reproducibility of these results, as shown in Figure 2. Finally, we report the relative CFR of women to men in Germany and Belgium using sex-disaggregated data from Riffe (2020).

Estimates of Relative CFRs
The corrected relative CFRs, calculated for six combinations of nations, are shown in Figure 2. In some cases, such as the comparison between England (GBR) and Italy (ITA), our estimator flips the direction of the relative CFR. In other words, $E_{\mathrm{naive}}$ and $E_{\mathrm{obs}}$ suggest that England has a higher CFR than Italy, while $E_{\mathrm{Reich}}$ suggests otherwise. The same effect happens in the case of Switzerland (CHE) vs. Germany (DEU), with an additional shrinkage toward 1, indicating that the two CFRs are more similar than $E_{\mathrm{naive}}$ and $E_{\mathrm{obs}}$ would suggest. The estimate of the relative CFR for Spain (ESP) to South Korea (KOR) given by $E_{\mathrm{Reich}}$ is high at 30.27. Although we assumed in Section 4 that the relative CFR is constant in time, we report our results as a time series in Figure 2. We obtain this time series by calculating results as if we had run our estimator on every day from April 2, 2020, to April 16, 2020, using the cumulative data.

Using the data from Riffe (2020), we calculated the relative CFR of women to men in Germany and Belgium. In Germany, $E_{\mathrm{naive}} = 1.51$ and $E_{\mathrm{Reich}} = 1.14$ (sensitivity range 1.14–1.22). In Belgium, $E_{\mathrm{naive}} = 1.68$ and $E_{\mathrm{Reich}} = 1.25$ (sensitivity range 1.13–1.26). Time-series data on recoveries are not available, so we could not calculate $E_{\mathrm{obs}}$. We chose Germany and Belgium because these nations had about two months of seemingly reliable, day-by-day, sex-disaggregated data that roughly matched the numbers from Johns Hopkins. The dataset from Riffe was still under development at the time we ran these estimates.

Figure 2. The estimators $E_{\mathrm{naive}}$, $E_{\mathrm{obs}}$, and $E_{\mathrm{Reich}}$ presented as time series from April 2, 2020 to April 16, 2020. Our estimator, in red, implements the correction for time-dependent relative reporting rates between countries identified by their ISO abbreviations. Sensitivity of our results to misparameterization of $\eta_t$ is reported by setting $\eta_t$ to be a discretized gamma distribution with mean 12.8–17.5 and standard deviation 5.2–9.1, the lower and upper extremes of the 95% confidence intervals referenced (Linton et al., 2020). The ribbon shows the maximum and minimum values of the estimator $E_{\mathrm{Reich}}$ at each time point under any combination of these conditions. The expectation-maximization algorithm converged in all cases with negligible variance. We include the relative CFR of Spain to South Korea as an example of two countries for which our assumptions are particularly badly violated. Consequently, the method is unstable in that case (although we have no ground-truth data for confirmation). Notice that each plot has a different vertical axis scaling. We have included an orange line at a relative CFR of 1 to indicate the point at which two countries have the same estimated CFR; this provides a reference point between the plots.

Choosing $\eta_t$
As described in Section 4, we assume access to probabilities $\eta_t$ that indicate the probability of death occurring for a fatal case $t$ days post-onset of COVID-19. Since our model indexes time by day, we need to set $\eta_t$ for integers $t \ge 0$. Our choice of distribution is the best-fit discretized gamma distribution to the fatality time horizons from Chinese data (shape parameter $k = 4.726$ and scale $\theta = 3.174$) (Linton et al., 2020). These parameters were roughly consistent across several other studies (Mizumoto and Chowell, 2020; Wang et al., 2020b). We discretized the probability density function $\eta_t$ to the days $t = 0, \dots, T$. Formally, after selecting a mean parameter $t_{\mathrm{avg}} > 0$, we determine the probabilities $\eta_t$ by

$$\eta_t \propto t^{k-1} e^{-t/\theta}, \qquad t = 0, \dots, T. \tag{10}$$
Stated differently, for a given mean parameter $t_{\mathrm{avg}}$, we define a probability measure $\eta \in \mathbb{R}^T_+$ on $t \in \{0, \dots, T\}$, according to Equation 10. See Figure 3 for an illustration of this distribution.

Figure 3. Our method assumes knowledge of the probability of death for a fatal case $t$ days post-onset of COVID-19. These probabilities were estimated by fitting a gamma distribution to the fatality time horizons from Chinese data in early February, 2020 (Linton et al., 2020). We discretize the distribution by day and also truncate it to 25 days long, both for numerical stability and also because very few deaths occurred past this point in the real data. The mean time to death was 15.0 (95% CI 12.8–17.5). The standard deviation was 6.9 (95% CI 5.2–9.1), which we also used in our sensitivity analysis above. The three separate gamma distributions plotted above have different choices of mean value for illustrative purposes, to show the qualitative effect of changing the parameter.
In our experiments, we truncate at T = 25, both for numerical stability and also because very few deaths occur after 25 days in the data used to fit the gamma distribution.
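The discretization and truncation described in this section can be sketched in a few lines. This is a minimal illustration under the stated parameters, not the code from our repository; the function name `discretized_gamma` and the variable names are hypothetical.

```python
import math


def discretized_gamma(shape, scale, horizon):
    """Return [eta_0, ..., eta_T] with eta_t proportional to t**(shape - 1) * exp(-t / scale).

    The weights are evaluated at the integer days t = 0, ..., horizon and then
    normalized so that they sum to one (Equation 10 followed by truncation at T).
    """
    weights = [t ** (shape - 1) * math.exp(-t / scale) for t in range(horizon + 1)]
    total = sum(weights)
    return [w / total for w in weights]


# Shape and scale from Linton et al. (2020) as quoted in the text; T = 25.
eta = discretized_gamma(shape=4.726, scale=3.174, horizon=25)

# Mean delay implied by the truncated, discretized distribution.
mean_delay = sum(t * prob for t, prob in enumerate(eta))
```

Because the right tail beyond day 25 is removed before normalization, the mean of the truncated distribution is somewhat smaller than the 15.0 days of the untruncated gamma fit.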

Discussion
We emphasize again that the procedure that we have presented for estimation of relative CFR seeks to address only a subset of the biases that impinge upon the ascertainment of this important population-level parameter. We explicitly account for the time-dependence of reporting rates that may differ between fatal and nonfatal cases. We have separate time-dependent reporting rates for cases that will eventually be fatal or nonfatal, addressing the fact that reporting is higher among severe cases. Deaths are known to vary with some combination of health care quality and age, which can be quantified with a relative CFR estimate. Our covariate-independent reporting rate assumption likely does not hold in practice. Indeed, the relative CFR of Spain with respect to South Korea (two countries whose time-dependent reporting rates are probably different) yields a value of 30.27, likely speaking to the limitations of this method, although we do not have ground truth. Although Reich et al. (2012) present extensive experimental evaluations and some theory indicating that the method outperforms $E_{\mathrm{naive}}$ under given modeling assumptions, it is not generally possible to check how closely these assumptions hold, due to overparameterization of the unrestricted model. This issue may be mitigated by working with domain experts who understand each group's sampling and reporting patterns. Another issue is that our estimator uses parameters $\eta_t$ that are not estimated strictly from surveillance data but rather from individualized death times (Linton et al., 2020).

Figure 4. Equation A.5 was used to calculate the values in the table. Notice that $N$ is a function of $\delta/p$ (the acceptable relative error) and $q$ (the reporting rate). The empirical distribution functions of $E_{\mathrm{naive}}$ with different parameters of $p$ and $N$ and a reporting rate $q = 0.7$ are plotted. Notice that as $p$ decreases, detecting a case will require larger $N$.
We believe that the maximum-likelihood estimator that we have presented may provide a more valid correction of relative reporting rates between German women and men than between South Korean people and Italian people, given that reporting rates by sex may be closer to identical than reporting rates by country, although biases by sex still exist (Guerra-Silveira and Abad-Franch, 2013). Demographers have argued that releasing data stratified by sex, age, and other demographic groups would aid in understanding the spread and fatality rates of COVID-19 (Dowd et al., 2020). Although certain teams like Riffe (2020) are currently assembling these data, many agencies are reporting such strata infrequently or not at all, making data collection difficult. To our knowledge, there is no well-established data repository (like the Johns Hopkins repository) that contains time-series data of deaths and cases stratified by sex, age, and so on.
Many of the key biases that we reviewed in Section 2 remain unaddressed in current data-collection and data-analysis pipelines. Variations in the nature of the population within the sampling frame that gets tested, due to government- or geography-specific protocols, will cause any CFR estimate to be unreliable. In particular, details in the definitions of terms across countries and times can result in severe bias in time-series data; for example, China's explicit policy was not to report asymptomatic cases until April 1, 2020, when the policy changed (Jiang, 2020). Accounting for many of the biases we have discussed may be possible with great effort by many data analysts. However, it is equally important for the statistical community to channel much of its energy into a unified clarion call to governments: to obtain estimates to support consequential policy-making, we need more and better data.
Contact tracing is a particularly powerful way to obtain data that allow otherwise intractable biases to be controlled, since it expands the sampling frame to include a much larger portion of our target population, specifically mild cases. Contact tracing is the process of reaching out to all individuals ('contacts') who were recently exposed to a known COVID-19-positive individual, removing them from circulation, and monitoring their health. The same is done for contacts of contacts, and so on, for an appropriate number of iterations. We suggest that all of these contacts should be tested for COVID-19 one incubation period after exposure, regardless of whether or not they are symptomatic. The number of data points gleaned from this strategy will be lower than the number of data points from surveillance data. However, the population sampled using this strategy would be closer to the target population, since it would include asymptomatic cases. Furthermore, there is no issue with time lag, since these cases can be tracked systematically. Specifically, assume the nonresponse rate to contact tracing is identical for asymptomatic and symptomatic cases. As we prove in Appendix A, this is the exact condition under which $E_{\mathrm{naive}}$ is an asymptotically unbiased estimator. Moreover, the estimator has desirable finite-sample properties in such a setting. Letting $p$ be the true CFR and $q$ be the reporting rate among infected cases, the expectation of $E_{\mathrm{naive}}$ lies within $\delta$ of $p$ once the sample size reaches $N = \lceil \log(\delta/p) / \log(1-q) \rceil$; see Equation A.5 below. As seen in Figure 4, $N$ does not need to be too large to ensure that the bias of $E_{\mathrm{naive}}$ is small, although with small $p$, sampling any fatal cases requires $N$ to be on the order of $1/p$ in this simplified model.
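The sample-size statement above can be checked with a short calculation. The sketch below assumes independence of diagnosis and death ($r = 0$), in which case Appendix A gives a bias of $p(1-q)^N$; the function names `min_samples` and `naive_bias` and the example values are hypothetical.

```python
import math


def min_samples(p, q, delta):
    """Smallest N with p * (1 - q)**N <= delta, assuming r = 0 and 0 < q < 1."""
    if not (0 < q < 1 and 0 < delta < p):
        raise ValueError("requires 0 < q < 1 and 0 < delta < p")
    # Both logarithms are negative, so their ratio is positive.
    return math.ceil(math.log(delta / p) / math.log(1 - q))


def naive_bias(p, q, n):
    """Absolute bias |E[E_naive] - p| = p * (1 - q)**n when r = 0."""
    return p * (1 - q) ** n


# Example: true CFR of 1%, reporting rate of 70%, acceptable relative error of 10%.
n_needed = min_samples(p=0.01, q=0.7, delta=0.001)
```

Note that this bound concerns only the bias of the estimator; observing any fatal cases at all still requires a sample on the order of $1/p$, as discussed in the text.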
Contact tracing does not eliminate all biases. The assumption that nonresponse rates do not vary by case severity will not hold unless responses are mandatory, possibly introducing significant error, especially as $p$ becomes small. Assay sensitivity and specificity may still cause errors. Most importantly, care must be taken to make valid inferences about the desired target population based on individualized contact-tracing data that may come from a restricted sample. One major hurdle is estimation of $p$ when it is small: as shown in Figure 4, if death is a very rare event, $N$ would need to be large in order to ensure enough fatal cases are sampled. Finally, such data may be easier to collect and release in some countries and jurisdictions than others. For example, within the United States, medical privacy and consent laws may make it difficult to ever test a truly random sample of the population, or to release the fine-grained data necessary for corrected estimators. These challenges, outside the scope of our work, are well studied in the field of survey sampling and reweighting.

Disclosure Statement
The authors have no conflicts of interest to declare.

Appendix A Derivation of the Expectation of $E_{\mathrm{naive}}$
In this section we derive the expectation of $E_{\mathrm{naive}}$. Our derivation will employ a stripped-down notation, since here we deal with individual random variables for each COVID-19-infected person instead of time-series data. Index the infected population with the integers $\{1, \dots, N\}$. Let $T_i \sim \mathrm{Ber}(p)$ be a Bernoulli random variable representing whether or not person $i \in \{1, \dots, N\}$ died, and let $W_i \sim \mathrm{Ber}(q)$ be a Bernoulli random variable representing whether or not person $i$ was diagnosed with the virus. We want to estimate $p$, but we only have the reported number of deceased patients with COVID-19, $V_i = T_i W_i$, $i \in \{1, \dots, N\}$. Defining $r := \mathrm{Cov}(T_i, W_i)$, the joint distribution of $(T_i, W_i)$ can be expressed as a contingency table:

$$\begin{array}{c|cc}
 & W_i = 1 & W_i = 0 \\ \hline
T_i = 1 & pq + r & p(1-q) - r \\
T_i = 0 & (1-p)q - r & (1-p)(1-q) + r
\end{array}$$

Several of the results discussed in the main article follow as simple consequences of the calculation of the expectation of $E_{\mathrm{naive}}$: (1) $E_{\mathrm{naive}}$ is unbiased for $p$ in finite samples if and only if $q = 1$; (2) $E_{\mathrm{naive}}$ is unbiased for $p$ as $N \to \infty$ if and only if $r = 0$; (3) if there is an $\varepsilon > 0$ error in the estimation of $r$, for example due to incorrect attribution of fatalities to COVID-19, then $E_{\mathrm{naive}}$ has unbounded expectation as $q \to 0$ and unbounded relative error as $p \to 0$; and (4) if $r = 0$, the smallest $N$ such that $|E[E_{\mathrm{naive}}] - p| \le \delta$ is $N = \lceil \log(\delta/p) / \log(1-q) \rceil$.

The distribution of $V_i$ is Bernoulli with $P[V_i = 1] = r + pq$. Also define $\gamma_1 = P[W_i = 1 \mid T_i = 1] = (r + pq)/p$. Applying the tower property of conditional expectation and using the exchangeability of the $(T_i, W_i)$ pairs, we have:

$$E[E_{\mathrm{naive}}] = E\left[\frac{\sum_{i=1}^{N} V_i}{\sum_{j=1}^{N} W_j}\right] = N\, E\left[ E\left[ \frac{T_1 W_1}{W_1 + \sum_{j=2}^{N} W_j} \,\Big|\, W_2, \dots, W_N \right] \right].$$

Since $W_1$ is independent of $W_2, \dots, W_N$, we can express the sum in the denominator as a binomial random variable, $B \sim \mathrm{Bin}(N-1, q)$, since it is a sum of $N - 1$ i.i.d. Bernoulli random variables $(W_2, \dots, W_N)$ with parameter $q$. Note the fact that $E\left[\frac{1}{1+B}\right] = \frac{1 - (1-q)^N}{Nq}$.
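The binomial identity just quoted can be verified exactly by summing over the probability mass function. The sketch below is a numerical check only; the function names and the test values of $N$ and $q$ are hypothetical.

```python
import math


def expected_inverse_one_plus_binomial(n_trials, q):
    """Exact E[1 / (1 + B)] for B ~ Bin(n_trials, q), by summing over the pmf."""
    return sum(
        math.comb(n_trials, k) * q**k * (1 - q) ** (n_trials - k) / (1 + k)
        for k in range(n_trials + 1)
    )


def closed_form(n_people, q):
    """(1 - (1 - q)**N) / (N * q), the closed form quoted in the text, with N = n_people."""
    return (1 - (1 - q) ** n_people) / (n_people * q)


# B counts the diagnoses among the other N - 1 people, so it has N - 1 trials.
N, q = 20, 0.3
lhs = expected_inverse_one_plus_binomial(N - 1, q)
rhs = closed_form(N, q)
```

The two quantities agree up to floating-point rounding for any $N \ge 1$ and $0 < q \le 1$.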
Then, evaluating the innermost expectation first:

$$E[E_{\mathrm{naive}}] = N\, P[T_1 = 1, W_1 = 1]\; E\left[\frac{1}{1+B}\right] = N p \gamma_1 \cdot \frac{1 - (1-q)^N}{Nq} = \frac{p \gamma_1}{q}\left(1 - (1-q)^N\right).$$

Finally, substituting for $\gamma_1$, we obtain the final form:

$$E[E_{\mathrm{naive}}] = \left(p + \frac{r}{q}\right)\left(1 - (1-q)^N\right).$$

Recall that $q$ is the probability of reporting given an infection, and $p$ is the probability of death given an infection. Since $1 - q = 0$ implies $r = 0$ because $W$ becomes deterministic, $E_{\mathrm{naive}}$ is unbiased for $p$ if and only if $q = 1$. Moreover, if $q < 1$, taking $N \to \infty$ shows $E_{\mathrm{naive}}$ is asymptotically unbiased for $p$ if and only if $r = 0$. Both of these conditions are violated for any real disease. Interestingly, this empirical CFR is not constrained to be an underestimate, and can overestimate $p$ if

$$p \le \frac{r\left(1 - (1-q)^N\right)}{q(1-q)^N}.$$

Under the assumption $r \ge \varepsilon$, the (asymptotic) overestimate can be unboundedly bad. This may arise if there is an $\varepsilon$ error in estimating the covariance (which we assume to be nonnegative), because some people are diagnosed with COVID-19 but their death is not caused by COVID-19. In this context:

$$\lim_{N \to \infty} E[E_{\mathrm{naive}}] = p + \frac{r}{q} \ge p + \frac{\varepsilon}{q} \to \infty \quad \text{as } q \to 0.$$

In other words, as the rate of reporting, $q$, decreases or the covariance between death and reporting increases, the CFR estimate gets worse, ultimately becoming infinitely bad, as long as there is a spuriously positive relationship between death and diagnosis. Similarly, the ratio $E_{\mathrm{naive}}/p$ can become infinitely bad as the product $pq$ decreases. Note that if there is no spurious relationship, $r \to 0$ as $p \to 0$ or $q \to 0$ since $W$ and $T$ become deterministic under those conditions. In the case of COVID-19, neither $q$ nor $p$ is near zero, but the limiting case helps to exhibit the qualitative performance of the estimator: it becomes more bias-prone with smaller $p$ and $q$ and with larger $r$.

Finally, assume that $W$ and $T$ are independent, so $r = 0$. Then, $|E[E_{\mathrm{naive}}] - p| \le \delta$ implies that $p(1-q)^N \le \delta$, and by some simple algebra, $N \ge \log(\delta/p) / \log(1-q)$. Constraining $N$ to be the smallest $N \in \mathbb{N}$ such that $|E[E_{\mathrm{naive}}] - p| \le \delta$ gives

$$N = \left\lceil \frac{\log(\delta/p)}{\log(1-q)} \right\rceil. \tag{A.5}$$

Appendix B Derivation of the Asymptotics of $E_{\mathrm{obs}}$

We borrow notation and proof technique from Reich et al. (2012).
We use the same notation as the main article. This proof applies to the group-independent reporting-rate model. In Reich et al., a similar proof is shown that applies in the case of the constant-proportion assumption. In addition to the notation in the main article, define $d_{t,g} := E[D_{t,g}] = N^*_{t,g}\, p_g\, \psi_t$ and $r_{t,g} := E[R_{t,g}] = N^*_{t,g}(1 - p_g)\phi_t$. Also, introduce two nonrandom functions, $F_d : \mathbb{R}_+ \to [0, 1]$ and $F_r : \mathbb{R}_+ \to [0, 1]$, where $F_d(t)$ represents the fraction of confirmed, fatal cases who have died by time $t$. Similarly, $F_r(t)$ represents the fraction of confirmed, nonfatal cases who have recovered by time $t$. During an active outbreak, we have $F_d < 1$ and $F_r < 1$. Finally, define $T$ as the current time; all sums over time below have an upper limit of $T$ unless otherwise specified. We seek an asymptotic limit for $E_{\mathrm{obs}}$. By the weak law of large numbers, $D_{t,g}$ and $R_{t,g}$ converge to their expectations, so we have:

$$\frac{F_d(T)\, D_{t,g}}{N^*_{t,g}} \xrightarrow{p} \frac{F_d(T)\, d_{t,g}}{N^*_{t,g}} = F_d(T)\, p_g\, \psi_t,$$

and similarly,

$$\frac{F_r(T)\, R_{t,g}}{N^*_{t,g}} \xrightarrow{p} \frac{F_r(T)\, r_{t,g}}{N^*_{t,g}} = F_r(T)(1 - p_g)\phi_t. \tag{B.3}$$

Now we focus on the denominator. We have to introduce a "smoothness" assumption: the number of infected people at each timestep, $N^*_{t_1,g}$, has a constant ratio with respect to the number of infected people at each other timestep, $N^*_{t_2,g}$. In particular, $\lambda_{t_1,t_2,g}$ corresponds roughly to the growth rate of the disease. Although this quantity would vary based on many factors in a real setting, we assume it to be a constant here. In particular, as $N^*_{t_1,g} \to \infty$ and $N^*_{t_2,g} \to \infty$,

$$\frac{N^*_{t_1,g}}{N^*_{t_2,g}} \to \lambda_{t_1,t_2,g}.$$

Therefore, we have by Slutsky's theorem that:

$$\frac{F_d(T)\, D_{t_1,g} + F_r(T)\, R_{t_1,g}}{N^*_{t_2,g}} \xrightarrow{p} \lambda_{t_1,t_2,g}\left(F_d(T)\, p_g\, \psi_{t_1} + F_r(T)(1 - p_g)\phi_{t_1}\right).$$

(B.10)

This is clearly a biased estimator of $p_g$.