Time series models are developed for predicting future values of a variable that when cumulated is subject to an unknown saturation level. Such models are relevant for many disciplines, but here attention is focused on the spread of epidemics and the applications are for coronavirus. The time series models are relatively simple but are such that their specification can be assessed by standard statistical test procedures. In the generalized logistic class of models, the logarithm of the growth rate of the cumulative series depends on a time trend. Allowing this trend to be time-varying introduces further flexibility and enables the effects of changes in policy to be tracked and evaluated.
Keywords: generalized logistic, Gompertz curve, Kalman filter, negative binomial distribution, score-driven models, stochastic trend
As an epidemic takes hold, reliable projections of its path and of its eventual size are important to health care providers and policy makers, in order to help them meet the challenge of organizing medical resources to meet treatment requirements. The many unknowns surrounding the COVID-19 virus make it particularly important and urgent to have reliable estimates of the future course of the daily numbers of new cases and deaths.
Epidemiological models that seek to estimate epidemic trajectories fall into two broad classes. One class, called compartmental models, consists of models that seek to be faithful to the details of the routes and processes of disease transmission. More specifically, they project how individuals in an initially susceptible population come to be exposed to the virus, potentially infected, and, if infected, either recover from the disease or die of it. This class of models is used to understand the consequences of policies, for example, a lockdown. But many aspects of the transmission and progress of the disease are unknown, and so these models, which are inevitably complex, are typically forced to rely on a large number of assumptions.
Time series models, on the other hand, simply use historical data to make predictions. This article develops a new class of time series models that reflect epidemic trajectories even when they change over time. Unlike most standard time series models, the new models are able to make good forecasts even before new cases and/or deaths reach their peak. Furthermore they are able to track the epidemic as human behavior changes, and can be used to evaluate the effects of changes in policy. Applying these models to the study of COVID-19 dynamics in the United Kingdom and Germany, we illustrate their relative simplicity and transparency, as well as their ability to produce good forecasts. Although the focus of the article is on the spread of epidemics, and specifically COVID-19, the models are applicable in a wide range of disciplines.
The progress of an epidemic typically starts off with the number of cases following an exponential growth path. Over time the growth rate falls and the total number of cases approaches a final level—the ‘leveling of the curve.’ Complex behavioral modeling of the progress of the disease, for example, by using the ‘semi-mechanistic Bayesian hierarchical model’ implemented by the team at Imperial College London, depends on many assumptions and unknowns; see Flaxman et al. (2020). Simple and transparent time series models may offer an alternative way of making predictions of the trajectory of the epidemic; see, for example, section 2 in Chowell et al. (2016), where such approaches are called “phenomenological.” Similar issues arise in economics where there is a contrast between calibrated and Bayesian models based on economic theory and data-based time series models. Avery et al. (2020) give an excellent review of these issues.
The progress toward an upper bound or saturation level can be taken on board with a sigmoid curve, such as the logistic
$(1.1) \ \ \ \ \ \mu (t)=\overline{\mu }/(1+\gamma _{0}e^{-\gamma t}),\text{ \ \ \ \ \ \ \ }% \gamma _{0},\gamma ,\overline{\mu }>0,\text{\ \ \ }-\infty <t<\infty ,$
where $\overline{\mu}$ is the final level, $\gamma$ is a rate of progress parameter, $\gamma _{0}$ takes account of the initial conditions, and $e$ is Euler’s number. Logistic curves can be shown to arise from a model of a simple epidemic; see, for example, chapter 2 in Daley & Gani (1999). More generally, sigmoid curves are used in many disciplines for a variety of applications, such as estimating the demand for new products and population growth of mammals subject to space and resource limitations. An early and influential discussion of growth curves was given by Gregg et al. (1964). More recently Panik (2014) describes growth curves and statistical methods for fitting them. Here we concentrate on an extension of the logistic curve called the generalized logistic (GL) or Richards curve; see pp. 78-80 in Panik (2014). The Gompertz curve emerges as an important special case. Figure 1 shows two Gompertz growth curves together with the corresponding changes as given by the first derivatives. The effect of what turns out to be a key parameter, $\gamma ,$ on the peak is evident.
By formulating a statistical model, parameters such as $\gamma_{0}, \overline{\mu }$ and $\gamma$ can be estimated from observations, denoted $Y_{t},$ made at discrete times $t=1,...,T,$ on $\mu (t)$. One option is to work directly with the level, by basing a time series model on a deterministic trend, as in (1.1); for example, Meade & Islam (1995) report fitting a variety of growth curves to telecommunications data, while Ciufolini & Paolozzi (2020) report fitting a deterministic trend to total coronavirus cases in Italy. The methods discussed in Panik (2014) are restricted to deterministic curves, possibly with a first-order autoregressive disturbance term. Our view is that, as with most economic and social time series, a deterministic trend fitted to the level is too inflexible. Instead we prefer to work with the change or the growth rate, with specification of the model informed by an assumption that the total follows a growth curve. The saturation level can be continually updated as new observations become available. A second advantage of working with the difference or the growth rate is that when logarithms are taken, estimation of the basic models derived from the GL class can be carried out by least squares regression, as proposed in Harvey (1984). Thus, the method is viable even for a small number of observations. Finally the deterministic time trend in the estimating equations can be replaced by a stochastic one. This is particularly effective with a Gompertz function. The flexible trend, which can be estimated by the Kalman filter, as in the STAMP package of Koopman et al. (2020), allows parameters to evolve over time. As a result, the model can adapt to significant events and changes in policy. Figure 2 shows predictions made for new cases of coronavirus in Germany. The predictions, which include a day of the week effect, were made using data up to the end of March 2020 and show a downward trend even though the series was only just reaching what subsequently turned out to be its peak. Moving beyond the peak signals that $R_{t}$, the effective reproduction number at time $t$, has dipped below one; see Flaxman et al. (2020) and Aronson et al. (2020).
When numbers are small, as is the case with deaths at the beginning or end of an epidemic, there is a strong argument for adopting a negative binomial distribution. Models formulated under an assumption of Gaussianity need to be modified accordingly and we show how this may be done using the score-driven approach described in Harvey (2013) and implemented in the TSL computer package of Lit et al. (2020).
Section 2 describes the GL class of growth curves, while Section 3 discusses estimation. These methods are then applied in Section 4 to data on coronavirus in U.K. hospitals and in Germany. Forecasts are set out and evaluated. The effects of policy, primarily the lockdown imposed on March 21 in the United Kingdom, are assessed in Section 5. There is concern about a potential second wave of infections as restrictions start to be eased and ways in which this might be monitored are discussed in Section 6. Section 7 concludes.
Let $\mu (t)\geq 0$ be a monotonically increasing function defined over the real line. The rate of change or ‘incidence curve’ is $d\mu (t)/dt\geq 0.$ The generalized logistic is
$(2.1) \ \quad\quad\quad \ \ \ \ \mu (t)=\overline{\mu }/(1+(\gamma _{0}/\kappa )e^{-\gamma t})^{\kappa },% \text{ \ }\gamma _{0},\gamma ,\kappa >0,\text{\ }$
where $\gamma$ is a growth rate parameter. The parameter $\kappa$ must be positive for there to be an upper asymptote; allowing $\kappa$ to be negative gives the class of general modified exponential (GME) growth curves. The logistic is obtained by setting $\kappa =1,$ while letting $\kappa \rightarrow \infty$ yields the Gompertz curve. When $\gamma _{0}$ is determined by the value of the curve at $t=0,$ it is
$(2.2) \ \ \ \ \quad\quad\quad\quad \ \gamma _{0}=\kappa \left[ (\overline{\mu }/\mu (0))^{1/\kappa }-1\right] .$
Differentiation yields
$(2.3) \ \ \ \ \quad \ \quad \quad \ln d\mu (t)/dt=\rho \ln \mu (t)+\delta -\gamma t,\text{ \ \ \ }$
where $\delta =\ln(\gamma_{0} \gamma/ \overline{\mu}^{\;1/\kappa})$ and $\rho =(\kappa +1)/\kappa ,$ so $0<\kappa <\infty$ implies $$$1< \rho <\infty .$ Alternatively, because $d\mu (t)/dt=g(t)\mu (t),$
$(2.4) \ \ \ \ \ \quad \quad \quad \ln g(t)=(\rho -1)\ln \mu (t)+\delta -\gamma t ,$
where $g(t)$ is the growth rate of $\mu (t).$ Note that $\rho -1=1/\kappa .$
The generalized logistic differential equation implied by Equation 2.1 is
$(2.5) \ \ \ \quad \ \quad \quad \ \frac{d\mu (t)}{dt}=\gamma \kappa \left[ 1-\left( \frac{\mu (t)}{\overline{\mu }}\right) ^{1/\kappa }\right] \mu (t).$
The term in square brackets is less than one and tends to zero as $\mu(t)\rightarrow \overline{\mu }.$ The growth rate implied by Equation 2.5 is
$(2.6) \ \ \ \ \ \quad \quad \quad g(t)=\gamma \kappa \left[ 1-\left( \frac{\mu (t)}{\overline{\mu }}\right)^{1/\kappa }\right] .$
When $\kappa =1$, Equation 2.5 is a Riccati equation.
The point of inflection on the growth curve is the point at which the number of new cases, $d\mu (t)/dt,$ peaks. Differentiating the logarithm of $d\mu (t)/dt=g(t)\mu (t)$ yields the condition
$(2.7) \ \ \quad \quad \quad \quad \ \ \ g(t)=-g_{g}(t),\text{\ }$
where $g_{g}(t)$ is the growth rate of the growth rate. It follows from Equation 2.4 that the point of inflection in the GL occurs when $g(t)=\gamma /\rho .$ The corresponding value of $\mu (t)$ is
$(2.8) \ \ \ \ \ \quad \quad \quad \mu ^{\ast }=\overline{\mu }\rho ^{-\kappa }=\overline{\mu }(\kappa /(\kappa+1))^{\kappa }.$
The change declines more slowly than it ascends when $\kappa >1$; see, for example, Figure 1. When $\gamma _{0}$ is determined by $\mu (0),$ as in Equation 2.2, the peak is at
$(2.9) \ \ \ \quad \quad \quad \quad \ \ t^{\ast }=\ln \gamma _{0}/\gamma ,\text{ \ \ }\gamma _{0}>1,$
so it comes forward as $\gamma$ increases.
The Gompertz curve is
$(2.10) \ \ \ \ \ \mu (t)=\overline{\mu }\exp (-\gamma _{0}e^{-\gamma t}),\text{ \ }\gamma_{0},\gamma >0,\text{\ \ \ \ }-\infty <t<\infty .$
When the GL is parametrized as in Equation 2.1, letting $\kappa \rightarrow \infty$ gives Equation 2.10. Taking the logarithm of $\mu (t)$ gives
$(2.11) \ \ \ \ \quad\quad\quad \ln \mu (t)=\ln \overline{\mu }-\gamma _{0}e^{-\gamma t}\text{ \ } ,$
leading to Equation 2.4 with $\rho =1,$ that is,
$(2.12) \ \ \ \ \ \quad\quad\quad \ln g(t)=\delta -\gamma t,\text{\ }$
where $\delta =$ $\ln (\gamma _{0}\gamma) .$ The point of inflection is given in Equation 2.9, which corresponds to $\mu ^{\ast }=\overline{\mu }/e\thickapprox 0.368\overline{\mu }$. The maximum value of the change is $0.368\gamma \overline{\mu },$ while $g(t)=\gamma .$ Letting $\kappa \rightarrow \infty$ in Equation 2.8, so that $\rho \rightarrow 1,$ also gives $\mu ^{\ast }=\overline{\mu }/e$.
Writing a GL growth curve as $\mu (t)=\overline{\mu }F(t)$ allows $F(t)$ to be interpreted as the cumulative distribution function (CDF) of the log of a Dagum distribution (see pp. 212-13 in Kleiber & Kotz, 2003). Hence, the corresponding probability density function (PDF), $f(t),$ is a special case of an Exponential Beta of the Second Kind (EGB2) distribution (see McDonald & Xu, 1995).
The Gompertz distribution written using Equation 2.10 has $f(t)=\gamma \gamma _{0}F(t)\exp (-\gamma t).$ Figure 1 shows two Gompertz PDFs with the red bold dotted curve corresponding to $\gamma _{0}=20$ and $\gamma =0.1$ and the blue dotted curve corresponding to $\gamma _{0}=20$ and $\gamma =0.15.$ The effect of increasing $\gamma$ is to raise the peak and bring it forward. Since $d\mu (t)/dt=\overline{\mu}f(t)$ the peak is brought down by a lower $\overline{\mu }.$
When estimating a deterministic growth curve, some researchers, for example, Ciufolini & Paolozzi (2020), prefer to use a sigmoid defined in terms of the Gaussian error function, which is the CDF of a Gaussian variate. Similarly, Murray (2020) use the Gaussian error function to model the logarithm of total coronavirus cases in U.S. states. It is difficult to see why the Gaussian error function, which has no closed form, might be preferred to the more flexible GL curve.
It follows from Equation 2.5 that
$(2.13) \ \ \ \quad\quad\quad\ \ d\mu (t)/dt=\overline{\mu }F(t)(1-F(t)^{1/\kappa }),$
where $\mu (t)=\overline{\mu }F(t)$. In a simple epidemic, $d\mu (t)/dt$ is proportional to a logistic growth curve, $F(t)(1-F(t))$. Allowing $\kappa$ to be other than one gives more flexibility and is a useful generalization if the model provides a good fit to the data. Indeed, complex mechanistic models of epidemics, with the population assigned to compartments labeled susceptible, infected, and recovered (SIR), often produce incidence curves that are positively skewed. An example is the model of Giordano et al. (2020), which is based on a system of eight differential equations.
In the observational model, the cumulative total at time $t-1$ replaces $\mu(t)$ and the (positive) change, $y_{t}=\Delta Y_{t}=Y_{t}-Y_{t-1},$ replaces $d\mu (t)/dt.$ Equation 2.3 leads to
$(3.1) \ \ \ \ \ \quad \ln y_{t}=\rho \ln Y_{t-1}+\delta -\gamma t+\varepsilon _{t},\text{ \ }\rho \geq 1,\text{ \ \ \ \ }t=2,...,T,$
where the disturbance terms $\varepsilon _{t}$ are assumed to be independently and identically distributed with mean zero and constant variance, $\sigma _{\varepsilon }^{2},$ that is, $\varepsilon _{t}\sim$ IID$(0,\sigma _{\varepsilon }^{2}).$ Subtracting $\ln Y_{t-1}$ from both sides gives the form corresponding to Equation 2.4, namely,
$(3.2) \ \ \ \quad\ \ \ln g_{t}=(\rho -1)\ln Y_{t-1}+\delta -\gamma t+\varepsilon _{t},$
where $g_{t}=y_{t}/Y_{t-1},$ although it may also be defined as $\Delta \ln Y_{t}.$ The parameters $\rho ,\delta$ and $\gamma$ can therefore be estimated by regression. If $\rho$ takes a specified value, $\ln y_{t}-\rho \ln Y_{t-1}$ is simply regressed on a constant and time trend. The estimators of $\delta$ and $\gamma$ are then efficient when the disturbance is Gaussian, that is, $\varepsilon _{t}\sim$ NID$(0,\sigma_{\varepsilon }^{2}).$
The observational model for the Gompertz curve is
$(3.3) \ \ \ \quad \quad\ \ \ln y_{t}=\ln Y_{t-1}+\delta -\gamma t+\varepsilon _{t},\text{ \ \ } t=2,...,T,$
or the simple time trend regression
$(3.4) \ \ \quad\quad \ \ \ \ln g_{t}=\delta -\gamma t+\varepsilon _{t},\text{ \ \ }t=2,...,T.$
A stochastic trend may be introduced into Equation 3.4 to give the dynamic Gompertz model
$\qquad\qquad \ln g_{t}=\delta _{t}+\varepsilon _{t},\text{\ \ }\varepsilon _{t}\sim \textrm{NID}(0,\sigma _{\varepsilon }^{2}),\text{ \ \ \ \ }t=2,...,T,$
where
$(3.5) \ \ \ \ \ \quad \begin{array}{lllll} \delta _{t} & = & \delta _{t-1}-\gamma _{t-1}+\eta _{t}, & \quad & \eta_{t}\sim \textrm{NID}(0,\sigma _{\eta }^{2}), \\ \gamma _{t} & = & \gamma _{t-1}+\zeta _{t}, & \quad & \zeta _{t}\sim \textrm{NID}(0,\sigma _{\zeta }^{2}),% \end{array}$
and the normally distributed irregular, level and slope disturbances, $\varepsilon _{t},$ $\eta _{t}$ and $\zeta _{t}$, respectively, are mutually independent. When $\sigma _{\eta }^{2}$ $=\sigma _{\zeta }^{2}=0$, the trend is deterministic, that is, $\delta _{t}=\delta -\gamma t$ with $\delta =\delta _{0}.$ On the one hand, when only $\sigma _{\zeta }^{2}$ is zero, the slope is fixed and the trend reduces to a random walk with drift. On the other hand, allowing $\sigma _{\zeta }^{2}$ to be positive, but setting $\sigma _{\eta}^{2}=0$ gives an integrated random walk (IRW) trend, which when estimated tends to be relatively smooth. The degree of smoothness depends on the signal-noise ratio, $q_{\zeta }=\sigma _{\zeta }^{2}/\sigma _{\varepsilon}^{2}$.
The STAMP package of Koopman et al. (2020) can estimate a stochastic trend using techniques based on state space models and the Kalman filter. The Kalman filter outputs estimates of the state vector $(\delta _{t},\gamma_{t})^{\prime }.$ Estimates of the state at time $t$ conditional on information up to and including time $t$ are denoted $(\delta _{t \shortmid t},\gamma _{t\shortmid t})^{\prime }$ and given by the contemporaneous filter; the predictive filter, which outputs $(\delta_{t+1 \shortmid t}\gamma _{t+1\shortmid t})^{\prime },$ estimates the state at time $t+1$ from the same information set$.$ It may sometimes be useful to review past movements by the smoother, which is the estimate of the state at time $t$ based on all $T$ observations in the series. Estimation of the unknown variance parameters is by maximum likelihood (ML). Tests for normality and residual serial correlation are based on the standardized innovations, that is, one-step ahead prediction errors, $v_{t}=y_{t}-\delta _{t\shortmid t-1},$ $t=3,...,T.$
A stochastic trend can be introduced into the more general GL model. However, unless $\rho$ is fixed, it may be hard to estimate in small samples.
The Kalman filter can be bypassed by adopting the reduced form, which comes from the innovations form of the Kalman filter; see pp. 3, 71 in Harvey (2013). For the GL curve, the steady-state innovations form is
$(3.6) \ \ \ \ \quad\quad\ \ln y_{t}=\rho \ln Y_{t-1}+\delta _{t\shortmid t-1}+v_{t},\text{ \ \ } t=3,...,T,$
or
$(3.7) \ \ \ \ \quad\quad \ \ln g_{t}=(\rho -1)\ln Y_{t-1}+\delta _{t\shortmid t-1}+v_{t},\text{ \ \ } t=3,...,T,$
where
$\qquad\qquad \begin{aligned} \delta _{t+1\shortmid t} &=\delta _{t\shortmid t-1}-\gamma _{t \shortmid t-1}+\alpha _{1}v_{t}, \\ \gamma _{t+1\shortmid t} &=\gamma _{t\shortmid t-1}+\alpha _{2}v_{t}.\end{aligned}$
The filter for an IRW has $\alpha _{2}=\alpha _{1}^{2}/(2-\alpha_{1})$ with $0\leq \alpha _{1}\leq 1;$ (see pp. 78-9 in Harvey, 2013). The implied value of $q_{\varsigma }$ is
$(3.8) \quad\quad \qquad\qquad q_{\varsigma }=\frac{\alpha _{1}^{4}}{(2-\alpha _{1})^{2}(1-\alpha _{1})}.$
The parameter $\alpha _{1}$ thus plays a role similar to that of the signal-noise ratio, $q_{\zeta },$ and can be estimated by ML. When it is zero, the model reverts to Equation 3.2.
Forecasts of future observations and an estimate of the final level can be obtained from the predictive recursions
$\begin{aligned} (3.9) \quad\quad\ \ \ \ \ \widehat{y}_{T+\ell \mid T} &=\text{\ }\widehat{\mu }_{T+\ell -1 \mid T}^{\rho }\exp (\delta _{T})\exp (-\gamma \ell ), \\ (3.10) \ \ \quad\quad \ \ \ \widehat{\mu }_{T+\ell \mid T} &=\widehat{\mu }_{T+\ell -1\mid T}+\widehat{y}_{T+\ell \mid T},\text{ \ \ \ \ }\ell =1,2,... \end{aligned}$
where $\delta _{T}$ is the level at time $T,$ that is, $\delta _{T}=\delta -\gamma T$ in Equation 3.1, and $\widehat{\mu }_{T\mid T}=Y_{T}.$ As $\ell \rightarrow \infty ,$ $\widehat{\mu }_{T+\ell \mid T}\rightarrow \overline{\mu }.$
Alternatively,
$\begin{aligned} (3.11) \ \ \ \ \ \quad\quad \widehat{g}_{T+\ell \mid T} &=\widehat{\mu }_{T+\ell -1\mid T}^{\rho -1}\exp (\delta _{T}-\gamma \ell ),\text{\ \ \ }\ell =1,2,..., \\ (3.12) \ \ \ \quad\quad \ \ \widehat{\mu }_{T+\ell \mid T} &=\widehat{\mu }_{T+\ell -1\mid T}(1+\widehat{g}_{T+\ell \mid T}), \text{ \ \ \ }\end{aligned}$
so that $\widehat{y}_{T+\ell \mid T}=\widehat{g}_{T+\ell \mid T}\widehat{\mu}_{T+\ell -1\mid T}$ and $\widehat{Y}_{T+\ell |T}=\widehat{\mu }_{T+\ell \mid T}.$
For the Gompertz model, where $\rho =1,$ the forecasts for the growth rate are simply
$(3.13) \ \ \ \ \ \quad\quad \widehat{g}_{T+\ell \mid T}=\exp (\delta _{T})\exp (-\gamma \ell ),\text{\ }$
so Equations 3.11 and 3.12 yield
$(3.14) \ \ \ \quad\quad \ \ \widehat{\mu }_{T+\ell \mid T}=\widehat{\mu }_{T\mid T}\prod\limits_{j=1}^{\ell }(1+\exp (\delta _{T})\exp (-\gamma j)).$
A future point of inflection is given at $\ell ^{\ast }=(\delta _{T}-\ln \gamma )/\gamma .$ When $\ell \rightarrow \infty ,$ $\widehat{\mu }_{T+\ell \mid T}\simeq Y_{T} \exp ( (\exp \delta _{T})/(\exp \gamma -1))).$ This follows from writing $\ln \big[ \prod_{j=1}^{\ell }\left(1+g_{T+j}\right) \big]$ as $\sum_{j=1}^{\ell }$ $\ln \left( 1+g_{T+j}\right) \simeq \sum_{j=1}^{\ell}g_{T+j}.$ Higher order terms can be neglected when $g_{T+j}$ is small, so $\sum_{j=1}^{\ell }g_{T+j}=\exp (\delta _{T})\sum_{j=1}^{\ell }\exp (-\gamma j)\rightarrow \exp (\delta _{T})/(\exp \gamma -1))$ as $\ell \rightarrow \infty$.
An assumption of normality for the disturbances in Equation 3.1 implies that, conditional on information at time $t-1,$ $y_{t}$ is log-normal. Thus there may be a case for adding $0.5\sigma ^{2}$ to $\delta _{T}$, where $\sigma^{2}$ is the prediction error variance. (When there is no stochastic trend, $\sigma ^{2}=\sigma _{\varepsilon }^{2}.)$ Predictive distributions of future observations may be obtained by simulation.
A stochastic model for the logistic curve can be based on Equation 2.6, as in Levenbach & Reuter (1976), by adding a serially independent Gaussian disturbance term so that
$(3.15) \ \ \ \ \quad\quad \ g_{t}=\gamma -(\gamma /\overline{\mu })\widehat{\mu }_{t\mid t-1}+\varepsilon _{t},\text{ \ \ }t=1,...,T,$
where $\widehat{\mu }_{t\mid t-1}$ is an estimate of $\mu _{t}$, such as $Y_{t-1}$, based on information at time $t-1$ and $\varepsilon _{t}$ is a serially independent Gaussian disturbance with mean zero and constant variance, $\sigma _{\varepsilon }^{2},$ that is, $\varepsilon _{t}\sim$ NID$(0,\sigma _{\varepsilon }^{2}).$ Regressing $g_{t}$ on $Y_{t-1}$ gives estimates of the key parameters $\gamma$ and $\overline{\mu }.$ Generalization of this approach is based on Equation 2.6 but leads to a nonlinear equation
$(3.16) \ \ \ \ \quad\quad \ g_{t}=\gamma \kappa -\left( \frac{Y_{t-1}}{\overline{\mu }}\right) ^{1/\kappa }+\varepsilon _{t},\text{ \ \ }t=1,...,T,$
which requires a search over the range of $\kappa$ if it is unknown. In the Gompertz case, estimation is as straightforward as for the logistic model because
$(3.17) \ \ \ \ \quad\quad \ g_{t}=\gamma \ln \overline{\mu }-\gamma \ln Y_{t-1}+\varepsilon _{t},\text{\ \ }t=1,...,T.$
Models based on the logarithm of the growth rate are preferred because they have better statistical properties. For example, the disturbance term is less likely to be heteroscedastic.
When $y_{t}$ is small, it may be better to specify its distribution, conditional on past values, as discrete. The usual choice is the negative binomial, which, when parameterized in terms of a time-varying mean, $\xi_{t\shortmid t-1},$ and a fixed positive shape parameter, $\upsilon ,$ has probability mass function (PMF)
$p(y_{t}|y_{t-1},y_{t-2},..)=\frac{\Gamma (\upsilon +y_{t})}{y_{t}!\Gamma (\upsilon )}\xi _{t\shortmid t-1}^{y_{t}}(\upsilon +\xi _{t\shortmid t-1})^{-y_{t}}(1+\xi _{t\shortmid t-1}/\upsilon )^{-\upsilon },\text{ \ \ \ }% y_{t}=0,1,2,...,$
with $\textrm{Var}_{t-1}(y_{t})=\ \xi _{t\shortmid t-1}+\xi _{t\shortmid t-1}^{2}/\upsilon$. An exponential link function ensures that $\xi_{t\shortmid t-1}$ remains positive and at the same time yields an equation similar to Equation 3.1:
$(3.18) \ \ \ \ \quad \ \ln \xi _{t\shortmid t-1}=\rho \ln Y_{t-1}+\delta -\gamma t,\text{ \ }\rho \geq 1,\text{ \ \ \ \ }t=3,...,T.$
A stochastic trend may be introduced into the model, as in subsection 3.2, by developing a filter for the time-varying trend similar in structure to that of Equation 3.6. Because the observations are not Gaussian, the dynamic conditional score (DCS) framework described in Harvey (2013) is used to give
$(3.19) \ \ \ \quad \ \ \ln \xi _{t\shortmid t-1}=\rho \ln Y_{t-1}+\delta _{t\shortmid t-1},\text{ \ \ }t=3,...,T,$
where $\delta _{t\shortmid t-1}$ is like the filter for the IRW in the Gaussian model, that is,
$\qquad\qquad\begin{aligned} \delta _{t+1\shortmid t} &=\delta _{t\shortmid t-1}-\gamma _{t \shortmid t-1}+\alpha _{1}u_{t},\text{ \ \ }& 0\leq \alpha _{1}\leq 1, \\ \gamma _{t+1\shortmid t} &=\gamma _{t\shortmid t-1}+\alpha _{2}u_{t},\text{\ \ \ }&\alpha _{2}=\alpha _{1}^{2}/(2-\alpha _{1}),\end{aligned}$
but with $u_{t}=y_{t}/\xi _{t\shortmid t-1}-1,$ which is the conditional score for $\ln \xi _{t\shortmid t-1},$ that is $\upsilon (y_{t}-\xi_{t\shortmid t-1})/(\upsilon +\xi _{t\shortmid t-1}),$ divided by the information quantity. The dynamic Gompertz model has $\rho =1.$ Estimation is by maximum likelihood.
Predictions of future observations and the final level can be obtained from Equations 3.9 and 3.10.
We began working on this project at the beginning of April 2020. At that time coronavirus was not as far advanced in the United Kingdom as in Italy, and our initial exploration was focussed on Italy. Figure 3 shows new cases, as measured by hospital admissions on the European Centre for Disease Prevention and Control (ECDC) website, in the United Kingdom and Italy; see the Appendix. Italy is seen to be clearly ahead and it is apparent that the decline is slower than the rise. This asymmetry is not consistent with a simple logistic model, Equation 1.1, and later analysis confirmed this to be the case for most other countries, including the United Kingdom.
Economic and social time series are typically subject to periodic variation, due to seasonal, day of the week, and other effects. Preliminary analysis of data for hospital admissions and deaths in Italy indicate a day of the week pattern and this is confirmed for the United Kingdom; one reason is that laboratory confirmation for the virus tends to slow down during weekends. Day of the week effects are initially included in the equations by the introduction of a single harmonic cycle with period seven; this possesses only two parameters as opposed to the six needed for a full set of dummy variables.
Although we estimated many models in early April, the first results reported here for the United Kingdom are those obtained on April 13, that is, with data up to April 12, starting on March 5. The models were updated on April 20 and 27. Unfortunately, the data available to us after April 29 were not consistent with the earlier data as they were not confined to just tests carried out on hospital admissions.
Table 1 shows April 13 estimates for the GL, Equation 3.2, the Gompertz, where $\rho$ is set to one, and the dynamic Gompertz where the trend is an IRW. The results for the logistic model are not reported because it gives an inferior fit and is rejected by a $t$ test on $\widehat{\rho };$ the standard error is in parentheses. The diagnostics presented are the Durbin-Watson (DW) statistic for residual serial correlation, a portmanteau $Q-$statistic for serial correlation based on $P$ autocorrelations, the Bowman-Shenton normality test statistic and a heteroscedasticity statistic consisting of the squares of the last third of the residuals over the first third.
There is a substantive reduction in the prediction error variance for all models when daily effects are included. The peak day is Friday, the same day as was found for Italy. The results for the GL are shown in the penultimate column and may be compared with those in the first column. The likelihood ratio statistic is which is significant at the 5% level of a $\chi _{2}^{2}$ distribution. Figure 4 shows the fit and residuals from the GL. Figure 5 shows the histogram and correlogram for the GL residuals but using data up to April 20.
Statistic | GL | Gompertz | Dynamic Gompertz | GL | Dynamic Gompertz* |
---|---|---|---|---|---|
$\gamma$ | 0.081 | 0.033 | 0.054 | 0.091 | 0.061 |
$\delta _{T}$ | -5.62 | -2.48 | -2.63 | -6.32 | -3.15 |
$\rho$ | 0.264 (0.153) | 1 | 1 | 0.323 (0.129) | 1 |
$q_{\varsigma }$ | 0 | 0 | 0.001 | 0 | 0.001 |
$\ln L$ | 38.350 | 37.765 | 39.368 | 42.590 | 55.595 |
$DW$ | 1.62 | 1.43 | 1.67 | 2.10 | 2.01 |
$Q(6)$ | 7.32 | 5.30 | 8.25 | Q(9)=9.38 | |
$Normality$ | 0.47 | 0.46 | 1.09 | 1.29 | 1.64 |
$Hetero$ $F$ | 0.37 | .72 | .45 | .52 | 0.45 |
^{Note}^{. Fixed parameters are in italics. The maximized log-likelihood is 1n }^{L}^{; note that the last column is based on a different set of data and so its likelihood value should not be compared to those in the other columns.}
The forecasts of new cases made on April 13 are shown in Figure 6, together with the actual values up to May 29. The upper line is the dynamic Gompertz, while the lower lines are the GL with and without the daily effect. The actual outcome turns out to lie between the dynamic Gompertz and the GL. As with the German predictions shown in Figure 2, the important point is that for both models the forecasts are moving downward even though the observations have barely peaked.
Predictions from the GL made on April 20 show little change from those made on April 13. The final level under current policies is estimated to be 186,000. By contrast the dynamic Gompertz predictions change significantly. They are still higher than the GL predictions, but the final level predicted (by the dynamic Gompertz model) is now 237,790, whereas on April 13 it was 308,960. The dynamic Gompertz predictions made on April 27 show little movement; the final predicted level is 253,800. Figure 7 shows the dynamic Gompertz predictions without the daily component, but the daily component was included when the models were estimated. The flexibility of the dynamic model resides in its ability to adapt to a situation in which the observations are falling less rapidly than indicated by the static GL.
The data for new cases in Germany are from the Robert Koch Institut (RKI) and constitute the confirmed cases of COVID-19 in all national hospitals and testing centers.1
Fitting with data for all of March give the results in Table 2 for a dynamic Gompertz model with daily effects. Estimation of a GL give $\widetilde{\rho }$ = 1.21 (0.33), so the $t$ statistic is only 0.64. The likelihood is smaller at 16.43. There is no evidence of residual serial correlation in either model and although the heteroscedasticity statistic indicates a diminishing variance, its value seems to be heavily influenced by just one observation (March 9) near the beginning of the series. The fit is shown in Figure 8.
Statistic | Dynamic Gompertz (April 1) | Dynamic Gompertz (May 7) |
---|---|---|
$\gamma$ | 0.077 | 0.062 |
$\delta _{T}$ | -2.41 | -5.15 |
$q_{\varsigma }$ | 0.0015 | 0.0015 |
$\ln L$ | 18.43 | 65.43 |
$DW$ | 2.02 | 2.01 |
$Q(9)$ + $Q(11)$ | 10.02 | 16.83 |
$Normality$ | 0.86 | 6.11 |
$Hetero$ $F$ | 0.42 | 0.22 |
The forecasts shown earlier in Figure 2 are quite remarkable in their accuracy over the next 36 days, that is, up to May 6; see also Figure 9. The observations had not yet started to go down on April 1, yet the sigmoid nature of the underlying growth curve means that the forecasts pick up the subsequent downward movement.
Fitting the dynamic Gompertz using data up to and including May 6 shows the model to be stable; see Table 2 and Figure 10. In particular, the signal-noise ratio is almost unchanged. The observations in April are less variable than in March so the residuals are much smaller. The slope appears to fall in early April but it then returns to a value similar to what it was at the end of March. In summary, the dynamic Gompertz model works extremely well.
Deaths in Germany behave in a way similar to new cases: Figure 11 shows the log growth rates. There are some zero values in the first three weeks of March, so the dynamic Gompertz is estimated using data from March 22. Because the numbers are smaller it is perhaps not surprising that the signal-noise ratio is estimated to be zero. In other words, we end up with the static Gompertz model.
The score-driven negative binomial model, introduced in subsection 3.4, can be estimated with the TSL package of Lit et al. (2020). Observations from March 11 up to, and including, May 6, are used. There are some zero values. Setting the $\rho$ parameter to unity gives $\widetilde{\alpha }=0$ —corresponding to a deterministic trend—and $\widetilde{\gamma }$ =0.071. The correlogram of the scores has a relatively high value at lag 7, indicating a daily effect. Modeling a fixed daily effect with dummy variables produces the fit in Figure 12, with the significantly increased log-likelihood. The parameter estimates are $\widetilde{\gamma }$ = 0.070, $\widetilde{\delta }_{T}$ = -4.14, and $\widetilde{\upsilon }$ = 13.25. It is reassuring that the values of $\widetilde{\gamma }$ and $\widetilde{\delta }_{T}$ are consistent with those reported for new cases. The total number of deaths on May 6 was 6,996 and the final total is predicted to be 8,714. When we revised our article on June 19 the total was 8,946 and daily deaths had fallen to very low numbers.
Up to April 27, the U.K. data set on deaths from coronavirus was restricted to deaths in hospitals. After April 28, deaths in the community were included and the earlier figure revised. Although the relationship between deaths and new cases before April 28 is similar to that in Germany, there is a disconnect after that date and the data on new cases are of very little help in predicting deaths.
Fitting a negative binomial model, with daily effects, to the series on U.K. deaths from March 17 to May 14 lead to an estimated $\alpha$ of 0.34 (0.05). The changing trend is consistent with the Gaussian models fitted to new cases; the implied value of $q_{\varsigma }$ is 0.007. The slope at the end of the series is 0.042, while $\widetilde{\upsilon }$ =29.91 (6.13). Figure 13 shows the forecasts of the underlying trend up to one month ahead. New observations after the forecasts were made are marked by dots.
The effect of significant interventions, which may be a policy change, such as the introduction of a lockdown, or an external event, such as the arrival of a cruise ship in a small port, may be modeled by intervention (dummy) variables. The difficulty is that the pattern of the response is rarely known and so it becomes difficult to obtain a meaningful estimate of the final effect. Nevertheless, some notion of the response can be obtained by making forecasts at the time the policy is thought to have become effective and comparing these forecasts with the actual outcome. In order to investigate this possibility, the dynamic Gompertz model was estimated on U.K. data up to and including March 31 (10 days after lockdown) using a fixed $q_{\zeta }$ of 0.001 (as estimated later with a larger sample and reported in Table 1). No day-of-the week effect was included, because the series is rather short. Figure 14 shows there is considerable overprediction. However, if the signal-noise ratio is increased to $q_{\zeta }$ = 0.01, so the most recent observations receive more weight, the predictions are excellent. The final level is now approximately 290,000 as against 1.8 million. New cases are at their maximum, with a value of 5,400, on April 18, whereas with $q_{\varsigma }$ = 0.001, they do not peak until May 20, when they reach 21,500. Although the variation in predictions is huge, the same is true of many of the large models where the output can be very sensitive to the assumptions made.
Information about the pattern of the response can also be obtained by fitting a dynamic Gompertz model and graphing estimates of the slope. Figure 15 shows filtered and smoothed estimates of slope in such a model based on data up to April 29. The filtered estimates are most informative as they show the evolving changes in the slope after the lockdown of March 21. As can be seen, the big falls occur at the beginning of April, with little or no change after mid-April. Figure 8 shows a very similar movement in Germany, but taking place a few weeks earlier.
Growth curves can be used to parameterize a gradual response to an intervention. A permanent change is captured with the CDF and a temporary one with the PDF. A logistic CDF gives a response curve $W(t)=1/(1+\gamma_{0}^{I}\exp (-\gamma ^{I}(t-t^{I})),$ where $t^{I}$ is the median. The $I$ superscript distinguishes the parameters $\gamma _{0}^{I}$ and $\gamma ^{I}$ from the ones used to model the time series itself. With $t^{L}$ and $t^{U}$ denoting the beginning and the end of the time span during which gradual response to the intervention occurs, the intervention dummies are defined by $w_{t}=0$ for $t<t^{L},$ $w_{t}=W(t)$ for $t=$ $t^{L},$ $t^{L}+1,...,t^{I},...,t^{U}$ and $w_{t}=1$ for $t=t^{U}+1,...,T,$ or even just by $w_{t}=W(t)$ for $t=1,...,T.$ With the Gompertz CDF the response is $W(t)=\exp (-\gamma _{0}^{I}\exp (-\gamma ^{I}(t-t^{I}));$ in this case the point of inflection, corresponding to the maximum change, comes before the median of the time span between $t^{L}$ and $t^{U}$. Figure 16 shows the CDFs for logistic and Gompertz distributions with the median set to 7; for logistic $\gamma ^{I}=0.5$ and for Gompertz $\gamma ^{I}=0.2$.
If the effect of a policy is to change $\gamma$ in the model, a slope intervention is needed in Equation 3.2. Thus
$(5.1) \ \ \ \ \quad \ \ln g_{t}=(\rho -1)\ln Y_{t-1}+\delta -\gamma t-\beta tw_{_{t}}+\varepsilon_{t},\text{ \ \ }t=1,...,T,$
but unless the sample is moderately large, $\rho$ will need to be fixed. When the full effect is realized, the slope on the time trend will have moved from $\gamma$ to $\gamma +\beta .$ A positive $\beta$ will lower the growth rate, $g_{t},$ the peak of the incidence curve, and the final level.
Fitting the static model in Equation 5.1 to new cases in the United Kingdom with $\rho =1$ and a logistic intervention, starting on March 26 and ending on April 12, gives an estimate of $\beta$ equal to 0.020 (0.004) and an estimate of $\gamma$ also equal to 0.020. The picture in Figure 17 is not unreasonable, but the estimate of the overall effect is 0.041, which may be a slight underestimate because minus one times the final slope in Figure 15 is close to 0.05. When the slope is allowed to be stochastic, $\beta _{T}+\gamma$ is estimated at 0.054. However, a stochastic slope risks some confounding with the intervention variable and in fact the estimate of $\beta$ is reduced to 0.014 (0.006). Although neither model is completely satisfactory, both give a significant coefficient for the intervention variable, albeit after a degree of data mining.
With the relaxation of a lockdown, the unwelcome prospect of a second wave of infections arises. Dynamic GL models can monitor this possibility by tracking changes in the growth rate of the incidence curve. This growth rate, $g_{y}(t),$ depends on a potentially time-varying $\gamma$ parameter, $\gamma (t)$, and the growth rate, $g(t).$ Differentiating the logarithm of $d\mu (t)/dt$ gives
$(6.1) \ \ \ \quad\quad\quad\ \ g_{y}(t)=g(t)+g_{g}(t) ,$
and for GL curves, it follows from Equation 2.4 that $g_{g}(t)=(\rho -1)g(t)-\gamma (t).$ In the discrete time dynamic Gompertz model, Equation 3.5, the negative of the growth rate of the growth rate, $g_{g}(t),$ is tracked by the filtered estimates of the slope, that is, $\gamma _{t\mid t},$ while the growth rate itself is tracked by the exponent of the filtered level, that is, $g_{t\mid t}=\exp (\delta _{t\mid t}).$ Figure 18 shows $\gamma _{t\mid t}$ and $g_{t\mid t}$ for Germany from a dynamic Gompertz model, together with the daily number of new cases. The maximum is identified as April 3 and after this date $g_{t\mid t}<\gamma _{t\mid t};$ see Equation 2.7. The possible onset of a second wave is raised if at some point $\gamma _{t\mid t}$ starts to fall below $g_{t\mid t}$, or below $\rho g_{t\mid t}$ in the general case. This would signal that the reproduction number, $R_{t},$ has moved up above one.
The filtered estimates, $g_{t\mid t}$ and $\gamma _{t\mid t}$, are obtained by discounting of past observations, with the rate of discounting depending on the signal-noise ratio, $q_{\varsigma }.$ When a new policy is implemented, $q_{\varsigma }$ may need to increase so that past observations are discounted at a faster rate, as illustrated in Figure 14. In these circumstances, the only viable tracking option may be to try a range of $q_{\varsigma }$ values, bearing in mind the risk of triggering a false alarm. However, if the effects of a policy are spread over a period of time, as with a gradual relaxation of a lockdown, a value of $q_{\varsigma }$ estimated with the complete sample may be perfectly satisfactory.
If $\gamma _{t\mid t}$ decreases, some idea of sampling variability is needed to assess the implications. This may be obtained from the standard deviation of $\gamma _{t\mid t},$ as given by the Kalman filter. The variation in $g_{t\mid t}$ is harder to determine, but since it is typically much smoother than $\gamma _{t\mid t}$, it is likely to be small in comparison to the variation in $\gamma_{t\mid t}.$
A new class of time series models is developed for predicting future values of a variable that when cumulated is subject to an unknown saturation level. Such situations arise in a wide range of disciplines. The models provide a simple and viable alternative, or complement, to the forecasts produced by large-scale mechanistic models.
Generalized logistic growth curves provide the basis for our models. Estimating equations for the logarithm of the growth rate of the cumulative variable provide a good fit to the data, as assessed by standard statistical tests. Such models feature a time trend that can be made time-varying using the Kalman filter. The Gompertz model is a special case that works particularly well with a stochastic trend and, when this modification is made, the fit is often better than with an unrestricted generalized logistic model with a deterministic trend. The dynamic Gompertz can adapt to changing conditions and tracking the slope, especially the filtered slope, can be informative about the effect of interventions.
Additional components can be included in the models. These include seasonal or day-of-the week effects. The latter turned out to be relevant for new cases and deaths in our application to coronavirus in the United Kingdom and Germany. The dynamic Gompertz model worked well for both countries but was particularly impressive for German new cases. Deaths were successfully modeled by a Gompertz model with a negative binomial conditional distribution and a dynamic equation driven by the conditional score.
Estimating a model up to the point at which the effect of an intervention is likely to make itself felt and then making (unconditional) forecasts provides information about the effects of the intervention. The effectiveness of this approach is illustrated by fitting a dynamic Gompertz model to U.K. data and then investigating the effect of the coronavirus lockdown. Further insight into the effect of the lockdown is given by tracking the filtered estimate of the growth rate of the growth rate. The impact of the lockdown can be estimated ex post by including a growth curve as an explanatory variable, thereby allowing the intervention to have a gradual response.
Finally, we suggest that the possibility of a second wave can be monitored by tracking the filtered estimates of new cases or deaths given by our model. Of course, if these methods are to be useful in practice they require reliable up-to-date observations on new cases, preferably at a disaggregated level. Implementing viable tracking procedures and relating them to current methods for tracking the reproduction number, $R,$ is the next phase in our research program.
We would like to thank Yucheng Bian, Anthony Kattuman, Dario Palumbo, Tom Pape, Mark Salmon, Stefan Scholtes, Jerome Simons, Robert Taylor, Steven Thiele, and Qingyuan Zhao for helpful comments and suggestions. We are also grateful to the Editor, Associate Editor, and three referees for their constructive comments. Jonas Knecht supplied valuable research assistance.
Andrew Harvey and Paul Kattuman have no financial or non-financial disclosures to share for this article.
Angelopoulos, A. N., Pathak, R., Varma, R., & Jordan, M. I. (2020). On identifying and mitigating bias in the estimation of the COVID-19 case fatality rate. Harvard Data Science Review, (Special Issue 1). https://doi.org/10.1162/99608f92.f01ee285
Aronson, J. K., Brasset, J., & Mahtani, K. R. (2020). When will it be over? An introduction to viral reproduction numbers. The Centre for Evidence-Based Medicine. https://www.cebm.net/oxford-covid-19-evidence-service/
Avery, C., Bossert, W., Clark, A., Ellison, G., & Ellison, S. F. (2020). Policy implications of models of the spread of coronavirus: Perspectives and opportunities for economists. Working Paper 27007. National Bureau of Economic Research. https://doi.org/10.3386/w27007
Chowell, G., Sattenspiel, L., Bansal, S., & Viboud, C. (2016). Mathematical models to characterize early epidemic growth. Physics of Life Reviews, 18, 66-97. https://doi.org/10.1016/j.plrev.2016.07.005
Ciufolini, I., & Paolozzi, A. (2020). Mathematical prediction of the time evolution of the COVID-19 pandemic in Italy by a Gauss error function and Monte Carlo simulations. The European Physical Journal Plus, 135, Article 355. https://doi.org/10.1140/epjp/s13360-020-00383-y
Daley, D., & Gani, J. (1999). Epidemic modelling: An introduction. Cambridge University Press. https://doi.org/10.1017/CBO9780511608834
Flaxman, S., Mishra, S., Gandy, A., Unwin, H. J. T., Coupland, H., Mellan, T. A., Zhu, H., Berah, T., Eaton, J. W., Guzman, P. N. P., Schmit, N., Callizo, L., Team, I. C. C.-19 R., Whittaker, C., Winskill, P., Xi, X., Ghani, A., Donnelly, C. A., Riley, S., … Bhatt, S. (2020). Estimating the number of infections and the impact of non-pharmaceutical interventions on COVID-19 in 11 European countries. Report 13, Imperial College London, MRC Centre for Global Infectious Disease Analysis. https://www.imperial.ac.uk/mrc-global-infectious-disease-analysis/covid-19/report-13-europe-npi-impact/
Giordano, G., Blanchini, F., & Bruno, R. (2020). Modelling the COVID-19 epidemic and implementation of population-wide interventions in Italy. Nature Medicine, 26(6), 855–860. https://doi.org/10.1038/s41591-020-0883-7
Gregg, J. V., Hossell, C. H., & Richardson, J. T. (1964). Mathematical trend curves: An aid to forecasting. Imperial Chemical Industries Monograph no. 1. Oliver and Boyd.
Harvey, A. C. (1984). Time series forecasting based on the logistic curve. Journal of the Operational Research Society, 35(7), 641–646. https://doi.org/10.1057/jors.1984.128
Harvey, A. C. (2013). Dynamic models for volatility and heavy tails: With applications to financial and economic time series. Econometric Society Monograph. Cambridge University Press. https://doi.org/10.1017/CBO9781139540933
Kleiber, C., & Kotz, S. (2003). Statistical size distributions in economics and actuarial sciences. Wiley. https://doi.org/10.1002/0471457175
Koopman, S. J., Lit, R., & Harvey, A. C. (2020). STAMP 8.4: Structural time series analyser, modeller and predictor. Timberlake Consultants.
Levenbach, H., & Reuter, B. E. (1976). Forecasting trending time series with relative growth rate models. Technometrics, 18(3), 26–72. https://doi.org/10.1080/00401706.1976.10489446
Lit, R., Koopman, S. J., & Harvey, A. C. (2020). Time series lab - Score edition. https://timeserieslab.com
McDonald, J. B., & Xu, Y. J. (1995). A generalization of the beta distribution with applications. Journal of Econometrics, 66(1–2), 133–152. https://doi.org/10.1016/0304-4076(94)01612-4
Meade, N., & Islam, T. (1995). Forecasting with growth curves: An empirical comparison. International Journal of Forecasting, 11(2), 199–215. https://doi.org/10.1016/0169-2070(94)00556-R
Murray, C. J. L. (2020). Forecasting COVID-19 impact on hospital bed-days, ICU-days, ventilator-days and deaths by US state in the next 4 months. MedRxiv. https://doi.org/10.1101/2020.03.27.20043752
Panik, M. J. (2014). Growth curve modeling: Theory and applications. Wiley.
Surveillance data is the primary source of information for study of the dynamics of the pandemic. The data we use in this article on time series of confirmed cases and deaths is sourced from the European Centre for Disease Prevention and Control (ECDC), which is a European Union agency with responsibility for EU-level surveillance of infectious diseases. Since the beginning of the pandemic, through an Early Warning and Response System, ECDC has obtained daily numbers of confirmed cases of COVID-19 and deaths attributed to the virus from EU/EEA Member States and the United Kingdom.
ECDC’s data collection involves systematic screening of daily reports from health authorities by a team of epidemiologists, and an extensive validation process. The screened and validated daily reports are documented in a database, and the daily extract consisting of the total number of diagnosed COVID-19 infections and the number of deceased patients for each country is published on the ECDC website.
It is important to understand the possible sources of biases in the COVID-19 surveillance data. Angelopoulos et al. (2020) provide a systematic and insightful review of the many sources of bias in the time series of deaths, cases, and recoveries relating to COVID-19. Their particular focus is on implications for the estimation of relative case fatality ratios between groups and countries. They show how some, though not all, biases can be addressed. Even when biases cannot be addressed, it is important to know their sources and implications.
In terms of the time series of cases, the definition of a ‘case’ can vary from country to country, and may also change over time. Confirmation of cases is affected by the testing regime—test availability, and the criteria under which an individual is eligible for testing—which is liable to change over time. In the early days of the pandemic, severe cases will have been diagnosed more often than milder cases. True numbers of infections are likely to be much higher than reported in countries where testing is limited. Biases also arise from imperfect or delayed reporting and errors in reporting.
Overall, deaths are more likely to be reported accurately by health care providers, but sources of bias remain. Time delays in reporting and the misattribution of deaths are not uncommon. In addition, national reporting guidelines during the early days are often focused on hospital deaths to the exclusion of deaths in community and care homes.
We focus on the United Kingdom and Germany due to our familiarity with the health data infrastructures and validation procedures in these countries. The data we use in this article can be accessed from:
https://www.ecdc.europa.eu/en/publications-data/download-todays-data-geographic-distribution-covid-19-cases-worldwide
©2020 Andrew Harvey and Paul Kattuman. This article is licensed under a Creative Commons Attribution (CC BY 4.0) International license, except where otherwise indicated with respect to particular material included in the article.