incidental-tutorial

incidental

This tutorial provides a walkthrough on how to use incidental to estimate incidence. We will use the Spanish flu mortality data provided with this package.

library(incidental)
library(ggplot2)

Data Format

The fit_incidence function requires two basic inputs:

The function assumes reported records are evenly spaced. Let’s examine how these values look for the flu data.

The data has daily mortality records for three locations (Indiana, Kansas, and Philadelphia). The delay distribution specifies the probability of a death occurring x days after infection.

Model Fitting

This package uses an empirical Bayes deconvolution method to fit an incidence curve from a series of recorded cases. The method uses a regularized Poisson likelihood function on a set of Poisson basis functions. We give basic use cases below.

Model Outputs

The fitted incidence models have three outputs:

Let’s plot these outputs for Kansas. The data set also contains data for Indiana and Philadelphia. The function, incidence_to_df extracts the date (Date), the reported cases (Reported), the MAP incidence curve (Ihat), 90% pointwise credible interval bands around the incidence curve (CI_low, CI_high), and the implied smooth case curve (Chat), from a model and the data set that it was trained on. Another function, plot, plots these values.

Advanced Usage

The tunable parameters have been set so that they are robust for the majority of COVID-19 case records in the USA. This makes fits for influenza mortality data somewhat overregularized. More advanced settings include changing basis function degrees of freedom and regularization parameter selection methods, and extrapolation parameters for right censoring. We note, however, that this method is sensitive to bad data in the last two to three weeks of the time series (zero counts, counts that are too high, etc). We recommend either cleaning that data or not using the incidence estimates from that period.

Linear Extrapolation

This package uses a Bayesian linear model as a mean function for an AR extrapolation to cope with right censoring. There are two parameters that control how that fit is conducted:

These values control right tail fits. Let’s try a few combinations on the Kansas data using the functions that we made above.

The parameters are stable across a wide variety of settings for most data, but more care is needed when there is steep growth or decline at the point of truncation.

Hyperparameter Tuning

The empirical Bayes model fits a regularized likelihood model on a spline basis. There are two main hyperparameters:

Each of these values are found by doing a sweep over a pre-specified set of possibilities. We can change those sets with the parameters:

Note that these can be set as scalars to specify a fixed degree of freedom or lambda. Models output the selected degrees of freedom and lambda. Let’s look at the values for Kansas and change the grids.

These parameter settings are fairly robust, but larger data sets may need higher degrees of freedom or smaller lambda values.

Hyperparameter Selection Methods

For each hyperparameter, fit_incidence offers three ways of selecting the “best” value from the candidate set:

The first two methods are model-based optimism approximations, which run very quickly. The third involves randomized training/validation splits of the data to find which hyperparameter values maximize (up to a certain threshold) held out log likelihood. The default selection method is bic for degrees of freedom, and val for lambda. The methods can be changed with the dof_method and lam_method flags. Let’s fit the Kansas data with AIC for both hyperparameters.

AIC and BIC are reasonable selection methods for degrees of freedom, but can lead to overfitting when used for lambda selection.

Validation Likelihood Hyperparameter Selection

Validation likelihood breaks the data into a training and a validation set. For each potential hyperparameter value, a model is fit on the training set and and a held out log likelihood is computed on the validation set found by thinning. This yields a vector of validation log likelihoods, \(v_{1:d}\), where \(d\) is the number of candidate parameters.

Blindly choosing the model with the highest held out log likelihood can lead to some overfitting on the validation set. Therefore, the validation method chooses the smallest degrees of freedom or largest lambda that produces a value within percent_thresh of the highest validation log likelihood. The hyperparameter index \(i \in \{1, \dots, d\}\) is selected by: \[\max \, i : \ \frac{\max(v) - v_i}{|\max(v)|} \leq \frac{\mathtt{percent\_thresh}}{100},\] where the indices are ordered from most regularization (lowest degrees of freedom or largest lambda) to least regularization.

The default is percent_thesh = 2; lower values will generally give less regularized models while higher values will give more regularized models.

The training/validation splits are random for this method, which can lead to instability if a single fit is used. Validation likelihood uses val_restarts fits, where the returned value is the average of the best lambdas or the median of the best degrees of freedom. Setting val_restarts to a lower value trades stability for speed.