Improved state-level influenza nowcasting in the United States leveraging Internet-based data and network approaches

Data acquisition

Three data sources were used in our models: influenza-like illness rates from ILINet, Internet search frequencies from Google Trends, and electronic health records from athenahealth, as described below. Weekly information from each data source was collected for the time period of October 4, 2009 to May 14, 2017.

Influenza-like illness rates

Weekly influenza-like illness rates reported by outpatient clinics and health providers for each available state were used as the epidemiological target variable of this study. The weekly rate, denoted %ILI, is computed as the number of visits for influenza-like illness divided by the total number of visits. Data from October 4, 2009 to May 14, 2017 for 37 states were obtained from the CDC. For inclusion in the study, a state must have data from October 2009 to May 2016, with no influenza seasons (week 40 of one year to week 20 of the next) missing. Some states were missing data, usually because they did not report in the off-season (between week 20 and week 40 of each year). Missing or unreported weeks, as well as weeks where 0 cases were reported, were excluded from analysis on a state-by-state basis. Although %ILI values for a given week may be revised in subsequent weeks when reported in real time, we only had access to the revised versions of these historical values.
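The rate computation and the exclusion rule described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the sample counts, and the choice to express the ratio as a percentage (multiplying by 100) are assumptions made here.

```python
from typing import Optional


def percent_ili(ili_visits: Optional[int], total_visits: Optional[int]) -> Optional[float]:
    """Return %ILI for one state-week, or None if the week should be excluded."""
    # Missing or unreported weeks are excluded from the analysis.
    if ili_visits is None or total_visits is None or total_visits <= 0:
        return None
    # Weeks with 0 reported cases are excluded as well.
    if ili_visits == 0:
        return None
    # Assumption: the ratio is reported as a percentage.
    return 100.0 * ili_visits / total_visits


# Hypothetical (ILI visits, total visits) pairs for four weeks in one state.
weekly_counts = [(12, 800), (0, 750), (None, None), (30, 1000)]
rates = [percent_ili(i, t) for i, t in weekly_counts]
usable = [r for r in rates if r is not None]
```

Returning None for excluded weeks keeps the weekly index aligned across states, so exclusion can be applied state by state when the training sets are built.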

Internet search frequencies

Search volumes for specific queries in each state were downloaded through Google Trends, which returns values in the form of frequencies scaled by an unknown constant. While our pipeline used the Google Trends API for efficiency, search volumes can be publicly obtained from www.trends.google.com for reproducibility. Relevant search terms were identified by downloading a complete set of 287 influenza-related search queries for each state, and keeping the terms that were not completely sparse. Because Google Trends left-censors data below an unknown threshold, replacing values with 0, a query with high sparsity indicates low frequency of searches for that query within the state.
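The sparsity filter can be illustrated with a short sketch. It assumes each query's series is a list of weekly values in which censored weeks appear as 0; the query names and numbers are invented for the example.

```python
def sparsity(series):
    """Fraction of weeks reported as 0, i.e. left-censored by Google Trends."""
    return sum(1 for v in series if v == 0) / len(series)


def keep_non_sparse(term_series):
    """Keep only the queries whose series is not completely sparse (not all zeros)."""
    return {term: s for term, s in term_series.items() if sparsity(s) < 1.0}


# Hypothetical weekly search frequencies for one state.
example = {
    "flu symptoms": [0, 12, 35, 60, 22],
    "influenza a": [0, 0, 0, 8, 0],
    "rare query": [0, 0, 0, 0, 0],
}
kept = keep_non_sparse(example)
```

A query that is censored in most weeks but not all of them, such as the second one here, is retained; only fully censored queries are dropped.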

In an ideal situation, relevant search queries at the state-level resolution would be obtained by passing the historical %ILI time series for each state into Google Correlate, which returns the most highly correlated search frequencies to an input time series. However, such functionality is only supported at the national level, at least in the publicly accessible tool. Given this limitation, we used two strategies to select search terms:

(1) An initial set of 128 search terms was taken from previous studies tracking influenza at the US national level (ref. 6).

(2) To search for additional terms, we submitted multiple state %ILI time series to Google Correlate and extracted influenza-related terms, under the assumption that some of the state-level terms would show up at the national level.

To minimize overfitting on recent information, the %ILI time series submitted to Google Correlate were restricted to 2009–2013. State-level search frequencies for the union of these terms and the 128 previous terms were then downloaded from the Google Trends API, resulting in 282 terms in total (Supplementary Table 3).

Electronic health records

Athenahealth is a cloud-based provider of electronic health records, medical billing, and patient engagement services. Its electronic health records system is currently used by over 100,000 providers across all 50 states. Influenza rates for patients visiting primary care providers over a variety of settings, both inpatient and outpatient, are provided by athenahealth weekly, each Monday. Three types of syndromic reports were used as variables: ‘influenza visit counts’, ‘ILI visit counts’, and ‘unspecified viral or ILI visit counts’, which were converted into percentages by dividing by the total patient visit counts for each week. The athenahealth network and influenza rate variables are detailed in Santillana et al. (ref. 24).

Google Flu Trends

In addition to the above data sources, we downloaded Google Flu Trends (GFT) estimates as a benchmark for our models. GFT provided a public influenza prediction system for each state until its discontinuation in August 2015 (ref. 29). GFT values were downloaded and scaled using the same initial training period of 104 weeks used in all of our models (October 4, 2009 to September 23, 2012).
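The text does not specify how the GFT values were scaled. One plausible reading, sketched below purely as an assumption, is an ordinary least-squares linear fit of %ILI on GFT over the training period, applied afterwards to the whole GFT series. The function names and numbers are illustrative.

```python
def fit_linear_scaling(gft_train, ili_train):
    """Least-squares slope and intercept mapping GFT values onto %ILI (assumed method)."""
    n = len(gft_train)
    mean_x = sum(gft_train) / n
    mean_y = sum(ili_train) / n
    sxx = sum((x - mean_x) ** 2 for x in gft_train)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(gft_train, ili_train))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def scale_gft(gft_series, slope, intercept):
    """Apply the fitted scaling to any stretch of the GFT series."""
    return [slope * x + intercept for x in gft_series]


# Hypothetical training-period values for one state.
gft_train = [1000.0, 2000.0, 3000.0, 4000.0]
ili_train = [1.0, 2.0, 3.0, 4.0]
slope, intercept = fit_linear_scaling(gft_train, ili_train)
scaled = scale_gft([5000.0], slope, intercept)
```

Fitting the scaling only on the training period keeps the benchmark free of information from the weeks on which it is later scored.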

ARGO model

The time series prediction framework ARGO (AutoRegression with General Online information) issues influenza predictions by fitting a multivariable linear regression each week on the most recent available Internet predictors and the previous 52 %ILI values. Because many of these variables are potentially redundant, L1 regularization (Lasso) was applied to produce a parsimonious model by setting the coefficients of weak predictors to 0. The model was re-trained each week on a shifting 104-week training window in order to adapt to the most recent 2 years of data, and the regularization hyperparameter was selected using 10-fold cross-validation on each training set. Details about the ARGO model and its applicability in monitoring infectious diseases such as influenza, dengue, and Zika are presented in previous work (refs. 5, 25, 26). Refer to the Supplement for a detailed mathematical formulation of ARGO.
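A dependency-free sketch of the two ingredients described above, the feature row and the L1-penalized fit, is given below. It is not the authors' implementation, which they report was built on scikit-learn. The coordinate-descent solver is a hand-written stand-in, the penalty `lam` is fixed instead of being chosen by 10-fold cross-validation, and all names and the toy data are assumptions.

```python
def argo_design_row(ili, search, t, n_lags=52):
    """Features for week t: the previous n_lags %ILI values, then week-t search predictors."""
    return [ili[t - lag] for lag in range(1, n_lags + 1)] + list(search[t])


def soft_threshold(value, threshold):
    """Soft-thresholding operator used by L1-penalized coordinate descent."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimize (1/2n) * ||y - b0 - X b||^2 + lam * ||b||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    intercept = sum(y) / n
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0
            norm = 0.0
            for i in range(n):
                # Prediction that leaves feature j out of the current fit.
                pred_without_j = intercept + sum(
                    X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
                norm += X[i][j] ** 2
            beta[j] = soft_threshold(rho / n, lam) / (norm / n) if norm > 0 else 0.0
        # The intercept is not penalized.
        intercept = sum(
            y[i] - sum(X[i][k] * beta[k] for k in range(p)) for i in range(n)) / n
    return intercept, beta


# Toy data: the target depends only on the first feature.
X = [[1, 1], [2, -1], [3, 1], [4, -1], [5, 1], [6, -1]]
y = [2.0 * row[0] for row in X]
intercept, beta = lasso_coordinate_descent(X, y, lam=0.1)
# The second coefficient is set exactly to 0; the first is shrunk slightly below 2.
```

In the rolling scheme described above, the training set for week t would consist of the rows returned by `argo_design_row` for the 104 preceding weeks, and the fit would be repeated every week.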

To fine-tune predictive performance, adjustments to the procedure were introduced on a state-by-state basis:

Filtering features by correlation: For each week, non-autoregressive features ranked outside the top 10 by correlation were removed to reduce noise from poor predictors. Based on previous research, this complementary feature selection process benefits the performance of Lasso, which can be unstable in variable selection (refs. 23, 25).

Regularization hyperparameters: Features with high correlation to the target variable over the training set received a lower regularization weight, which makes them less subject to the L1 penalty (see the Supplement for derivation).

Weighting recent observations: Although ARGO dynamically trains on the last 104 weeks of observations, more recent observations likely contain more relevant information. Thus the most recent 4 weeks of data received a higher weight in the training set (set to twice the weight of the other observations).
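Two of the adjustments above, the correlation filter and the observation weights, lend themselves to a short sketch. The ranking here uses the signed Pearson correlation, which is an assumption, since the text does not say whether signed or absolute correlation was used. The feature-specific regularization weights are left out because their derivation is only given in the Supplement. All names and data are illustrative.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant series carries no correlation information
    return sxy / math.sqrt(sxx * syy)


def top_k_by_correlation(features, target, k=10):
    """Names of the k features most correlated with the target over the training window."""
    ranked = sorted(features, key=lambda name: pearson(features[name], target), reverse=True)
    return ranked[:k]


def observation_weights(n_weeks=104, n_recent=4, recent_weight=2.0):
    """Weight 1 for older weeks and a doubled weight for the most recent weeks."""
    return [1.0] * (n_weeks - n_recent) + [recent_weight] * n_recent


# Hypothetical non-autoregressive features over a 4-week window.
features = {"a": [1, 2, 3, 4], "b": [4, 3, 2, 1], "c": [1, 3, 2, 4]}
target = [1, 2, 3, 4]
selected = top_k_by_correlation(features, target, k=2)
weights = observation_weights()
```

The weights would then be passed to a weighted least-squares or weighted Lasso fit, so that errors on the last 4 weeks count twice as much as errors on earlier weeks.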

Network-based approach

Historical CDC %ILI observations show synchronous relationships between states, as shown in Fig. 2a (generated using corrplot (ref. 30) with the complete-linkage clustering method). To capture these relationships with the goal of improving our %ILI predictions, we dynamically constructed, for each state and each week, a regularized multivariable linear model with the following predictors: the target state’s %ILI values for the previous 52 weeks, and the synchronous (same-week) and previous three weeks’ observed CDC %ILI values from each of the 36 other states. Note that producing a %ILI prediction for a given state in a given week requires synchronous %ILI from the other states, which would not be available in real time. Instead, we use ARGO predictions for the current week as surrogates for these unobserved values. Like ARGO, this model is trained on a rolling 104-week window, with 10-fold cross-validation to determine the L1 regularization hyperparameter (formulation in Supplement). This model is denoted Net.
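The predictor layout of Net can be made concrete with a small sketch. The function name, the state labels, and the convention of passing the same-week values in a separate `sync` mapping are assumptions. Under this reading, `sync` holds observed %ILI when building training rows and ARGO predictions when building the row for the week being predicted.

```python
def net_features(ili, sync, target, t, n_ar_lags=52, n_net_lags=3):
    """Feature vector for predicting the target state's %ILI in week t.

    ili:  dict mapping state -> list of observed %ILI indexed by week
    sync: dict mapping every non-target state -> its same-week (week t) value
    """
    # Autoregressive terms for the target state.
    features = [ili[target][t - lag] for lag in range(1, n_ar_lags + 1)]
    for state in sorted(ili):
        if state == target:
            continue
        # Same-week value, followed by the observed values of the previous weeks.
        features.append(sync[state])
        features.extend(ili[state][t - lag] for lag in range(1, n_net_lags + 1))
    return features


# Toy example with three hypothetical states and shortened lags.
ili = {"A": [1.0, 1.1, 1.2, 1.3], "B": [2.0, 2.1, 2.2, 2.3], "C": [3.0, 3.1, 3.2, 3.3]}
argo_now = {"B": 2.25, "C": 3.35}  # ARGO surrogates for the unobserved week-3 values
row = net_features(ili, argo_now, target="A", t=3, n_ar_lags=2, n_net_lags=1)
```

With the settings described in the text (52 autoregressive lags, 36 other states, one synchronous value plus three lags per state), each row has 52 + 36 × 4 = 196 predictors.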

Ensemble approach

To combine the predictive power of ARGO and Net, we trained an ensemble based on a winner-takes-all voting system, which we named ARGONet. ARGONet’s prediction for a given week is set to Net’s prediction if, over the previous K predictions, Net produced a lower root mean square error (defined in Comparative analyses) than ARGO relative to the observed CDC %ILI. Otherwise, ARGONet’s prediction is set to ARGO’s prediction. To determine the hyperparameter K for each state, we trained ARGONet using the first 104 out-of-sample predictions of ARGO. We constrained K to take values in {1, 2, 3}. For each state, we then fixed the best value of K identified and used it to produce out-of-sample predictions for the period from September 28, 2014 onward. The value of K chosen for each state is shown in Supplementary Table 2.
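The selection rule can be sketched as below. The criterion for choosing K is not spelled out in the text. Here it is assumed to be the K that gives ARGONet the lowest RMSE over the training predictions, with ties resolved in favor of the smaller K. Function names and data are illustrative.

```python
import math


def rmse(pred, obs):
    """Root mean square error between predictions and observations."""
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(obs))


def argonet_predict(argo, net, observed, t, k):
    """Week-t prediction: Net's if Net beat ARGO over the previous k weeks, else ARGO's."""
    window = range(t - k, t)
    obs = [observed[i] for i in window]
    argo_err = rmse([argo[i] for i in window], obs)
    net_err = rmse([net[i] for i in window], obs)
    return net[t] if net_err < argo_err else argo[t]


def choose_k(argo, net, observed, candidates=(1, 2, 3)):
    """Assumed criterion: the K minimizing ARGONet's RMSE over the training predictions."""
    start = max(candidates)  # every candidate needs a full look-back window
    best_k, best_err = None, float("inf")
    for k in candidates:
        preds = [argonet_predict(argo, net, observed, t, k)
                 for t in range(start, len(observed))]
        err = rmse(preds, observed[start:])
        if err < best_err:
            best_k, best_err = k, err
    return best_k


# Hypothetical weekly values for one state.
observed = [1.0, 2.0, 3.0, 4.0]
argo = [1.5, 2.5, 3.5, 4.5]
net = [1.1, 2.6, 3.1, 4.1]
week3 = argonet_predict(argo, net, observed, t=3, k=1)  # Net was closer in week 2
week2 = argonet_predict(argo, net, observed, t=2, k=1)  # ARGO was closer in week 1
```

Because the rule only looks at past errors, it can be applied prospectively once the observed %ILI for the previous K weeks is available.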

Comparative analyses

To assess the predictive performance of the models, we produced state-level retrospective estimates using two benchmarks: (a) AR52, an autoregressive model, which uses the %ILI from the previous 52 weeks in a Lasso regression to predict %ILI of the current week, and (b) GFT, made by scaling each state’s Google Flu Trends time series to its official revised %ILI from October 4, 2009 to September 23, 2012.

The performances of all models and benchmarks compared to the official (revised) %ILI were scored using three metrics: root mean squared error (RMSE), Pearson correlation coefficient, and mean absolute percent error (MAPE). These were computed over the entire study period (September 30, 2012 to May 14, 2017) and over each influenza season (defined as week 40 of one year to week 20 of the next) within the study period.
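For reference, the three metrics in their standard textbook form are written out below, with \hat{y}_t the model estimate and y_t the official %ILI in week t over n scored weeks. The notation is introduced here and is not taken from the paper. MAPE is well defined because weeks with 0 reported cases were excluded.

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{t=1}^{n}\left(\hat{y}_t - y_t\right)^2},
\qquad
r = \frac{\sum_{t=1}^{n}\left(\hat{y}_t - \bar{\hat{y}}\right)\left(y_t - \bar{y}\right)}
         {\sqrt{\sum_{t=1}^{n}\left(\hat{y}_t - \bar{\hat{y}}\right)^2}\,
          \sqrt{\sum_{t=1}^{n}\left(y_t - \bar{y}\right)^2}},
\qquad
\mathrm{MAPE} = \frac{100}{n}\sum_{t=1}^{n}\left|\frac{\hat{y}_t - y_t}{y_t}\right|
```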

The models and benchmarks were further scored over two specific sub-periods: (1) the window when GFT was available (September 30, 2012 to August 9, 2015), and (2) the window starting with the first available ARGONet prediction (September 28, 2014 to May 14, 2017).

ARGO model estimates were generated using scikit-learn in Python 2.7 (ref. 31), while Net and ensemble models were generated in R 3.4.1. Data analysis was conducted in Python except for Fig. 2a, b, which used the R packages corrplot (ref. 30) and ggplot (ref. 32). The United States maps in Fig. 2b and Supplementary Figure 2 were made in ggplot using the shapefile from the maps package (ref. 33).

Disclaimer

The findings and conclusions in this report are those of the authors and do not necessarily represent the official position of the Centers for Disease Control and Prevention.

Code availability

The code supporting the results of this study is available from: https://github.com/fl16180/argonet.
