Citation: Osthus D, Daughton AR, Priedhorsky R (2019) Even a good influenza forecasting model can benefit from internet-based nowcasts, but those benefits are limited. PLoS Comput Biol 15(2): e1006599. https://doi.org/10.1371/journal.pcbi.1006599
Editor: David A. Broniatowski, The George Washington University, UNITED STATES
Received: May 21, 2018; Accepted: October 30, 2018; Published: February 1, 2019
This is an open access article, free of all copyright, and may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose. The work is made available under the Creative Commons CC0 public domain dedication.
Data Availability: The ILI data for this project can be accessed through the Delphi epidemiological data API, from the Delphi group at Carnegie Mellon University, available at https://github.com/cmu-delphi/delphi-epidata. We are unable to redistribute the Google Health Trends volume data used in this paper under Google’s terms of service. However, access to Google Health Trends can be requested. To request access to the Google Trends API, fill out this form: https://docs.google.com/forms/d/e/1FAIpQLSenHdGiGl1YF-7rVDDmmulN8R-ra9MnGLLs7gIIaAX9VHPdPg/viewform. The contact e-mail address is trends-api-support@google.com. The model target log scores, nowcasts, and non-zero features have been uploaded with this resubmission.
Funding: This work was supported in part by the U.S. Department of Energy through the Los Alamos National Laboratory’s Laboratory Directed Research and Development program (DO, ARD, RP). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Influenza forecasting background
Seasonal influenza causes substantial morbidity and mortality, including an estimated ten to fifty thousand deaths annually [1]. The ability to accurately monitor and forecast influenza outbreaks is thus a substantial public health priority. Traditional flu surveillance in the United States consists of a large network of volunteer reporting physicians and associated mortality, laboratory, and hospital surveillance, organized by the Centers for Disease Control and Prevention (CDC) [2]. This network, while expansive, has known limitations, including frequent revisions to previously issued data and a 1–2 week reporting lag. Recent work has suggested that internet data, for example search query volume or symptom reports on social media, might help fill the gap created by the reporting lag by providing real-time estimates of sick individuals. The estimate that fills the gap is called a nowcast, and computing this estimate is called nowcasting.
Much work has been done to provide evidence that forecasting models which incorporate internet data can outperform those without. A fairly exhaustive discussion of these models is provided in our previous work [3]; some particular examples are highlighted here. In general, studies have primarily incorporated data from social media, in particular Twitter [4, 5], Wikipedia [3, 6, 7], and Google search query volume [8–10]. Initial work focused on the observation that internet data (e.g., tweet volumes) and the number of influenza infections are correlated [5]. Subsequent studies found that models including internet data perform better than autoregressive baseline models (e.g., [4, 11]). Others added internet data to compartmental models (i.e., models governed by a system of differential equations that group individuals into compartments based on their health status). Using this approach, for instance, [9] added Google Flu Trends data to a compartmental model and found that this addition allowed them to predict the peak of two outbreaks several weeks in advance. Although the vast majority of studies have focused on incorporating an individual internet data stream, some have observed that ensemble methods using multiple data streams could outperform those where only one is added [10, 11].
The internet paradox
Though there is evidence that influenza forecasting models can be improved by incorporating internet data, it is unclear how useful these models are operationally, for two reasons. First, there are a number of concerns that internet data may be unreliable, as there is no reference data to know which individuals are actually sick; untangling which traces correspond to relevant health statuses and which do not is a challenging problem. These potential data problems have now become notorious. Google Flu Trends, while initially an effective model, later substantially overestimated seasonal flu for several years in a row [12]; it no longer publishes flu estimates. Also, of the seven models that participated in the second year of the CDC’s flu forecasting challenge (four using internet data and three not), the two best performing models did not use internet data [13]. Second, previous research has typically compared relatively simple forecasting models, often autoregressive or compartmental models, to an identical forecasting model augmented with internet data (as previously mentioned). However, this approach fails to take into account the possible impact of internet data on more intricate models that arguably perform better than their simpler counterparts (though there has been recent work evaluating the value of nowcasting with more intricate models [14, 15]). For example, in our previous work [16], we developed the Dynamic Bayesian Model (DBM), a flu forecasting model that does not use internet data. [16] compared DBM to all flu forecasting models that participated in the CDC’s 2015/16 and 2016/17 flu forecasting challenges, nationally, and found DBM outperformed all participating models, many of which used internet data for forecasting. These findings suggest that 1) DBM is a leading flu forecasting model and 2) internet data is not required for good flu forecasting performance.
Given all of this, it is unclear what we should conclude about internet data for flu forecasting. On one hand, there are numerous examples where internet data has improved flu forecasting in retrospective forecast settings. On the other hand, leading forecasting models make no use of internet data.
Problem statement
In this paper, we seek to provide clarity on the value of internet data when forecasting the flu in practice. We do so by conducting a focused experiment. The questions the experiment addresses are:
- Can a good flu forecasting model using no internet data be improved by adding internet data? Yes. Our experiment demonstrated that a good flu forecasting model can be improved in practice through the incorporation of internet-based nowcasts.
- If so, do the improvements hold across flu targets, season weeks, geographic regions, and flu seasons? Generally yes. The model augmented with internet-based nowcasts had equal or better performance than the model without such nowcasts across all considered flu targets, geographic regions, and flu seasons. Some short-term forecasts issued in December and early January, however, were worse than those of the model without internet data.
- What are the limitations of nowcasts, internet-based or otherwise, in terms of improved forecasting performance? Perfect nowcasting can improve forecast accuracy for both seasonal and short-term targets. Improvements for seasonal targets, however, are much smaller than for short-term forecasts. The overall possible improvement to flu forecasting accuracy via nowcasting is limited by the uneven impact nowcasts have on forecasting various public health-relevant targets. If improvements above and beyond what can be achieved with perfect nowcasting are needed, other, non-nowcasting model improvements must be considered.
Materials and methods
Data
In the following sections, we describe the data sources used for the experiment: CDC’s influenza-like illness network, Wikipedia, and Google Health Trends.
Influenza-like illness.
We use outpatient influenza-like illness (ILI) data from the 2012/13 through 2016/17 seasons for our analysis, here denoted 2012 through 2016. ILI is the number of patients that visited an influenza-like illness network (ILINet) participating clinic with symptoms consistent with the flu (temperature of 100 degrees Fahrenheit or greater and a cough and/or a sore throat with no other known cause), divided by the number of patients that visited an ILINet clinic for any reason [2]. In this paper, ILI is a proportion and refers to weighted ILI—a state population-weighted estimate of ILI. Fig 1 shows validation ILI nationally and for each of the 10 Health and Human Services regions (HHS regions), as designated by the Department of Health and Human Services, for flu seasons 2012 through 2016. Validation ILI refers to the ILI estimates issued on the validation date: March 2nd, 2018.
Fig 1. Validation ILI for the 2012 through 2016 flu seasons nationally and for all 10 HHS regions.
For all regions, ILI is low at the beginning (October) and end (May) of the flu season, peaking between December and March. Different HHS regions have different ILI magnitudes, with Region 6 having more intense flu seasons, on average, than Region 1.
A validation date is needed because ILI estimates are perpetually updated, as the participating ILINet clinics are continuously evolving and ILI submissions are rolling. Every week when ILI estimates are issued by the CDC, previously issued ILI estimates may be modified. Changes to initial ILI (the ILI estimates first issued by the CDC) after the initial issue date are referred to as backfill. When retrospectively recreating realistic forecasting environments, ILI data available on historical forecast dates are needed. Carnegie Mellon University’s Delphi group provides access to ILI data available on historical dates (referred to as available ILI) through their real-time epidemiological data API [17], facilitating faithful recreations of forecasting environments.
Fig 2 shows the diminishing effect of backfill as a function of weeks since initial issue date.
- Initial ILI has historically been revised by up to ±25%.
- If 10 or more weeks have passed since the initial issue date, most ILI estimates have historically been revised by less than ±5%.
- If 40 or more weeks have passed since the initial issue date, most ILI estimates have historically been revised by less than ±2.5%.
Fig 2. Changes to available ILI estimates as a function of weeks since the initial issue date, across all seasons and regions.
50% and 90% uncertainty intervals are shown. The effect of backfill diminishes as a function of weeks since the initial issue date and stabilizes after roughly 40 weeks.
One of the issues with backfill is that even small modifications to ILI estimates can have large impacts on target evaluation. This is illustrated in Fig 3 for the 2014 flu season in Region 2. Using ILI issued on May 24th, 2015, the flu season peaked the week of January 25th, 2015 at 5.22%. Using the validation ILI, however, issued nearly three years later, the flu season peaked the week of December 21st, 2014 at 5.25%, a difference of six weeks. Even though the backfill from the end of May 2015 to March 2018 was minimal (0.03 percentage points for the peak intensity), the peak week shifted by over a month.
Fig 3. Backfill for the 2014 flu season for Region 2.
The validation peak intensity is 0.0525 and occurs in the middle of December. The peak intensity based on ILI issued on May 24th, 2015 is 0.0522 and occurs at the end of January, a difference of 6 weeks even though the intensity of the two weeks changed only by 0.0003.
The existence of backfill demands a specific validation date. Based on Fig 2, the impact of backfill diminishes as a function of weeks since the initial issue date, suggesting a good validation date should be “a long time” after the latest initial issue date. For this work, the latest initial issue date is May 14th, 2017 and the validation date is March 2nd, 2018—a difference of nearly 42 weeks.
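The gap between the two dates quoted above can be checked directly. The short Python sketch below uses only the two dates stated in the text; the variable names are illustrative.

```python
from datetime import date

# Dates stated in the text.
latest_initial_issue = date(2017, 5, 14)
validation_date = date(2018, 3, 2)

# Elapsed time between the latest initial issue date and the validation date.
gap_days = (validation_date - latest_initial_issue).days
gap_weeks = gap_days / 7

print(gap_days, round(gap_weeks, 1))
```

The result falls just short of 42 weeks, which is past the roughly 40 weeks after which Fig 2 shows backfill stabilizing.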
Wikipedia and Google Health Trends.
We use Google Health Trends (GHT) web search volumes to nowcast the missing lag week of ILI data. This requires selecting search queries that are semantically related to influenza.
To do this, we used the Wikipedia inter-article link graph. A dataset [18] from our previous work [3] includes a list of 573 articles linked from “Influenza”, and our other previous work [19] mapped these articles to Google search query strings (i.e., what web users typed into the Google search box). Briefly, this mapping is a mechanical translation based on regular expressions. The only manual step beyond identification of the seed article (“Influenza”) is enumerating stop phrases such as “and”, “list of”, and “influenza virus subtype”.
We used the Google Health Trends API [20] to download the weekly search volume for the 573 queries over the study period for the U.S. nationally and each state. This was done in batches from February 16 to March 5, 2018. The GHT API returns summaries of search activity based on a random sample and thus cannot be deterministically reproduced.
Methods
In the following sections, we describe the model used for flu forecasting, the internet-based nowcasting procedure, the experimental setup, and the procedure for assessing the experiment.
Dynamic Bayesian Model.
The Dynamic Bayesian Model (DBM) is a hierarchical Bayesian model. A simplified, graphical representation of DBM is shown in Fig 4. The quantity π_{t} represents a latent state of ILI transmission on week t, linked in time through directed edges. Observations of available ILI (y_{t}) are linked to latent ILI states. For a given forecast date, some of the available ILI nodes are observed (shaded) while others are unobserved (unshaded). DBM is a statistical model that estimates the unobserved nodes, given the observed nodes. As the flu season progresses, more available ILI nodes are observed and estimation of the shrinking set of unobserved nodes is refined.
Fig 4. A simplified graphical representation of DBM.
π_{t} represents latent ILI states, while y_{t} represents available ILI data. Shaded nodes represent observed quantities; unshaded nodes represent unobserved quantities. Directed edges represent dependencies between nodes.
The full description of DBM is somewhat involved; the interested reader is directed to [16] for more details than are presented here. For the purposes of this paper, it suffices to know that DBM is a hybrid model, combining the structure of a compartmental disease transmission model with the flexibility of a statistical model.
Forecasting with DBM is performed by sampling from the posterior predictive distribution—the distribution of unobserved available ILI nodes given observed available ILI nodes—via Markov chain Monte Carlo (MCMC). The MCMC sampling is done using the rjags package [21] within R [22]. In this paper, all results are based on runs of three chains of 15,000 iterations per chain with a burn-in of 5,000 iterations and thinning every 5th iteration. Thus, summaries of the posterior predictive distribution are based on 6,000 samples. Multiple chains were run to assess convergence to the stationary distribution, 15,000 iterations per chain were selected for chain convergence, and thinning every 5th iteration was chosen to address chain mixing.
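The sample count follows from the chain settings. The Python sketch below reproduces the arithmetic, assuming the burn-in is counted within the 15,000 iterations of each chain (the reading that yields 6,000 retained draws); the function name is illustrative.

```python
def retained_samples(chains, iterations, burn_in, thin):
    """Number of posterior draws kept across all chains.

    ASSUMPTION: burn-in iterations are part of `iterations`, and every
    `thin`-th post-burn-in draw is kept.
    """
    kept_per_chain = (iterations - burn_in) // thin
    return chains * kept_per_chain


n_samples = retained_samples(chains=3, iterations=15_000, burn_in=5_000, thin=5)
print(n_samples)
```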
The posterior predictive distribution does not explicitly capture the effect of backfill. Thus, a modification is applied that combines each draw j from the posterior predictive distribution with the available ILI (y_{t}) on week t as follows:
(1)
where ỹ_{jt} is a modified estimate of validation ILI accounting for backfill on week t, n_{t} ≥ 1 is the number of weeks since the initial issue date of y_{t}, and κ is a tuning parameter that controls the weight of the convex combination of the posterior predictive draw and y_{t}. Eq 1 is motivated by Fig 2. Available ILI becomes a better estimate of validation ILI as the number of weeks since the initial issue date increases. Eq 1 places more weight on available ILI y_{t} as the number of weeks since the initial issue date n_{t} increases, capturing the diminishing effect of backfill illustrated in Fig 2. The parameter κ was set at 0.75, a pragmatic choice. Alternatively, κ could have been chosen more formally via cross-validation.
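To illustrate the kind of convex combination described here, the Python sketch below assumes that the weight on the posterior predictive draw is κ raised to the power n_{t}, so that the weight on y_{t} is one minus that quantity. This functional form is an assumption, chosen only because it places more weight on y_{t} as n_{t} grows; the function name, argument names, and numeric inputs are hypothetical.

```python
def backfill_adjust(draw, available_ili, weeks_since_issue, kappa=0.75):
    """Blend one posterior predictive draw with available ILI.

    ASSUMPTION: the weight on the draw is kappa ** weeks_since_issue.
    This is one convex combination with the properties described in the
    text, not necessarily the exact form of Eq 1.
    """
    if weeks_since_issue < 1:
        raise ValueError("weeks_since_issue must be >= 1")
    if not 0.0 <= kappa <= 1.0:
        raise ValueError("kappa must lie in [0, 1]")
    weight_on_draw = kappa ** weeks_since_issue
    return weight_on_draw * draw + (1.0 - weight_on_draw) * available_ili


# Hypothetical values: one draw of 0.030 and an available ILI of 0.020.
draw, available = 0.030, 0.020
adjusted = [backfill_adjust(draw, available, n) for n in (1, 5, 10, 40)]
for n, value in zip((1, 5, 10, 40), adjusted):
    print(n, round(value, 5))
```

Under this assumed weighting, the adjusted value moves from the draw toward the available ILI as the weeks since the initial issue date accumulate, mirroring the pattern in Fig 2.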
Fig 5 shows summaries of the backfill-modified forecasts (ỹ_{jt}) for Region 2 for the 2014 flu season. We see as more available ILI is incorporated into DBM fitting, forecasts are updated. Note that there is forecast uncertainty for weeks where available ILI is observed. That uncertainty accounts for backfill. The degree of forecast uncertainty around available ILI diminishes as the number of weeks since the initial issue date increases.
Fig 5. DBM forecasts for validation ILI for the 2014 flu season for Region 2.
Posterior predictive summaries of ỹ_{jt} are presented, based on the available ILI for each panel. Notice there is forecast uncertainty even for weeks where available ILI exists, accounting for backfill.
DBM does a good job capturing the validation ILI, with respect to its predictive uncertainty, for all forecasts throughout the season. In fact, [16] found that, when comparing DBM forecasts nationally to all models that participated in the CDC’s 2015/16 and 2016/17 flu forecasting challenges, DBM outperformed all competing models; i.e., DBM is a leading forecasting model. It is worth mentioning that DBM only uses ILI estimates as a data source; it does not make use of any external data, such as internet-based information. In subsequent sections, we discuss how DBM, an already-good forecasting model, can be augmented to incorporate internet-based information and assess the value of that augmentation.
Internet-based nowcasting.
The purpose of ILI nowcasting is to fill the gap created by the CDC’s ILI reporting lag. In this paper, we treat the reporting lag as one week. Our approach, however, can trivially be extended to a k-week reporting lag, with k ≥ 1.
A good nowcast is one that predicts the yet-to-be-issued one-week-ahead validation ILI. We use LASSO regression for ILI nowcasting. It is important to note that our objective in this section is not to identify the “best” model for nowcasting. Rather it is to 1) identify a “good” model for nowcasting and 2) understand the limits on forecasting improvement that nowcasting provides, independent of the selected nowcasting model. That said, the remainder of the section provides evidence in support of our decision to nowcast with LASSO regression.
The space of nowcasting approaches presented in the literature is large, including time series autoregression [23], regularized regression [24], support vector machines [11], neural nets [25], and random forests [26]. We consider elastic net regression models, a form of regularized regression, for nowcasting. Elastic net regression solves the following minimization:
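$$\min_{\beta_0,\,\beta}\; \frac{1}{2N}\sum_{i=1}^{N}\left(y_i - \beta_0 - x_i^{\top}\beta\right)^2 + \lambda\left[\frac{1-\alpha}{2}\,\lVert\beta\rVert_2^2 + \alpha\,\lVert\beta\rVert_1\right] \qquad (2)$$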
where λ ≥ 0 and α ∈ [0, 1] are tuning parameters [27]. When α = 1, elastic net regression reduces to LASSO regression with penalty parameter λ. When α = 0, elastic net regression reduces to ridge regression with penalty parameter λ/2. We use the glmnet package in R to carry out the elastic net parameter estimation [28].
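The two limiting cases can be checked numerically. The Python sketch below assumes the glmnet-style penalty λ[(1 − α)‖β‖₂²/2 + α‖β‖₁], which is consistent with the reductions stated above; the function name and the coefficient values are hypothetical.

```python
import math


def elastic_net_penalty(beta, lam, alpha):
    """Elastic net penalty on the non-intercept coefficients.

    ASSUMPTION: glmnet-style parameterization,
    lam * ((1 - alpha) / 2 * ||beta||_2^2 + alpha * ||beta||_1).
    """
    l1 = sum(abs(b) for b in beta)
    l2_squared = sum(b * b for b in beta)
    return lam * ((1.0 - alpha) * l2_squared / 2.0 + alpha * l1)


# Hypothetical coefficient vector and penalty parameter.
beta = [0.5, -1.0, 0.0]
lam = 0.2

lasso_penalty = elastic_net_penalty(beta, lam, alpha=1.0)  # lam * ||beta||_1
ridge_penalty = elastic_net_penalty(beta, lam, alpha=0.0)  # (lam / 2) * ||beta||_2^2

print(lasso_penalty, ridge_penalty)
```

Intermediate values of α mix the two penalties, which is why any α > 0 retains some ability to set coefficients exactly to zero.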
We considered elastic net nowcasting models corresponding to α = 0, 0.1, 0.2, …, 1. The following procedure produces one nowcast and was repeated for each region/season/forecast week combination.
- Let X_{j,t} be the N_{j,t} × (P + 1) training matrix. Each non-intercept column corresponds to one of the P = 573 GHT search queries and each row corresponds to a week. The N_{j,t} rows cover flu seasons j − 2 and j − 1, as well as the first t weeks of flu season j. Thus, the nowcasting model is always trained on between two and three full flu seasons, inclusive.
- When nowcasting for an HHS region (as opposed to nationally), a training matrix is constructed for each state in the HHS region. X_{j,t} is then constructed by taking a population-weighted average of the state training matrices, where the state weights are proportional to their 2010 Census population estimates.
- Let Y_{j,t} be the N_{j,t} × 1 vector of available ILI estimates issued on week t of flu season j corresponding to the weeks of X_{j,t}.
- For each value of α, fit elastic net regression, using 10-fold cross-validation for the selection of λ that minimizes the cross-validated root mean-squared error (RMSE).
- Given the fitted elastic net model and X_{j,t+1} (the GHT search query activity for the nowcasting week), predict ÿ_{j,t+1}, the validation ILI for week t + 1. Call the nowcast ŷ_{j,t+1}.
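The regional aggregation step in the procedure above can be sketched in a few lines of Python. The sketch covers only the population-weighted averaging of state-level matrices, not the elastic net fit; the state labels, populations, and search volumes are hypothetical placeholders rather than Census or GHT values.

```python
def population_weighted_average(state_matrices, populations):
    """Average state-level matrices with weights proportional to population.

    state_matrices: dict mapping state -> list of rows (weeks x queries).
    populations:    dict mapping state -> population used for weighting.
    All matrices are assumed to share the same shape.
    """
    states = list(state_matrices)
    total_population = sum(populations[s] for s in states)
    n_rows = len(state_matrices[states[0]])
    n_cols = len(state_matrices[states[0]][0])

    region = [[0.0] * n_cols for _ in range(n_rows)]
    for state in states:
        weight = populations[state] / total_population
        for i, row in enumerate(state_matrices[state]):
            for k, value in enumerate(row):
                region[i][k] += weight * value
    return region


# Hypothetical two-state region with two weeks and two search queries.
state_matrices = {
    "State A": [[1.0, 2.0], [3.0, 4.0]],
    "State B": [[5.0, 6.0], [7.0, 8.0]],
}
populations = {"State A": 3_000_000, "State B": 1_000_000}

region_matrix = population_weighted_average(state_matrices, populations)
print(region_matrix)
```

With these placeholder inputs the weights are 0.75 and 0.25, so each regional entry sits three quarters of the way toward the larger state's value.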
For each nowcasting model, we computed summaries of the correlation and RMSE across all 55 region/season pairs. Results are shown in Table 1. We see that ridge regression (α = 0) produces qualitatively worse nowcasts than all other considered elastic net regressions, suggesting that some amount of ℓ_{1} penalty on the regression coefficients improves nowcasting. The ℓ_{1} penalty is the mechanism by which elastic net regression can “zero out” unhelpful and/or redundant features. In general, the average number of estimated non-zero coefficients decreases as α increases, with a steep drop between α = 0 and α = 0.1. Elastic net for α ≥ 0.1 results in correlations, averaged over all regions and seasons, of ∼ 0.88 and RMSEs of ∼ 0.005. There is little difference between the average estimated correlations and RMSEs, as well as their estimated ranges, for α ≥ 0.1. For these reasons, we use LASSO regression (i.e., α = 1) for nowcasting.
Table 1. Nowcast performance for elastic net models of α = 0.0, 0.1, …, 1.0.
The correlation is the average correlation over all 55 region/season pairs, while the min and max correlations are the min and max correlations over all region/season pairs. The RMSE columns are defined analogously. The average number of estimated non-zero features is shown in the “Features” column. Ridge regression (α = 0) is qualitatively worse than all α > 0 models. There is little difference between all α ≥ 0.1 models with respect to correlation and RMSE. The number of non-zero features tends to decline as α increases.
Experimental setup.
Our goal in this paper is to interrogate the value of internet-based nowcasts when incorporated into a good internet-free forecasting model. To do so, we conduct an experiment where we compare three statistical models across various scenarios. The three statistical models are illustrated in Fig 6.
Fig 6. The simplified graphical representations of DBM, DBM+, and DBMperfect.
π_{t} is the latent state of ILI for week t, while y_{t} is available ILI. Observed nodes are shaded; unobserved nodes are unshaded. The ILI reporting lag creates a gap between the day a forecast is rendered (vertical “Today” line) and the most recent available ILI estimate (y_{3}). DBM does not fill in the reporting gap, denoted by the unshaded node y_{4}. DBM+ fills in the reporting gap with a nowcast, denoted by the shaded node ŷ_{4}. DBMperfect cheats by filling the reporting gap with the actual validation ILI, denoted by the shaded node ÿ_{4}.
The first model is DBM. Due to the ILI reporting lag, there is a one week gap between the most readily available ILI estimates and the forecast date. DBM does not fill in that gap with a nowcast.
The second model is DBM+. DBM+ is the same as DBM, except DBM+ fills in the ILI gap with an internet-based nowcast. For DBM+, the nowcast is treated just like a CDC-issued ILI estimate. Based on Table 1, the nowcasts are relatively highly correlated with the validation ILI and have an average RMSE of ∼ 0.005. That being said, the nowcasts do not perfectly match the validation ILI they are estimating.
The third model we consider is DBMperfect. DBMperfect cheats by filling in the ILI gap with the actual validation ILI estimate. Unlike DBM and DBM+, DBMperfect cannot be used for forecasting in real-time, as the validation ILI estimates required to run DBMperfect are not available when forecasts are rendered. DBMperfect, however, serves as an upper bound on the possible DBM improvement one can expect when using any nowcast. How well DBM+ approximates the performance of DBMperfect is directly related to the quality of the nowcasts.
The experiment we conduct compares the forecasting performance of DBM, DBM+, and DBMperfect across three factors:
- Geography (11 levels): Models are fit at the national level as well as the 10 HHS regions (data shown in Fig 1).
- Flu Seasons (5): Models are fit to the 2012 through the 2016 flu seasons.
- Epidemic Weeks (25): Models are fit for 25 consecutive weeks, starting on epidemic week 44, roughly the end of October, and ending roughly at the middle-to-end of April.
Thus, each model renders forecasts for 11 × 5 × 25 = 1375 scenarios.
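The scenario count can be confirmed by enumerating the three factors. In the Python sketch below, the geography labels and the use of a 1-to-25 issue-week index (rather than calendar epidemic week numbers) are illustrative choices.

```python
from itertools import product

# Factor levels as described in the text.
geographies = ["National"] + [f"HHS Region {i}" for i in range(1, 11)]  # 11 levels
seasons = list(range(2012, 2017))                                       # 2012..2016
issue_weeks = list(range(1, 26))  # index of the 25 consecutive forecast weeks

scenarios = list(product(geographies, seasons, issue_weeks))
print(len(scenarios))
```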
Experimental assessment.
We assess the models based on their ability to forecast seven targets specified by the CDC’s flu forecasting challenge [29]:
- Onset: the first of three consecutive weeks equal to or above baseline, where baseline is a region/season-specific quantity determined by the CDC
- Peak intensity (PI): the maximum validation ILI estimate for the flu season
- Peak timing (PT): the week(s) the maximum validation ILI estimate for the flu season is observed
- Short-term forecasts: one- through four-week-ahead forecasts, where week-ahead forecasts are with respect to available ILI. As a concrete example, in Fig 6, a one-week-ahead forecast for all models means forecasting week 4, as the last available ILI week is week 3.
Forecasts for each target are in the form of a probability distribution, capturing the uncertainty in the target-specific forecasts. Accuracy is assessed with a modified log scoring rule following that of the CDC’s flu forecasting challenge [29]. Let
$$p_{m,r,s,\tau,t} = \left(p_{m,r,s,\tau,t,1},\; \ldots,\; p_{m,r,s,\tau,t,n_{\tau}}\right) \qquad (3)$$
be a length n_{τ} vector of probabilities corresponding to bins for model m, region r, flu season s, target τ, and issue week t, such that all elements of p_{m,r,s,τ,t} are non-negative and sum to 1. For targets onset and PT, bins correspond to weeks. For all other targets, bins are equally spaced intervals of width 0.001 from 0 to 0.13, with a final bin equal to (0.13, 1.00].
For a target, let the correct value as determined by the validation ILI be equal to bin ĩ. Then, the modified log score l_{m,r,s,τ,t} is computed as
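$$l_{m,r,s,\tau,t} = \max\left\{-10,\; \ln\left(\sum_{i \in \tilde{I}} p_{m,r,s,\tau,t,i}\right)\right\} \qquad (4)$$
where
$$\tilde{I} = \{\tilde{i} - 1,\; \tilde{i},\; \tilde{i} + 1\} \qquad (5)$$
for onset and PT and
$$\tilde{I} = \{\tilde{i} - 5,\; \ldots,\; \tilde{i} + 5\} \qquad (6)$$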
for PI and the short-term forecasts, and ln() is the natural log. If, for instance, the peak timing occurred on epidemic week 5 (ĩ = 5), then Ĩ = {4, 5, 6}. If the forecast assigned probability 0.05, 0.2, and 0.15 to the epidemic weeks of Ĩ, respectively, then the modified log score would be ln(0.05 + 0.2 + 0.15) = −0.92.
Note that l_{m,r,s,τ,t} ∈ [−10, 0] and is not particularly interpretable. We employ a simple, monotonic transformation to the modified log score,
$$s_{m,r,s,\tau,t} = \exp\left(l_{m,r,s,\tau,t}\right) \qquad (7)$$
referred to as the forecast skill, where s_{m,r,s,τ,t} ∈ [0, 1] (or more specifically, [exp(−10), 1]) and exp() is the exponential function.
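The scoring rule and the skill transformation can be sketched in Python as follows. The sketch makes two assumptions: the floor of −10 is applied by truncation, consistent with the stated range of the modified log score, and the neighborhood around the correct bin is passed in as a tolerance in bins (1 for week-based targets, matching the peak timing example). The function names and the probability vector are hypothetical.

```python
import math


def modified_log_score(probs, true_bin, tolerance, floor=-10.0):
    """Log of the probability assigned to bins near the correct bin.

    probs:     list of bin probabilities (non-negative, summing to 1).
    true_bin:  index of the bin containing the validation value.
    tolerance: number of neighboring bins on each side that also count.
    ASSUMPTION: scores below `floor` (or undefined) are set to `floor`.
    """
    lo = max(0, true_bin - tolerance)
    hi = min(len(probs) - 1, true_bin + tolerance)
    mass = sum(probs[lo:hi + 1])
    if mass <= 0.0:
        return floor
    return max(floor, math.log(mass))


def forecast_skill(log_score):
    """Monotonic transformation of the modified log score to (0, 1]."""
    return math.exp(log_score)


# Hypothetical peak-timing forecast; list index i stands for epidemic week i
# (index 0 is unused and carries zero probability).
probs = [0.0, 0.1, 0.1, 0.1, 0.05, 0.2, 0.15, 0.1, 0.1, 0.1]
score = modified_log_score(probs, true_bin=5, tolerance=1)
skill = forecast_skill(score)
print(round(score, 2), round(skill, 2))
```

With the peak on epidemic week 5, the bins for weeks 4, 5, and 6 carry probabilities 0.05, 0.2, and 0.15, reproducing the worked example in the text.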
Throughout this work, we compute summaries of the modified log score and, hence, summaries of the forecast skill, where we average over indices, denoted by “·”. For instance, the model-specific average modified log score is denoted as l_{m,.,.,.,.} and is the modified log score averaged over all regions, seasons, targets, and issue weeks. The corresponding model-specific average forecast skill is s_{m,.,.,.,.} = exp(l_{m,.,.,.,.}). Because it is the exponential of an average of logs, the model-specific average forecast skill is the geometric mean of the individual forecast skills s_{m,r,s,τ,t}. For reference, the winning forecast skill for the CDC’s 2016/17 challenge was s_{m,.,2016,.,.} ≈ 0.45.
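The relationship between the averaged log score and the geometric mean can be verified numerically. The Python sketch below uses hypothetical log scores; it shows that exponentiating the average log score gives the geometric mean of the individual skills, which differs from their arithmetic mean.

```python
import math
from statistics import geometric_mean, mean

# Hypothetical modified log scores for three forecasts.
log_scores = [-0.5, -1.0, -2.0]
skills = [math.exp(l) for l in log_scores]

# Average forecast skill as defined in the text: exp of the average log score.
average_skill = math.exp(mean(log_scores))

print(round(average_skill, 4))           # exp(mean of log scores)
print(round(geometric_mean(skills), 4))  # same value
print(round(mean(skills), 4))            # arithmetic mean, which is larger
```

Because the geometric mean is pulled down by poor scores more than the arithmetic mean is, a single forecast that assigns very low probability to the truth has a pronounced effect on the average skill.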
Results
Fig 7 displays the average forecast skill over all seasons, regions, and issue weeks for each target and model. The average forecast skill, smallest to largest, is DBM, DBM+, and DBMperfect for each target, indicating that DBM’s forecasting ability can be improved by including internet-based nowcasts in the manner outlined above. Further, DBMperfect provides bounds on the degree of improvement that can be achieved through nowcasting.
Fig 7. Target-specific average forecast skill (s_{m,.,.,τ,.}) averaged across all regions, seasons, and weeks.
DBMperfect has the best forecast skill for all short-term forecasts, followed by DBM+, and finally DBM, respectively. The largest improvement in forecast skill relative to DBM occurs for one-week-ahead forecasts (i.e., nowcasts), as expected, as DBMperfect has perfect knowledge of nowcasts (but s_{DBMperfect,.,.,τ,.} ≠ 1 because of backfill), while DBM+ has an external estimate of the nowcast. All models exhibit similar forecast skills for seasonal targets (i.e., onset, PI, and PT), with a slight improvement for DBMperfect over DBM for all three seasonal targets.
Not surprisingly, nowcasts provide the greatest potential for forecast improvement of the one-week-ahead target. This is because the nowcast provides a direct estimate of the one-week-ahead target. The reason the forecast skill is not 1 for DBMperfect’s one-week-ahead forecasts is backfill. Though we imputed the validation ILI for the nowcast, DBMperfect does not know the nowcast is perfect. The backfill component of DBM still runs (i.e., Eq 1), resulting in one-week-ahead forecasts for DBMperfect with uncertainty and hence, forecast skills not equal to 1. Because, however, DBM’s modeling components are all connected (see Fig 6), direct information for the one-week-ahead target has ripple effects on all other targets. We see that two- through four-week-ahead forecasts are also uniformly improved when comparing DBM’s average forecast skill to DBM+ and DBMperfect, though both the magnitude of improvement and the magnitude of overall forecast skill decline with increased lead times. The seasonal targets (i.e., onset, PI, and PT) are improved by a relatively small amount when internet-based nowcasts are used, but improved nonetheless. The small difference between DBM and DBM+ forecast skill for seasonal targets relative to short-term targets is not wholly surprising, as nowcasts do not provide direct estimates of these targets. However, the fact that forecast skill did uniformly improve for seasonal targets when compared to DBM is encouraging, as the incorporation of nowcasts into DBM’s modeling framework can improve all aspects of forecasting, not just short-term forecasts.
When average forecast skill is considered on a weekly basis, however, we see there are weeks of the season when DBM+ has worse skill than DBM. Fig 8 shows the average forecast skill improvement relative to DBM for DBM+ and DBMperfect. DBM+ has better average forecast skill for most weeks of the season for all short-term targets, but actually can have worse average forecasting skill during December and early January—the time of year when ILI tends to rapidly increase. This suggests that nowcasts used in DBM+ provide biased information and DBM+ would, in fact, be better off using no nowcasts than the ones generated by LASSO during December and early January. Averaged over the whole season, however, the LASSO nowcasts used by DBM+ yield better short-term forecasts than DBM. Fig 8 makes clear the most substantial opportunity for nowcasting improvement is during December and January. In contrast, DBMperfect has better average forecast skill than DBM for all weeks of the season and all short-term forecasting targets, indicating that DBM’s forecast skill can be improved uniformly if nowcasting can be done perfectly.
Fig 8. Target and week-specific relative average forecast skill (s_{m,.,.,τ,t} − s_{DBM,.,.,τ,t}) averaged across all regions and seasons.
DBMperfect has better average forecast skill than DBM for all weeks and short-term targets. DBM+ has better average forecast skill than DBM most of the season, but tends to have worse average forecast skill during December and early January—the time of year ILI tends to rapidly increase.
Fig 9 plots the region-specific average forecast skills (s_{m,r,.,.,.}) for all regions. The overall forecast skill varies by region, with the lowest forecast skill for Region 2 and highest forecast skill nationally. For every region, however, we see DBM+ has better forecast skill than DBM, while DBMperfect has better forecast skill than DBM+. The improvement of DBM+ over DBM varies; for some regions the improvement is relatively small (e.g., Regions 2 and 4), while in other regions it approaches that of DBMperfect (e.g., Regions 8 and 9). For all regions, though, DBM+ improves forecast skill over DBM, an appealing result.
Fig 9. Average forecast skill by region, averaged over all seasons and targets (s_{m,r,.,.,.}).
Regions are ordered by increasing DBMperfect average forecast skill. For all regions, DBM has the lowest average forecast skill, followed by DBM+ and DBMperfect, respectively.
Finally, we compute the season-specific average forecast skill (s_{m,.,s,.,.}), displayed in Fig 10. Here we see that average forecast skill varies from season-to-season, with the lowest forecast skill for 2012 and the highest for 2016. For all seasons, DBMperfect has the best forecast skill, followed by DBM+ and finally DBM. The improvement of DBM+ relative to DBM varies across seasons. In some years, the improvement is relatively small (e.g., 2012 and 2015) while in others it is larger (e.g., 2013 and 2014). For all seasons, though, DBM+ improves the forecast skill over DBM.
Fig 10. Average forecast skill by season, averaged over all regions and targets (s_{m,.,s,.,.}).
DBMperfect has the highest average forecast skill for all seasons, followed by DBM+ and DBM, respectively.
Figs 7, 8, 9 and 10 provide strong evidence that DBM can be improved through the inclusion of internet-based nowcasts. These conclusions hold across all considered targets, regions, seasons, and most weeks of the season.
The improvement of DBM+ over DBM does not just exist; it is also appreciable. The model-specific average forecast skills, averaged over all regions, seasons, targets, and issue weeks, are
The difference between s_{DBM+,.,.,.,.} and s_{DBM,.,.,.,.} is 0.034. While seemingly small, that is roughly a difference between 1st place and 7th place in the CDC’s 2016/17 30-model flu forecasting challenge.
The upper bound on forecast skill improvement is captured by the difference in average forecast skill between s_{DBMperfect,.,.,.,.} and s_{DBM,.,.,.,.}. That difference is 0.074, roughly the difference between 1st place and 11th place. An improvement equivalent to a 7th-to-1st place jump in practice, with the potential for an 11th-to-1st place jump, simply by incorporating internet-based nowcasts is appreciably large.
Finally, the difference between DBMperfect and DBM+ represents the magnitude of improvement left unexploited by our LASSO nowcasting model. This difference is 0.04, larger than the improvement of DBM+ over DBM. Thus, though DBM+ provides significant gains over DBM, there is still potential for significant improvements to DBM+ via improved nowcasting. Fig 8 highlights that improvements to nowcasting during December and early January are good places to start. If, however, improvements above and beyond what DBMperfect provides are desired, improving the forecasting model through nowcasting will be insufficient, as DBMperfect provides an upper bound on the improvement one can expect from nowcasting, regardless of nowcasting modeling and data choices.