Over the last ten years, the US Centers for Disease Control and Prevention (CDC) has organized an annual influenza forecasting challenge, motivated by the idea that accurate probabilistic forecasts could improve situational awareness and lead to more effective public health actions. Starting with the 2021/22 influenza season, the forecasting targets for this challenge have been based on hospital admissions reported in the CDC's National Healthcare Safety Network (NHSN) surveillance system. Reporting of influenza hospital admissions through NHSN began only within the last few years, so only limited historical data are available for this signal. To produce forecasts despite the limited data for the target surveillance system, we augmented these data with two signals that have a longer historical record: 1) ILI+, which estimates the proportion of outpatient doctor visits where the patient has influenza; and 2) rates of laboratory-confirmed influenza hospitalizations at a selected set of healthcare facilities. Our model, Flusion, is an ensemble that combines gradient boosting quantile regression models with a Bayesian autoregressive model. The gradient boosting models were trained on all three data signals, while the autoregressive model was trained on only the target signal; all models were trained jointly on data for multiple locations. Flusion was the top-performing model in the CDC's influenza prediction challenge for the 2023/24 season. In this article we investigate the factors contributing to Flusion's success, and we find that its strong performance was primarily driven by the use of a gradient boosting model that was trained jointly on data from multiple surveillance signals and locations. These results indicate the value of sharing information across locations and surveillance signals, especially when doing so adds to the pool of available training data.
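As a rough illustration of what combining quantile forecasts from several component models can look like, the following minimal sketch averages predicted quantiles level by level and scores the result with the pinball (quantile) loss. The abstract does not state how Flusion combines its components or how forecasts are scored, so the equal-weight averaging, the pinball-loss scoring, the function names, and all numbers below are assumptions made for illustration only.

```python
from statistics import fmean


def pinball_loss(y, q_pred, tau):
    """Pinball (quantile) loss for one observation y at quantile level tau."""
    diff = y - q_pred
    return tau * diff if diff >= 0 else (tau - 1.0) * diff


def average_quantiles(member_forecasts):
    """Average component forecasts level by level (equal weights, an assumption).

    member_forecasts: list of dicts mapping quantile level -> predicted value.
    All dicts are assumed to share the same quantile levels.
    """
    levels = sorted(member_forecasts[0])
    return {tau: fmean(m[tau] for m in member_forecasts) for tau in levels}


# Hypothetical weekly hospital-admission quantiles from two component models.
gbq_forecast = {0.1: 80.0, 0.5: 100.0, 0.9: 130.0}  # gradient boosting member
ar_forecast = {0.1: 60.0, 0.5: 90.0, 0.9: 150.0}    # autoregressive member

ensemble = average_quantiles([gbq_forecast, ar_forecast])

# Score the ensemble against a hypothetical observed value.
observed = 110.0
score = fmean(pinball_loss(observed, q, tau) for tau, q in ensemble.items())
```

Averaging at each quantile level keeps the combined quantiles ordered whenever every component's quantiles are ordered, which is one reason this kind of combination is a common simple baseline for quantile-based forecasts.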