Disease forecasting models typically rely on a single data stream, which makes them brittle when histories are short or noisy. Recent top-performing models have shown that synthesizing multiple reporting systems for the same disease improves performance. Other recent work takes this idea a step further, using transfer learning to train a forecasting model for one disease on data from a different disease. We substantially extend both approaches, training machine learning models on data that span 66 infectious diseases and several data streams. We investigate the value of incorporating additional data streams when forecasting 20 different disease data streams. We find that incorporating other data streams improves forecasting in the vast majority (84.9%) of the time series and model structures considered. However, our work also shows that the quality of the added data matters: adding data that differ greatly from the target data stream can sometimes degrade forecast performance. A major contribution of this work is a publicly available database compiled for use by the infectious disease forecasting community.