EpiCastBench: Datasets and Benchmarks for Multivariate Epidemic Forecasting

The growing adoption of data-driven decision-making in public health has made epidemic forecasting a critical area of research. Recent multivariate forecasting models capture complex temporal dependencies better than conventional univariate approaches, which model each series independently. Despite this potential, the development of robust epidemic forecasting methods is constrained by the lack of high-quality benchmarks comprising diverse multivariate datasets across infectious diseases and geographical regions. To address this gap, we present EpiCastBench, a large-scale benchmarking framework featuring 40 curated (correlated) multivariate epidemic datasets. These publicly available datasets span a wide range of infectious diseases and differ in temporal granularity, series length, and sparsity. We analyze these datasets to identify their global features and structural patterns. To ensure reproducibility and fair comparison, we establish standardized evaluation settings, including a unified forecasting horizon, consistent preprocessing pipelines, diverse performance metrics, and statistical significance testing. Using this framework, we conduct a comprehensive evaluation of 15 multivariate forecasting models, ranging from statistical baselines to state-of-the-art deep learning and foundation models. All datasets and code are publicly available on Kaggle (https://www.kaggle.com/datasets/aimltsf/epicastbench) and GitHub (https://github.com/aimltsf/EpiCastBench).
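To make the idea of a standardized evaluation setting concrete, the following is a minimal sketch and not the EpiCastBench code. It holds out the last few time steps of a toy multivariate series, forecasts them with a naive last-value baseline, and scores the forecast with two common error metrics. The function names, the toy data, the horizon of 2, and the choice of MAE and RMSE are all illustrative assumptions; the abstract does not state which horizon, baselines, or metrics the benchmark uses.

```python
import math


def split_last_h(series, horizon):
    """Hold out the final `horizon` time steps of a multivariate series.

    `series` is a list of time steps; each time step is a list holding one
    value per variable (for example, one case count per region).
    """
    if horizon <= 0 or horizon >= len(series):
        raise ValueError("horizon must be between 1 and len(series) - 1")
    return series[:-horizon], series[-horizon:]


def naive_forecast(train, horizon):
    """Repeat the last observed vector for every step of the horizon."""
    last = train[-1]
    return [list(last) for _ in range(horizon)]


def mae(actual, predicted):
    """Mean absolute error over all time steps and variables."""
    errors = [abs(a - p)
              for row_a, row_p in zip(actual, predicted)
              for a, p in zip(row_a, row_p)]
    return sum(errors) / len(errors)


def rmse(actual, predicted):
    """Root mean squared error over all time steps and variables."""
    squared = [(a - p) ** 2
               for row_a, row_p in zip(actual, predicted)
               for a, p in zip(row_a, row_p)]
    return math.sqrt(sum(squared) / len(squared))


# Toy data (hypothetical): 6 weekly observations for 2 regions.
toy = [[10, 4], [12, 5], [15, 5], [14, 7], [18, 8], [20, 6]]
HORIZON = 2  # illustrative value, not the benchmark's setting

train, test = split_last_h(toy, HORIZON)
forecast = naive_forecast(train, HORIZON)
scores = {"MAE": mae(test, forecast), "RMSE": rmse(test, forecast)}
print(scores)
```

Fixing the split and the metric functions once and reusing them for every model is what keeps a comparison fair: each model sees the same training window and is scored on the same held-out steps.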

