Foundation models have transformed natural language processing and computer vision, and a rapidly growing literature on time-series foundation models (TSFMs) seeks to replicate this success in forecasting. While recent open-source models demonstrate the promise of TSFMs, the field lacks a comprehensive, community-accepted model evaluation framework. We see at least four major issues impeding progress toward such a framework. First, existing evaluation frameworks comprise benchmark forecasting tasks derived from often outdated datasets (e.g., M3), many of which lack clear metadata and overlap with the corpora used to pretrain TSFMs. Second, these frameworks evaluate models along a narrow set of dimensions, such as forecast horizon length or domain, but overlook core statistical properties such as non-stationarity and seasonality. Third, domain-specific models (e.g., XGBoost) are often compared unfairly, as existing frameworks do not enforce a systematic and consistent hyperparameter tuning convention for all models. Fourth, visualization tools for interpreting comparative performance are lacking. To address these issues, we introduce TempusBench, an open-source evaluation framework for TSFMs. TempusBench consists of 1) new datasets that are not included in existing TSFM pretraining corpora, 2) a set of novel benchmark tasks that go beyond existing ones, 3) a model evaluation pipeline with a standardized hyperparameter tuning protocol, and 4) a TensorBoard-based visualization interface. Our code is available on GitHub at https://github.com/Smlcrm/TempusBench, and we maintain a live leaderboard at https://benchmark.smlcrm.com/.