TempusBench: An Evaluation Framework for Time-Series Forecasting

Denizalp Goktas, Gerardo Riaño-Briceño, Alif Abdullah, Aryan Nair, Chenkai Shen, Beatriz de Lucio, Alexandra Magnusson, Farhan Mashrur, Ahmed Abdulla, Shawrna Sen, Mahitha Thippireddy, Gregory Schwartz, Amy Greenwald

Foundation models have transformed natural language processing and computer vision, and a rapidly growing literature on time-series foundation models (TSFMs) seeks to replicate this success in forecasting. While recent open-source models demonstrate the promise of TSFMs, the field lacks a comprehensive and community-accepted model evaluation framework. We see at least four major issues impeding progress on the development of such a framework. First, existing evaluation frameworks comprise benchmark forecasting tasks derived from often outdated datasets (e.g., M3), many of which lack clear metadata and overlap with the corpora used to pre-train TSFMs. Second, these frameworks evaluate models along a narrow set of dimensions, such as forecast horizon length or domain, but overlook core statistical properties such as non-stationarity and seasonality. Third, domain-specific models (e.g., XGBoost) are often compared unfairly, as existing frameworks do not enforce a systematic and consistent hyperparameter tuning convention for all models. Fourth, visualization tools for interpreting comparative performance are lacking. To address these issues, we introduce TempusBench, an open-source evaluation framework for TSFMs. TempusBench consists of 1) new datasets that are not included in existing TSFM pre-training corpora, 2) a set of novel benchmark tasks that go beyond existing ones, 3) a model evaluation pipeline with a standardized hyperparameter tuning protocol, and 4) a TensorBoard-based visualization interface. Our code is available on GitHub at https://github.com/Smlcrm/TempusBench, and we maintain a live leaderboard at https://benchmark.smlcrm.com/.
