TempusBench: An Evaluation Framework for Time-Series Forecasting

Denizalp Goktas, Gerardo Riaño-Briceño, Alif Abdullah, Aryan Nair, Chenkai Shen, Beatriz de Lucio, Alexandra Magnusson, Farhan Mashrur, Ahmed Abdulla, Shawrna Sen, Mahitha Thippireddy, Gregory Schwartz, Amy Greenwald

Foundation models have transformed natural language processing and computer vision, and a rapidly growing literature on time-series foundation models (TSFMs) seeks to replicate this success in forecasting. While recent open-source models demonstrate the promise of TSFMs, the field lacks a comprehensive, community-accepted model evaluation framework. We see at least four major issues impeding progress toward such a framework. First, current evaluation frameworks consist of benchmark forecasting tasks derived from often outdated datasets (e.g., M3), many of which lack clear metadata and overlap with the corpora used to pretrain TSFMs. Second, existing frameworks evaluate models along a narrow set of task dimensions, such as forecast horizon length or domain, but overlook core statistical properties such as non-stationarity and seasonality. Third, domain-specific models (e.g., XGBoost) are often compared unfairly, because existing frameworks lack a systematic and consistent hyperparameter tuning convention for all models. Fourth, visualization tools for interpreting comparative performance are lacking. To address these issues, we introduce TempusBench, an open-source evaluation framework for TSFMs. TempusBench consists of (1) new datasets that are not included in existing TSFM pretraining corpora, (2) a set of novel benchmark tasks that go beyond existing ones, (3) a model evaluation pipeline with a standardized hyperparameter tuning protocol, and (4) a TensorBoard-based visualization interface. Our code is available on GitHub: https://github.com/Smlcrm/TempusBench.
