Foundation models have transformed natural language processing and computer vision, and a rapidly growing literature on time-series foundation models (TSFMs) seeks to replicate this success in forecasting. While recent open-source models demonstrate the promise of TSFMs, the field lacks a comprehensive, community-accepted evaluation framework. We see at least four major issues impeding progress toward such a framework. First, current evaluation frameworks draw their benchmark forecasting tasks from often outdated datasets (e.g., M3), many of which lack clear metadata and overlap with the corpora used to pre-train TSFMs. Second, existing frameworks evaluate models along a narrow set of task dimensions, such as forecast horizon length or domain, but overlook core statistical properties of the series, such as non-stationarity and seasonality. Third, domain-specific models (e.g., XGBoost) are often compared unfairly, because existing frameworks lack a systematic and consistent hyperparameter tuning convention applied to all models. Fourth, visualization tools for interpreting comparative performance are lacking. To address these issues, we introduce TempusBench, an open-source evaluation framework for TSFMs. TempusBench consists of 1) new datasets that are not included in existing TSFM pre-training corpora, 2) a set of novel benchmark tasks that go beyond existing ones, 3) a model evaluation pipeline with a standardized hyperparameter tuning protocol, and 4) a TensorBoard-based visualization interface. Our code is available on GitHub: https://github.com/Smlcrm/TempusBench.