It's TIME: Towards the Next Generation of Time Series Forecasting Benchmarks

Time series foundation models (TSFMs) are shifting the forecasting landscape from dataset-specific modeling to generalizable task evaluation. However, we contend that existing benchmarks share limitations along four dimensions: constrained data composition dominated by reused legacy sources, compromised data integrity lacking rigorous quality assurance, misaligned task formulations detached from real-world contexts, and rigid analysis perspectives that obscure generalizable insights. To bridge these gaps, we introduce TIME, a next-generation task-centric benchmark comprising 50 fresh datasets and 98 forecasting tasks, tailored for strict zero-shot TSFM evaluation free from data leakage. Integrating large language models and human expertise, we establish a rigorous human-in-the-loop benchmark construction pipeline to ensure high data integrity, and we redefine task formulation by aligning forecasting configurations with real-world operational requirements and variate predictability. Furthermore, we propose a novel pattern-level evaluation perspective that moves beyond traditional dataset-level evaluations based on static meta labels. By leveraging structural time series features to characterize intrinsic temporal properties, this approach offers generalizable insights into model capabilities across diverse patterns. We evaluate 12 representative TSFMs and establish a multi-granular leaderboard to facilitate in-depth analysis and visual inspection. The leaderboard is available at https://huggingface.co/spaces/Real-TSF/TIME-leaderboard.
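
To make the pattern-level perspective concrete, the following is a minimal sketch of how per-series errors might be grouped by structural features rather than by dataset labels. It is illustrative only: the abstract does not specify which features or decomposition TIME uses, so the trend and seasonal strength measures here (a simplified classical decomposition scored as 1 - Var(remainder) / Var(component + remainder)), the function names `strengths` and `pattern_level_scores`, and the 0.6 threshold are all assumptions.

```python
from collections import defaultdict
from statistics import mean, pvariance


def moving_average(x, window):
    """Centered moving average; windows shrink at the series edges."""
    half = window // 2
    out = []
    for i in range(len(x)):
        lo, hi = max(0, i - half), min(len(x), i + half + 1)
        out.append(mean(x[lo:hi]))
    return out


def strengths(x, period):
    """Return (trend_strength, seasonal_strength), each in [0, 1].

    Assumption: a simplified classical decomposition stands in for
    whatever feature extraction the benchmark actually uses.
    """
    trend = moving_average(x, period)
    detrended = [a - t for a, t in zip(x, trend)]
    # Seasonal component: mean detrended value at each phase of the period.
    phase_means = [mean(detrended[p::period]) for p in range(period)]
    seasonal = [phase_means[i % period] for i in range(len(x))]
    remainder = [d - s for d, s in zip(detrended, seasonal)]

    def strength(component):
        denom = pvariance([c + r for c, r in zip(component, remainder)])
        if denom == 0:
            return 0.0
        return max(0.0, 1.0 - pvariance(remainder) / denom)

    return strength(trend), strength(seasonal)


def pattern_level_scores(series_by_id, errors_by_id, period, threshold=0.6):
    """Average per-series errors within each (trend, seasonality) bucket.

    The threshold is a hypothetical cut-off chosen for illustration.
    """
    buckets = defaultdict(list)
    for sid, x in series_by_id.items():
        t, s = strengths(x, period)
        key = (
            "strong-trend" if t >= threshold else "weak-trend",
            "strong-season" if s >= threshold else "weak-season",
        )
        buckets[key].append(errors_by_id[sid])
    return {key: mean(vals) for key, vals in buckets.items()}
```

Given a mapping from series IDs to values and a mapping from the same IDs to a model's forecast errors, `pattern_level_scores` reports one averaged error per pattern bucket, so two series from different datasets that share the same temporal structure are scored together.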
