Time Series Foundation Models (TSFMs) represent a new paradigm for time-series forecasting, promising zero-shot predictions without the need for task-specific training or fine-tuning. However, similar to Large Language Models (LLMs), the evaluation of TSFMs is challenging: as training corpora grow increasingly large, it becomes difficult to ensure the integrity of the test sets used for benchmarking. Our investigation of existing TSFM evaluation studies identifies two kinds of information leakage: (1) train-test sample overlaps arising from the multi-purpose reuse of datasets and (2) temporal overlap of correlated train and test series. Ignoring these forms of information leakage when benchmarking TSFMs risks producing overly optimistic performance estimates that fail to generalize to real-world settings. We therefore argue for the development of novel evaluation methodologies that avoid pitfalls already observed in both LLM and classical time-series benchmarking, and we call on the research community to adopt principled approaches to safeguard the integrity of TSFM evaluation.
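The two leakage mechanisms above can be made concrete with a small check. The sketch below is illustrative only and is not from the paper: it assumes hypothetical train/test DataFrames with `series_id`, `timestamp`, and `value` columns, flags (1) identical samples appearing in both splits and (2) observation windows that intersect in time.

```python
import pandas as pd

def find_leakage(train: pd.DataFrame, test: pd.DataFrame) -> dict:
    """Flag two kinds of train-test leakage in time-series benchmarks.

    Assumes (hypothetically) that both frames have columns
    'series_id', 'timestamp', and 'value'.
    """
    # (1) Sample overlap: identical (series_id, timestamp, value) rows in
    #     both splits, e.g. from multi-purpose reuse of the same dataset.
    overlap = train.merge(
        test, on=["series_id", "timestamp", "value"], how="inner"
    )

    # (2) Temporal overlap: the train and test observation windows
    #     intersect in time, so correlated series can leak information
    #     across the split even when no individual sample is shared.
    train_span = (train["timestamp"].min(), train["timestamp"].max())
    test_span = (test["timestamp"].min(), test["timestamp"].max())
    temporal_overlap = bool(
        train_span[0] <= test_span[1] and test_span[0] <= train_span[1]
    )

    return {
        "sample_overlap_rows": len(overlap),
        "temporal_overlap": temporal_overlap,
    }
```

In a real audit, the temporal check would be refined per series pair (e.g. restricted to series known to be correlated); the window-intersection test shown here is the coarsest version of that idea.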