Time Series Foundation Models (TSFMs) represent a new paradigm for time-series forecasting, promising zero-shot predictions without the need for task-specific training or fine-tuning. However, similar to Large Language Models (LLMs), the evaluation of TSFMs is challenging: as training corpora grow increasingly large, it becomes difficult to ensure the integrity of the test sets used for benchmarking. An investigation of existing TSFM evaluation studies identifies two kinds of information leakage: (1) train-test sample overlaps arising from the multi-purpose reuse of datasets and (2) temporal overlap of correlated train and test series. Ignoring these forms of information leakage when benchmarking TSFMs risks producing overly optimistic performance estimates that fail to generalize to real-world settings. We therefore argue for the development of novel evaluation methodologies that avoid pitfalls already observed in both LLM and classical time-series benchmarking, and we call on the research community to adopt principled approaches to safeguard the integrity of TSFM evaluation.
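The two leakage modes described above can be made concrete with simple checks. The sketch below (hypothetical helper names, illustrative thresholds) flags (1) test windows that appear verbatim in the training data and (2) temporal overlap between a test period and the training period of a correlated series; real benchmark audits would need far more care with normalization, approximate matches, and correlation estimation.

```python
# Minimal sketch of the two leakage checks, under simplifying assumptions:
# exact window matches only, and a fixed correlation threshold of 0.8.
import numpy as np


def sample_overlap(train: np.ndarray, test: np.ndarray, window: int = 8) -> int:
    """Count length-`window` test windows that appear verbatim in `train`
    (leakage kind 1: train-test sample overlap from dataset reuse)."""
    train_windows = {tuple(train[i:i + window])
                     for i in range(len(train) - window + 1)}
    return sum(tuple(test[i:i + window]) in train_windows
               for i in range(len(test) - window + 1))


def temporal_overlap(train_span: tuple, test_span: tuple,
                     corr: float, corr_threshold: float = 0.8) -> bool:
    """Flag leakage kind 2: a correlated training series whose time span
    intersects the test period. Spans are (start, end) index pairs."""
    spans_intersect = (train_span[0] <= test_span[1]
                       and test_span[0] <= train_span[1])
    return spans_intersect and abs(corr) >= corr_threshold


# Toy example: a deliberately overlapping train/test split of one series.
series = np.arange(1.0, 11.0)
train, test = series[:8], series[2:]
print(sample_overlap(train, test, window=4))        # → 3 shared windows
print(temporal_overlap((0, 7), (2, 9), corr=0.95))  # → True
```

Even this toy split shows how easily leaked windows slip into a test set when the same series feeds both pretraining corpora and benchmarks, which is why audits of this kind belong in any principled TSFM evaluation protocol.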