The evaluation of time series forecasting models is hindered by a critical lack of high-quality benchmarks, creating a potential illusion of progress. Existing datasets suffer from issues ranging from pre-training data contamination in the age of LLMs to the temporal and description leakage prevalent in early multimodal designs. To address this, we formalize the core principles of high-fidelity benchmarking, focusing on data sourcing integrity, leak-free and causally sound design, and structural clarity. We introduce Fidel-TS, a new large-scale benchmark built from the ground up on these principles by sourcing data from live APIs. Our experiments expose the flaws of prior benchmarks and the biases they introduce into model evaluation, yielding new insights into a range of existing forecasting models and LLMs across diverse evaluation tasks.