Are We Winning the Wrong Game? Revisiting Evaluation Practices for Long-Term Time Series Forecasting

Long-term time series forecasting (LTSF) is widely recognized as a central challenge in data mining and machine learning. LTSF has increasingly evolved into a benchmark-driven "GAME," where models are ranked, compared, and declared state-of-the-art based primarily on marginal reductions in aggregated pointwise error metrics such as MSE and MAE. Across a small set of canonical datasets and fixed forecasting horizons, progress is communicated through leaderboard-style tables in which lower numerical scores define success. In this GAME, what is measured becomes what is optimized, and incremental error reduction becomes the dominant currency of advancement. We argue that this metric-centric regime is not merely incomplete, but structurally misaligned with the broader objectives of forecasting. In real-world settings, forecasting often prioritizes preserving temporal structure, trend stability, and seasonal coherence, remaining robust to regime shifts, and supporting downstream decision processes. Optimizing aggregate pointwise error does not necessarily imply modeling these structural properties. As a result, leaderboard improvement may increasingly reflect specialization in benchmark configurations rather than a deeper understanding of temporal dynamics. This paper revisits LTSF evaluation as a foundational question in data science: what does it mean to measure forecasting progress? We propose a multi-dimensional evaluation perspective that integrates statistical fidelity, structural coherence, and decision-level relevance. By challenging the current metric monoculture, we aim to redirect attention from winning benchmark tables toward advancing meaningful, context-aware forecasting.
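The claim that a lower aggregate pointwise error need not reflect better structural modeling can be illustrated with a small synthetic sketch. Everything here is an assumption made for illustration and is not taken from the paper: the sinusoidal series, the two hypothetical forecasts, and the helper `seasonal_amplitude`, which stands in for a structural score. It is not the evaluation framework the paper proposes.

```python
import numpy as np


def mse(y_true, y_pred):
    """Mean squared error, the aggregate pointwise metric discussed above."""
    return float(np.mean((y_true - y_pred) ** 2))


def seasonal_amplitude(y, period):
    """Amplitude of the Fourier component at `period`.

    Hypothetical structural score for this sketch: about 1.0 for a unit
    sinusoid with that period, 0.0 for a series with no such component.
    """
    n = len(y)
    t = np.arange(n)
    component = np.sum((y - y.mean()) * np.exp(-2j * np.pi * t / period))
    return float(2 * np.abs(component) / n)


period = 24
t = np.arange(period * 30)  # 30 complete seasonal cycles
y_true = np.sin(2 * np.pi * t / period)

# Forecast A: flat line at the series mean, no seasonal structure at all.
flat = np.zeros_like(y_true)
# Forecast B: correct seasonal shape, but a quarter period (6 steps) late.
shifted = np.sin(2 * np.pi * (t - 6) / period)

results = {
    "flat": (mse(y_true, flat), seasonal_amplitude(flat, period)),
    "shifted": (mse(y_true, shifted), seasonal_amplitude(shifted, period)),
}

for name, (err, amp) in results.items():
    print(f"{name:8s} MSE={err:.3f} seasonal amplitude={amp:.3f}")
```

In this toy setting the flat forecast scores an MSE of about 0.5 against about 1.0 for the phase-shifted one, so an MSE leaderboard ranks it first even though it carries no seasonal signal, while the shifted forecast reproduces the seasonal amplitude of the true series. Whether the phase error or the missing seasonality matters more depends on the downstream decision, which is the kind of context a single aggregate number cannot express.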
