Accurate evaluation of forecasting models is essential for ensuring reliable predictions. Current practices for evaluating and comparing forecasting models focus on summarising performance into a single score, using metrics such as SMAPE. We hypothesize that averaging performance over all samples dilutes relevant information about the relative performance of models, particularly under conditions in which this relative performance differs from the overall accuracy. We address this limitation by proposing a novel framework for evaluating univariate time series forecasting models from multiple perspectives, such as one-step-ahead versus multi-step-ahead forecasting. We show the advantages of this framework by comparing a state-of-the-art deep learning approach with classical forecasting techniques. While classical methods (e.g. ARIMA) are long-standing approaches to forecasting, deep neural networks (e.g. NHITS) have recently shown state-of-the-art forecasting performance on benchmark datasets. We conducted extensive experiments showing that NHITS generally performs best, but that its superiority varies with forecasting conditions. For instance, concerning the forecasting horizon, NHITS only outperforms classical approaches in multi-step-ahead forecasting. Another relevant insight is that, when dealing with anomalies, NHITS is outperformed by methods such as Theta. These findings highlight the importance of aspect-based model evaluation.
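To illustrate how a single averaged score can hide horizon-dependent differences, the following is a minimal sketch, not the framework proposed here. It assumes one common SMAPE variant (in percent, on a 0 to 200 scale) and uses made-up toy forecasts; the names `smape`, `smape_by_horizon`, `model_a`, and `model_b` are hypothetical and chosen only for illustration.

```python
import numpy as np


def smape(actual, forecast):
    """SMAPE in percent (one common variant, bounded by 0 and 200)."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    denom = np.abs(actual) + np.abs(forecast)
    # Assumption: a 0/0 term counts as zero error instead of raising a warning.
    ratio = np.divide(
        2.0 * np.abs(actual - forecast),
        denom,
        out=np.zeros_like(denom),
        where=denom > 0,
    )
    return 100.0 * ratio.mean()


def smape_by_horizon(actual, forecast):
    """SMAPE per forecasting step; inputs have shape (n_windows, horizon)."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return np.array(
        [smape(actual[:, h], forecast[:, h]) for h in range(actual.shape[1])]
    )


# Toy data (invented for illustration): two forecast windows, horizon of 3.
actual = np.full((2, 3), 100.0)
# Hypothetical model A: exact at step 1, degrading at later steps.
model_a = np.array([[100.0, 90.0, 80.0], [100.0, 90.0, 80.0]])
# Hypothetical model B: worse at step 1, better at later steps.
model_b = np.array([[90.0, 95.0, 95.0], [90.0, 95.0, 95.0]])

print("overall SMAPE  A:", smape(actual, model_a), " B:", smape(actual, model_b))
print("per-horizon    A:", smape_by_horizon(actual, model_a))
print("per-horizon    B:", smape_by_horizon(actual, model_b))
```

In this toy setting, model B has the lower overall SMAPE, yet model A is more accurate at the first step. The averaged score alone would not reveal that reversal, which is the kind of information an aspect-based breakdown is meant to surface.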