Attention-based architectures have become ubiquitous in time series forecasting tasks, including spatio-temporal forecasting (STF) and long-term time series forecasting (LTSF). Yet our understanding of why they are effective remains limited. This work proposes a new way to understand self-attention networks: we show empirically that the entire attention mechanism in the encoder can be reduced to an MLP formed by feedforward, skip-connection, and layer normalization operations for temporal and/or spatial modeling in multivariate time series forecasting. Specifically, the Q, K, and V projections, the attention score calculation, the dot product between the attention scores and V, and the final projection can be removed from attention-based networks without significantly degrading performance; the simplified network remains top-tier compared to other SOTA methods. For spatio-temporal networks, the MLP-replaces-attention network achieves a $62.579\%$ reduction in FLOPs with a performance loss of less than $2.5\%$; for LTSF, it achieves a $42.233\%$ reduction in FLOPs with a performance loss of less than $2\%$.
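The reduction described above can be sketched as follows: a minimal NumPy illustration of an attention-free encoder block built only from layer normalization, a feedforward sublayer, and a skip connection. This is a sketch under assumptions, not the paper's exact architecture; the hidden dimension, ReLU activation, and pre-norm placement are all hypothetical choices for illustration.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token's feature vector to zero mean and unit variance.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def attention_free_block(x, w1, b1, w2, b2):
    # Encoder block with no Q/K/V projections, no attention scores,
    # and no score-times-V product: LayerNorm -> feedforward -> skip.
    h = layer_norm(x)                 # pre-norm (assumed placement)
    h = np.maximum(h @ w1 + b1, 0.0)  # feedforward expansion with ReLU
    h = h @ w2 + b2                   # project back to model dimension
    return x + h                      # skip (residual) connection

# Toy shapes: 2 series, 8 time steps, model dim 4, hidden dim 16.
rng = np.random.default_rng(0)
x = rng.standard_normal((2, 8, 4))
w1 = rng.standard_normal((4, 16)) * 0.1
b1 = np.zeros(16)
w2 = rng.standard_normal((16, 4)) * 0.1
b2 = np.zeros(4)
out = attention_free_block(x, w1, b1, w2, b2)
print(out.shape)  # (2, 8, 4)
```

Because the block contains only matrix multiplications against fixed weights, its cost is linear in sequence length, whereas the removed attention-score computation scales quadratically, which is consistent with the FLOPs reductions reported above.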