The recent boom of linear forecasting models calls into question the ongoing passion for architectural modifications of Transformer-based forecasters. These forecasters leverage Transformers to model global dependencies over the temporal tokens of a time series, with each token formed from the multiple variates at the same timestamp. However, Transformers struggle to forecast series with larger lookback windows because of performance degradation and exploding computation cost. Moreover, the unified embedding of each temporal token fuses multiple variates that may have unaligned timestamps and represent distinct physical measurements, which may fail to learn variate-centric representations and can result in meaningless attention maps. In this work, we reflect on the duties for which Transformer components are competent and repurpose the Transformer architecture without modifying any of its basic components. We propose iTransformer, which simply inverts the duties of the attention mechanism and the feed-forward network. Specifically, the time points of each individual series are embedded into variate tokens, which the attention mechanism uses to capture multivariate correlations; meanwhile, the feed-forward network is applied to each variate token to learn nonlinear representations. iTransformer achieves consistent state-of-the-art performance on several real-world datasets, which further empowers the Transformer family with improved performance, generalization across different variates, and better utilization of arbitrary lookback windows, making it a strong alternative as the fundamental backbone for time series forecasting.
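The inversion described in the abstract can be illustrated with a minimal NumPy sketch of one inverted block. It is not the authors' implementation: the linear embedding and projection, single-head attention, ReLU feed-forward network, bias-free layers, post-layer normalization, and every name used here (`init_params`, `inverted_forecast`, `W_embed`, `W_out`, and so on) are assumptions made for illustration. The point it shows is only the change of axes: each variate's whole lookback series becomes one token, attention mixes information across variates, and the feed-forward network acts on each variate token separately.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def layer_norm(x, eps=1e-5):
    # Normalize each token over its feature dimension (no learned scale/shift).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)


def init_params(T, S, D, rng):
    # Hypothetical parameter set: T = lookback length, S = forecast horizon,
    # D = token dimension. No weight depends on the number of variates N.
    def w(rows, cols):
        return rng.standard_normal((rows, cols)) / np.sqrt(rows)

    return {
        "W_embed": w(T, D),                               # series -> variate token
        "W_q": w(D, D), "W_k": w(D, D), "W_v": w(D, D),   # single-head attention
        "W_1": w(D, 4 * D), "W_2": w(4 * D, D),           # feed-forward network
        "W_out": w(D, S),                                 # token -> future series
    }


def inverted_forecast(x, p):
    """x: (T, N) lookback window. Returns ((S, N) forecast, (N, N) attention map)."""
    # 1) Embed the T time points of each series into one token per variate.
    tokens = x.T @ p["W_embed"]                           # (N, D)

    # 2) Attention over variate tokens: captures multivariate correlations.
    q, k, v = tokens @ p["W_q"], tokens @ p["W_k"], tokens @ p["W_v"]
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))        # (N, N)
    tokens = layer_norm(tokens + attn @ v)

    # 3) Feed-forward network applied to each variate token independently.
    hidden = np.maximum(tokens @ p["W_1"], 0.0)           # ReLU
    tokens = layer_norm(tokens + hidden @ p["W_2"])

    # 4) Project every variate token to its S future values.
    return (tokens @ p["W_out"]).T, attn
```

Calling `inverted_forecast(x, init_params(T=96, S=24, D=16, rng=np.random.default_rng(0)))` on a lookback window `x` of shape `(96, 7)` returns a forecast of shape `(24, 7)` and a `(7, 7)` attention map over variates. Because none of the weights in this sketch depend on the number of variates, the same parameters can also be applied to a window with a different number of variates, which is one way to read the abstract's claim about generalization across variates.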