Tensor Attention, a multi-view attention mechanism that captures high-order correlations among multiple modalities, can overcome the representational limitations of classical matrix attention. However, the $O(n^3)$ time complexity of tensor attention, where $n$ is the input sequence length, poses a significant obstacle to its use in transformers. In this work, we prove that the backward gradient of tensor attention training can be computed in almost linear time $n^{1+o(1)}$, matching the complexity of its forward computation under the bounded-entries assumption. We provide a closed-form solution for the gradient and propose a fast computation method based on polynomial approximation and tensor algebraic techniques. Furthermore, we prove the necessity and tightness of our assumption through hardness analysis, showing that slightly weakening it renders the gradient problem unsolvable in truly subcubic time. Our theoretical results establish the feasibility of efficient higher-order transformer training and may facilitate practical applications of tensor attention architectures.
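To make the $O(n^3)$ bottleneck concrete, the following is a minimal NumPy sketch of a naive tensor attention forward pass, assuming the common third-order formulation in which attention scores range over pairs of key positions. The pairwise combination used here (row-wise Hadamard products of $K_1, K_2$ and of $V_1, V_2$) and the function name are assumptions of this sketch, not the paper's exact parameterization; the point is only that materializing the $n \times n^2$ score matrix already costs cubic time, which is what the almost-linear-time result avoids.

```python
import numpy as np

def naive_tensor_attention(Q, K1, K2, V1, V2):
    """Hypothetical naive forward pass of (one head of) tensor attention.

    Sketch under the assumption that scores are taken over *pairs* of key
    positions (j, k), so the score matrix has n^2 columns and the forward
    pass costs O(n^3) time for sequence length n.
    """
    n, d = Q.shape
    # Pairwise keys/values: row (j, k) is an elementwise (Hadamard-style)
    # combination of rows j and k, giving shape (n^2, d). The exact
    # pairwise map is an assumption of this sketch.
    K_pair = (K1[:, None, :] * K2[None, :, :]).reshape(n * n, d)
    V_pair = (V1[:, None, :] * V2[None, :, :]).reshape(n * n, d)

    # Score matrix A in R^{n x n^2}: n^3 entries to materialize naively.
    A = np.exp(Q @ K_pair.T / np.sqrt(d))
    D = A.sum(axis=1, keepdims=True)   # row-wise normalization
    return (A / D) @ V_pair            # output in R^{n x d}

# Tiny usage example (n = 8, d = 4).
rng = np.random.default_rng(0)
n, d = 8, 4
Q, K1, K2, V1, V2 = (rng.standard_normal((n, d)) for _ in range(5))
out = naive_tensor_attention(Q, K1, K2, V1, V2)
print(out.shape)  # (8, 4)
```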