Primal-Attention: Self-attention through Asymmetric Kernel SVD in Primal Representation

From arXiv: we provide a primal-dual representation for the asymmetric self-attention in Transformers that avoids explicit computation of the kernel matrix.

Recently, a new line of works has emerged to understand and improve self-attention in Transformers by treating it as a kernel machine. However, existing works apply the methods for symmetric kernels to the asymmetric self-attention, resulting in a nontrivial gap between the analytical understanding and numerical implementation. In this paper, we provide a new perspective to represent and optimize self-attention through asymmetric Kernel Singular Value Decomposition (KSVD), which is also motivated by the low-rank property of self-attention normally observed in deep layers. Through asymmetric KSVD, $i$) a primal-dual representation of self-attention is formulated, where the optimization objective is cast to maximize the projection variances in the attention outputs; $ii$) a novel attention mechanism, i.e., Primal-Attention, is proposed via the primal representation of KSVD, avoiding explicit computation of the kernel matrix in the dual; $iii$) with KKT conditions, we prove that the stationary solution to the KSVD optimization in Primal-Attention yields a zero-value objective. In this manner, KSVD optimization can be implemented by simply minimizing a regularization loss, so that low-rank property is promoted without extra decomposition. Numerical experiments show state-of-the-art performance of our Primal-Attention with improved efficiency. Moreover, we demonstrate that the deployed KSVD optimization regularizes Primal-Attention with a sharper singular value decay than that of the canonical self-attention, further verifying the great potential of our method. To the best of our knowledge, this is the first work that provides a primal-dual representation for the asymmetric kernel in self-attention and successfully applies it to modeling and optimization.

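Translation (from the Chinese): Recently, a new line of research has sought to understand and improve the self-attention mechanism in Transformers by treating it as a kernel machine. However, existing methods apply algorithms designed for symmetric kernels directly to the asymmetric self-attention, leaving a significant gap between theoretical analysis and numerical implementation. This paper proposes a new perspective for representing and optimizing self-attention through asymmetric Kernel Singular Value Decomposition (KSVD), motivated by the low-rank property that self-attention usually exhibits in deep layers. Based on asymmetric KSVD: i) a primal-dual representation of self-attention is constructed, with the optimization objective cast as maximizing the projection variances in the attention outputs; ii) a new attention mechanism, Primal-Attention, is proposed via the primal representation of KSVD, avoiding explicit computation of the kernel matrix in the dual; iii) using the KKT conditions, it is proven that the stationary solution of the KSVD optimization in Primal-Attention yields a zero-value objective. As a result, KSVD optimization can be carried out simply by minimizing a regularization loss, which promotes the low-rank property without any extra decomposition. Numerical experiments show that Primal-Attention reaches state-of-the-art performance with improved efficiency. In addition, the authors show that the deployed KSVD optimization gives Primal-Attention a sharper singular value decay than canonical self-attention, further confirming the potential of the method. To the authors' knowledge, this is the first work to provide a primal-dual representation for the asymmetric kernel in self-attention and to apply it successfully to modeling and optimization.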
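To make the primal-versus-dual idea concrete, here is a minimal NumPy sketch, not the authors' implementation. It assumes finite-dimensional feature maps, so the asymmetric kernel matrix factors as `G = Phi_q @ Phi_k.T`. It then checks numerically that the projections obtained from the dual (via the SVD of the N x N matrix `G`) match those obtained in the primal from two small weight matrices, without forming `G`. All names (`feature_maps`, `dual_projections`, `primal_projections`, `W_e`, `W_r`, `H_e`, `H_r`) and the row-normalised linear feature maps are illustrative assumptions. In the paper the primal weights are learned end to end with a regularization loss; here they are read off from the SVD only to verify the equivalence.

```python
import numpy as np


def feature_maps(X, Wq, Wk):
    """Hypothetical feature maps: row-normalised linear projections.

    These stand in for the query-side and key-side feature maps; the
    actual maps used in the paper are not specified in this note.
    """
    Q = X @ Wq
    K = X @ Wk
    Phi_q = Q / np.linalg.norm(Q, axis=1, keepdims=True)
    Phi_k = K / np.linalg.norm(K, axis=1, keepdims=True)
    return Phi_q, Phi_k


def dual_projections(Phi_q, Phi_k, s):
    """Dual route: build the N x N asymmetric kernel matrix and take its SVD."""
    G = Phi_q @ Phi_k.T                 # G[i, j] = <phi_q(x_i), phi_k(x_j)>
    U, sig, Vt = np.linalg.svd(G)
    H_e, H_r = U[:, :s], Vt[:s].T       # top-s left / right singular vectors
    E = G @ H_r                         # projections tied to the query side
    R = G.T @ H_e                       # projections tied to the key side
    return E, R, H_e, H_r, sig


def primal_projections(Phi_q, Phi_k, W_e, W_r):
    """Primal route: only p x s weights are needed, no N x N matrix."""
    return Phi_q @ W_e, Phi_k @ W_r


# Small demonstration on random data (sizes are arbitrary assumptions).
rng = np.random.default_rng(0)
N, d, p, s = 64, 16, 8, 4
X = rng.standard_normal((N, d))
Wq = rng.standard_normal((d, p))
Wk = rng.standard_normal((d, p))

Phi_q, Phi_k = feature_maps(X, Wq, Wk)
E_dual, R_dual, H_e, H_r, sig = dual_projections(Phi_q, Phi_k, s)

# Primal weights implied by the dual solution.
W_e = Phi_k.T @ H_r                     # shape (p, s)
W_r = Phi_q.T @ H_e                     # shape (p, s)
E_primal, R_primal = primal_projections(Phi_q, Phi_k, W_e, W_r)

print("max |E_dual - E_primal| =", np.abs(E_dual - E_primal).max())
print("max |R_dual - R_primal| =", np.abs(R_dual - R_primal).max())
print("leading singular values:", sig[:p])
```

The primal route costs O(Nps) for the projections instead of the O(N^2) needed to build `G`, which illustrates the efficiency argument in the abstract. Because `G` here has rank at most `p`, its singular values beyond index `p` are numerically zero, a toy analogue of the low-rank property the abstract mentions.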