We study the training dynamics of gradient descent in a softmax self-attention layer trained to perform linear regression and show that a simple first-order optimization algorithm can converge to the globally optimal self-attention parameters at a geometric rate. Our analysis proceeds in two steps. First, we show that in the infinite-data limit the regression problem solved by the self-attention layer is equivalent to a nonconvex matrix factorization problem. Second, we exploit this connection to design a novel "structure-aware" variant of gradient descent which efficiently optimizes the original finite-data regression objective. Our optimization algorithm features several innovations over standard gradient descent, including a preconditioner and a regularizer that help avoid spurious stationary points, and a data-dependent spectral initialization that, with high probability, produces parameters lying near the manifold of global minima.
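The abstract does not spell out the algorithm, so the following is only a minimal illustrative sketch of the kind of procedure described: a single softmax self-attention head predicting the query label of an in-context linear-regression prompt, trained by preconditioned, regularized gradient descent from a spectral initialization. The merged key-query matrix `A`, the use of context labels as values, the Frobenius regularizer, the inverse-covariance preconditioner, and the rank-one cross-moment initialization are all assumptions made for illustration, not the paper's exact construction.

```python
# Hypothetical sketch (not the paper's algorithm): one softmax attention head
# solving in-context linear regression, trained with a preconditioned,
# regularized gradient step from a data-dependent spectral initialization.
import numpy as np

rng = np.random.default_rng(0)
d, n_ctx, n_prompts = 5, 20, 200          # input dim, context length, training prompts

def make_prompt():
    """One in-context regression task: context pairs (x_i, y_i) plus a query pair."""
    w = rng.normal(size=d)                 # task-specific regression vector
    X = rng.normal(size=(n_ctx, d))        # context inputs
    y = X @ w                              # noiseless context labels
    x_q = rng.normal(size=d)               # query input
    return X, y, x_q, float(w @ x_q)       # last entry is the query label

prompts = [make_prompt() for _ in range(n_prompts)]

def attn_predict(A, X, y, x_q):
    """Softmax-attention prediction: context labels weighted by softmax(x_q^T A x_i)."""
    scores = X @ (A.T @ x_q)               # attention logits, shape (n_ctx,)
    p = np.exp(scores - scores.max())
    p /= p.sum()
    return p @ y, p

def loss_and_grad(A, lam=1e-3):
    """Average squared prediction error plus a (hypothetical) Frobenius regularizer."""
    G, loss = np.zeros_like(A), 0.0
    for X, y, x_q, y_q in prompts:
        pred, p = attn_predict(A, X, y, x_q)
        err = pred - y_q
        loss += 0.5 * err ** 2
        # d(pred)/dA = sum_i p_i (y_i - pred) x_q x_i^T
        G += err * np.outer(x_q, (p * (y - pred)) @ X)
    return loss / n_prompts + 0.5 * lam * np.sum(A ** 2), G / n_prompts + lam * A

# Hypothetical spectral initialization: leading singular pair of an empirical
# label/input cross-moment matrix, in the spirit of a data-dependent init.
M = np.mean([y_q * np.outer(x_q, (y @ X) / n_ctx)
             for X, y, x_q, y_q in prompts], axis=0)
U, s, Vt = np.linalg.svd(M)
A = s[0] * np.outer(U[:, 0], Vt[0])

# Hypothetical preconditioner: inverse empirical input covariance on both sides.
Sigma = np.mean([X.T @ X / n_ctx for X, y, x_q, y_q in prompts], axis=0)
P = np.linalg.inv(Sigma + 1e-6 * np.eye(d))

eta = 0.1
for t in range(201):
    loss, G = loss_and_grad(A)
    A -= eta * P @ G @ P                   # preconditioned, regularized gradient step
    if t % 50 == 0:
        print(f"step {t:3d}  loss {loss:.4f}")
```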