We study the training dynamics of gradient descent in a softmax self-attention layer trained to perform linear regression and show that a simple first-order optimization algorithm can converge to the globally optimal self-attention parameters at a geometric rate. Our analysis proceeds in two steps. First, we show that in the infinite-data limit the regression problem solved by the self-attention layer is equivalent to a nonconvex matrix factorization problem. Second, we exploit this connection to design a novel "structure-aware" variant of gradient descent which efficiently optimizes the original finite-data regression objective. Our optimization algorithm features several innovations over standard gradient descent, including a preconditioner and a regularizer that help avoid spurious stationary points, and a data-dependent spectral initialization that, with high probability, produces parameters lying near the manifold of global minima.
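The abstract does not spell out the algorithm, so the following is only a minimal illustrative sketch of the kind of procedure described: a single softmax self-attention head predicting the query label of an in-context linear-regression prompt, trained by preconditioned, regularized gradient descent from a spectral initialization. The merged key-query matrix `A`, the use of context labels as values, the Frobenius regularizer, the inverse-covariance preconditioner, and the rank-one cross-moment initialization are all assumptions made for illustration, not the paper's exact construction.

```python
# Hypothetical sketch (not the paper's algorithm): one softmax attention head
# solving in-context linear regression, trained with a preconditioned,
# regularized gradient step from a data-dependent spectral initialization.
import numpy as np

rng = np.random.default_rng(0)
d, n_ctx, n_prompts = 5, 20, 200          # input dim, context length, training prompts

def make_prompt():
    """One in-context regression task: context pairs (x_i, y_i) plus a query pair."""
    w = rng.normal(size=d)                 # task-specific regression vector
    X = rng.normal(size=(n_ctx, d))        # context inputs
    y = X @ w                              # noiseless context labels
    x_q = rng.normal(size=d)               # query input
    return X, y, x_q, float(w @ x_q)       # last entry is the query label

prompts = [make_prompt() for _ in range(n_prompts)]

def attn_predict(A, X, y, x_q):
    """Softmax-attention prediction: context labels weighted by softmax(x_q^T A x_i)."""
    scores = X @ (A.T @ x_q)               # attention logits, shape (n_ctx,)
    p = np.exp(scores - scores.max())
    p /= p.sum()
    return p @ y, p

def loss_and_grad(A, lam=1e-3):
    """Average squared prediction error plus a (hypothetical) Frobenius regularizer."""
    G, loss = np.zeros_like(A), 0.0
    for X, y, x_q, y_q in prompts:
        pred, p = attn_predict(A, X, y, x_q)
        err = pred - y_q
        loss += 0.5 * err ** 2
        # d(pred)/dA = sum_i p_i (y_i - pred) x_q x_i^T
        G += err * np.outer(x_q, (p * (y - pred)) @ X)
    return loss / n_prompts + 0.5 * lam * np.sum(A ** 2), G / n_prompts + lam * A

# Hypothetical spectral initialization: leading singular pair of an empirical
# label/input cross-moment matrix, in the spirit of a data-dependent init.
M = np.mean([y_q * np.outer(x_q, (y @ X) / n_ctx)
             for X, y, x_q, y_q in prompts], axis=0)
U, s, Vt = np.linalg.svd(M)
A = s[0] * np.outer(U[:, 0], Vt[0])

# Hypothetical preconditioner: inverse empirical input covariance on both sides.
Sigma = np.mean([X.T @ X / n_ctx for X, y, x_q, y_q in prompts], axis=0)
P = np.linalg.inv(Sigma + 1e-6 * np.eye(d))

eta = 0.1
for t in range(201):
    loss, G = loss_and_grad(A)
    A -= eta * P @ G @ P                   # preconditioned, regularized gradient step
    if t % 50 == 0:
        print(f"step {t:3d}  loss {loss:.4f}")
```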