We study the fundamental optimization principles of self-attention, the defining mechanism of transformers, by analyzing the implicit bias of gradient-based optimizers in training a self-attention layer with a linear decoder for binary classification. Building on prior studies of linear logistic regression, recent work demonstrates that the key-query matrix $W_t$ obtained by gradient descent (GD) converges in direction to $W_{mm}$, the matrix that maximizes the margin between optimal and non-optimal tokens across sequences. However, this convergence is local, depends on the initialization, holds only asymptotically as the number of iterations grows, and leaves open the question of whether adaptive step-size rules offer any benefit. To bridge this gap, we first establish scenarios in which convergence is provably \emph{global}. We then analyze two adaptive step-size strategies, normalized GD and the Polyak step-size, proving \emph{finite-time} rates for the directional convergence of $W_t$ to $W_{mm}$ and quantifying the sparsification rate of the attention map. These findings not only show that such strategies can accelerate parameter convergence over standard GD in a non-convex setting, but also deepen the understanding of the implicit bias of self-attention, linking it more closely to the phenomena observed in linear logistic regression despite its intricate non-convex nature.
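The two adaptive step-size rules named above can be illustrated in the simpler setting the abstract draws its analogy from: linearly separable logistic regression, where GD is likewise known to converge in direction to the max-margin solution. The sketch below is not the paper's attention model; the data, step sizes, and iteration counts are all illustrative assumptions. It contrasts normalized GD (a fixed-length step along $-\nabla L / \|\nabla L\|$) with the Polyak step-size $\eta_t = (L(w_t) - L^\star)/\|\nabla L(w_t)\|^2$, using the known infimum $L^\star = 0$ of the logistic loss on separable data.

```python
import numpy as np

# Toy separable data (illustrative, not from the paper): labels given by a
# linear separator w* = (1, -2), so the logistic loss has infimum 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = np.sign(X @ np.array([1.0, -2.0]))

def loss_and_grad(w):
    """Average logistic loss and its gradient at w."""
    margins = y * (X @ w)
    loss = np.mean(np.log1p(np.exp(-margins)))
    # d/dw log(1 + exp(-m)) = -sigmoid(-m) * y_i * x_i
    p = 1.0 / (1.0 + np.exp(margins))          # sigmoid(-margin)
    grad = -(X.T @ (y * p)) / len(y)
    return loss, grad

def normalized_gd(w, eta=0.1, steps=100):
    # Normalized GD: unit-length step, so progress does not stall as the
    # gradient vanishes on separable data.
    for _ in range(steps):
        _, g = loss_and_grad(w)
        w = w - eta * g / (np.linalg.norm(g) + 1e-12)
    return w

def polyak_gd(w, steps=100, loss_star=0.0):
    # Polyak step-size: eta_t = (L(w_t) - L*) / ||grad||^2, with L* = 0
    # for the separable logistic loss.
    for _ in range(steps):
        l, g = loss_and_grad(w)
        w = w - (l - loss_star) / (np.linalg.norm(g) ** 2 + 1e-12) * g
    return w

w0 = np.zeros(2)
for name, w in [("normalized GD", normalized_gd(w0)),
                ("Polyak step-size", polyak_gd(w0))]:
    print(name, "direction:", np.round(w / np.linalg.norm(w), 3))
```

Both runs print directions close to the max-margin separator; only the direction of $w$ is meaningful here, since on separable data the norm grows without bound while the loss tends to its infimum. The paper's contribution is establishing analogous finite-time directional rates for the non-convex attention objective.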