Self-attention, the core mechanism of transformers, distinguishes them from traditional neural networks and drives their outstanding performance. Toward establishing fundamental optimization principles for self-attention, we investigate the implicit bias of gradient descent (GD) in training a self-attention layer with a fixed linear decoder in binary classification. Drawing inspiration from the study of GD in linear logistic regression over separable data, recent work demonstrates that as the number of iterations $t$ approaches infinity, the key-query matrix $W_t$ converges locally (with respect to the initialization direction) to a hard-margin SVM solution $W_{mm}$. Our work enhances this result in four aspects. First, we identify non-trivial data settings for which convergence is provably global, thus shedding light on the optimization landscape. Second, we provide the first finite-time convergence rate for $W_t$ to $W_{mm}$ and quantify the rate of sparsification in the attention map. Third, through an analysis of normalized GD and the Polyak step-size, we demonstrate analytically that adaptive step-size rules can accelerate the convergence of self-attention. Fourth, we remove the restriction of prior work to a fixed linear decoder. Our results reinforce the implicit-bias perspective of self-attention and strengthen its connections to implicit bias in linear logistic regression, despite the intricate non-convex nature of the former.
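To make the setting concrete, the following is a minimal sketch of normalized GD on the key-query matrix $W$ of a single attention layer with a fixed linear decoder. Several details are assumptions made for illustration and are not stated in the abstract: the model form $f(X) = u^\top X^\top \mathrm{softmax}(X W z)$ with the first token used as the query $z$, the logistic loss, and all function names (`forward`, `loss_and_grad`, `normalized_gd`). The Polyak step-size variant is not shown.

```python
import math

def softmax(a):
    m = max(a)                               # subtract the max for numerical stability
    e = [math.exp(v - m) for v in a]
    s = sum(e)
    return [v / s for v in e]

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def forward(W, X, u):
    """One sample. X is a list of T tokens, each a list of d floats.
    Assumed model: f(X) = u^T X^T softmax(X W z), with query z = first token."""
    z = X[0]
    Wz = [dot(row, z) for row in W]
    attn = softmax([dot(x, Wz) for x in X])  # attention map over the T tokens
    scores = [dot(x, u) for x in X]          # fixed decoder u applied to each token
    return dot(attn, scores), attn, scores

def loss_and_grad(W, data, u):
    """Mean logistic loss over (X, y) pairs with y in {-1, +1}, and its gradient in W."""
    d = len(W)
    loss = 0.0
    grad = [[0.0] * d for _ in range(d)]
    for X, y in data:
        f, attn, scores = forward(W, X, u)
        margin = y * f
        loss += math.log1p(math.exp(-margin))
        dl_df = -y / (1.0 + math.exp(margin))    # derivative of log(1 + exp(-y f)) in f
        z = X[0]
        for x, a_t, s_t in zip(X, attn, scores):
            # softmax Jacobian: df/dlogit_t = a_t * (s_t - f); dlogit_t/dW = x z^T
            coef = dl_df * a_t * (s_t - f)
            for i in range(d):
                for j in range(d):
                    grad[i][j] += coef * x[i] * z[j]
    n = len(data)
    return loss / n, [[g / n for g in row] for row in grad]

def normalized_gd(W, data, u, eta=0.1, steps=100):
    """Normalized GD: W <- W - eta * grad / ||grad||_F (decoder u stays fixed)."""
    W = [row[:] for row in W]
    for _ in range(steps):
        _, G = loss_and_grad(W, data, u)
        norm = math.sqrt(sum(g * g for row in G for g in row))
        if norm == 0.0:                      # stationary point: no direction to follow
            break
        W = [[w - eta * g / norm for w, g in zip(wr, gr)]
             for wr, gr in zip(W, G)]
    return W
```

Because each update has Frobenius norm exactly `eta`, the norm of $W_t$ can keep growing even as the raw gradient vanishes, which is the intuition behind using an adaptive step-size here. Inspecting the `attn` values returned by `forward` after training is one way to observe how concentrated the attention map has become.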