The attention mechanism is a central component of the transformer architecture, which has led to the phenomenal success of large language models. However, the theoretical principles underlying the attention mechanism are poorly understood, especially its nonconvex optimization dynamics. In this work, we explore the seminal softmax-attention model $f(\boldsymbol{X})=\langle \boldsymbol{Xv}, \texttt{softmax}(\boldsymbol{XWp})\rangle$, where $\boldsymbol{X}$ is the token sequence and $(\boldsymbol{v},\boldsymbol{W},\boldsymbol{p})$ are tunable parameters. We prove that running gradient descent on $\boldsymbol{p}$, or equivalently $\boldsymbol{W}$, converges in direction to a max-margin solution that separates $\textit{locally-optimal}$ tokens from non-optimal ones. This formalizes attention as a token separation mechanism. Remarkably, our results apply to general data and precisely characterize the $\textit{optimality}$ of tokens in terms of the value embeddings $\boldsymbol{Xv}$ and the problem geometry. We also provide a broader regularization path analysis that establishes the margin-maximizing nature of attention even for nonlinear prediction heads. When optimizing $\boldsymbol{v}$ and $\boldsymbol{p}$ simultaneously with logistic loss, we identify conditions under which the regularization paths directionally converge to their respective hard-margin SVM solutions, where $\boldsymbol{v}$ separates the input features based on their labels. Interestingly, the SVM formulation of $\boldsymbol{p}$ is influenced by the support vector geometry of $\boldsymbol{v}$. Finally, we verify our theoretical findings via numerical experiments and provide insights.
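To make the model concrete, the following is a minimal NumPy sketch of $f(\boldsymbol{X})=\langle \boldsymbol{Xv}, \texttt{softmax}(\boldsymbol{XWp})\rangle$ and of plain gradient descent on $\boldsymbol{p}$ with $\boldsymbol{v}$ and $\boldsymbol{W}$ held fixed. Several choices here are assumptions made only for illustration and are not taken from the abstract: a single sequence with a label $y\in\{-1,+1\}$, the logistic loss $\log(1+e^{-y f(\boldsymbol{X})})$, zero initialization, a constant step size, and the helper names `attention_output`, `grad_p`, and `run_gd`.

```python
import numpy as np


def softmax(a):
    """Softmax over a 1-D score vector."""
    a = a - np.max(a)  # shift by the max for numerical stability
    e = np.exp(a)
    return e / e.sum()


def attention_output(X, v, W, p):
    """f(X) = <Xv, softmax(XWp)> for one token sequence X of shape (T, d)."""
    gamma = X @ v               # value-embedding score of each token, shape (T,)
    probs = softmax(X @ W @ p)  # attention weights over the T tokens
    return gamma @ probs


def grad_p(X, v, W, p, y):
    """Gradient of log(1 + exp(-y * f(X))) with respect to p (assumed loss)."""
    K = X @ W
    gamma = X @ v
    s = softmax(K @ p)
    f = gamma @ s
    # df/dp = K^T (diag(s) - s s^T) gamma, written without forming the matrix
    df_dp = K.T @ (s * gamma - s * (s @ gamma))
    # derivative of log(1 + exp(-z)) at z = y * f is -1 / (1 + exp(z))
    dloss = -1.0 / (1.0 + np.exp(y * f))
    return dloss * y * df_dp


def run_gd(X, v, W, y, steps=500, lr=0.1):
    """Plain gradient descent on p from zero initialization (illustrative settings)."""
    p = np.zeros(W.shape[1])
    for _ in range(steps):
        p = p - lr * grad_p(X, v, W, p, y)
    return p


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 3))  # hypothetical data: T = 5 tokens, d = 3
    v = rng.normal(size=3)
    W = np.eye(3)
    p = run_gd(X, v, W, y=1.0)
    print("norm of p:", np.linalg.norm(p))
    print("attention weights:", softmax(X @ W @ p))
    print("token scores Xv:", X @ v)
```

Printing the attention weights next to the token scores $\boldsymbol{Xv}$ gives an informal way to see which tokens the learned $\boldsymbol{p}$ favors. This is a toy illustration, not a reproduction of the paper's experiments.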