Max-Margin Token Selection in Attention Mechanism

The attention mechanism is a central component of the transformer architecture, which led to the phenomenal success of large language models. However, the theoretical principles underlying the attention mechanism are poorly understood, especially its nonconvex optimization dynamics. In this work, we explore the seminal softmax-attention model $f(\boldsymbol{X})=\langle \boldsymbol{Xv}, \texttt{softmax}(\boldsymbol{XWp})\rangle$, where $\boldsymbol{X}$ is the token sequence and $(\boldsymbol{v},\boldsymbol{W},\boldsymbol{p})$ are trainable parameters. We prove that running gradient descent on $\boldsymbol{p}$, or equivalently $\boldsymbol{W}$, converges in direction to a max-margin solution that separates $\textit{locally-optimal}$ tokens from non-optimal ones. This formalizes attention as an optimal token selection mechanism. Remarkably, our results are applicable to general data and precisely characterize the $\textit{optimality}$ of tokens in terms of the value embeddings $\boldsymbol{Xv}$ and the problem geometry. We also provide a broader regularization path analysis that establishes the margin-maximizing nature of attention even for nonlinear prediction heads. When optimizing $\boldsymbol{v}$ and $\boldsymbol{p}$ simultaneously with logistic loss, we identify conditions under which the regularization paths directionally converge to their respective hard-margin SVM solutions, where $\boldsymbol{v}$ separates the input features based on their labels. Interestingly, the SVM formulation of $\boldsymbol{p}$ is influenced by the support vector geometry of $\boldsymbol{v}$. Finally, we verify our theoretical findings via numerical experiments and provide insights.
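
To make the model concrete, the following is a minimal NumPy sketch of $f(\boldsymbol{X})=\langle \boldsymbol{Xv}, \texttt{softmax}(\boldsymbol{XWp})\rangle$ for a single sequence. The toy tokens, the parameter values, and the helper names `softmax` and `attention_output` are illustrative assumptions and are not taken from the paper's experiments. The sketch only illustrates why directional behavior matters: as the norm of $\boldsymbol{p}$ grows along a fixed direction, the softmax saturates and the output approaches the value embedding of a single selected token.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array of logits."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def attention_output(X, v, W, p):
    """Return f(X) = <Xv, softmax(XWp)> and the attention weights.

    X has shape (T, d): T tokens of dimension d.
    """
    scores = X @ v                 # value embeddings Xv, shape (T,)
    weights = softmax(X @ W @ p)   # attention weights over tokens, shape (T,)
    return float(scores @ weights), weights

# Hypothetical toy sequence with three tokens in two dimensions.
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [-1.0, 0.0]])
v = np.array([1.0, 2.0])   # assumed value vector; Xv = [1, 2, -1]
W = np.eye(2)              # assumed identity weight matrix
p = np.array([0.0, 1.0])   # assumed direction; logits XWp = [0, 1, 0]

# Moderate norm: the output is a soft mixture of the token scores.
f_soft, w_soft = attention_output(X, v, W, p)

# Same direction, much larger norm: the softmax concentrates on token 1,
# so the output approaches that token's score (Xv)[1] = 2.
f_sharp, w_sharp = attention_output(X, v, W, 50.0 * p)

print(f_soft, w_soft)
print(f_sharp, w_sharp)
```

Scaling $\boldsymbol{p}$ rather than training it is a deliberate simplification: it isolates the effect of a growing norm along a fixed direction without claiming anything about which direction gradient descent actually reaches.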
