Understanding the intricate non-convex training dynamics of softmax-based models is crucial for explaining the empirical success of transformers. In this article, we analyze the gradient flow dynamics of the value-softmax model, defined as $L(\mathbf{V}\sigma(\mathbf{a}))$, where $\mathbf{V}$ is a learnable value matrix, $\mathbf{a}$ is a learnable attention vector, $\sigma$ is the softmax, and $L$ is the training objective. Because the product of a matrix and a softmax vector is the core building block of self-attention, our analysis provides direct insight into the training dynamics of transformers. We show that gradient flow on this structure inherently drives the optimization toward solutions with low-entropy outputs. We demonstrate that this polarizing effect holds across various objectives, including the logistic and square losses. Finally, we discuss the practical implications of these theoretical results, which offer a formal mechanism for empirical phenomena such as attention sinks and massive activations.
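To make the value-softmax parameterization concrete, the following is a minimal sketch of discretized gradient flow (plain gradient descent with a small step) on $L(\mathbf{V}\sigma(\mathbf{a}))$ with a square loss. The target vector `y`, the dimensions, the initialization, the step size, and all function names are illustrative assumptions, not choices taken from the article. The sketch only records the loss and the entropy of $\sigma(\mathbf{a})$ along the trajectory; it is not meant to reproduce the article's results, and how the entropy behaves will depend on the loss, the initialization, and the horizon.

```python
import numpy as np


def softmax(a):
    """Numerically stable softmax of a 1-D logit vector."""
    e = np.exp(a - np.max(a))
    return e / e.sum()


def entropy(p, eps=1e-12):
    """Shannon entropy (in nats) of a probability vector."""
    return float(-np.sum(p * np.log(p + eps)))


def square_loss(V, a, y):
    """0.5 * ||V softmax(a) - y||^2, one illustrative choice of L."""
    r = V @ softmax(a) - y
    return 0.5 * float(r @ r)


def gradients(V, a, y):
    """Analytic gradients of square_loss with respect to V and a."""
    s = softmax(a)
    g = V @ s - y             # dL/dz evaluated at z = V s
    grad_V = np.outer(g, s)   # dL/dV = g s^T
    u = V.T @ g               # dL/ds
    grad_a = s * (u - s @ u)  # softmax Jacobian (diag(s) - s s^T) applied to u
    return grad_V, grad_a


def simulate(d=3, n=5, steps=2000, lr=0.01, seed=0):
    """Run gradient descent and record (loss, entropy of softmax(a)) per step.

    All hyperparameters here are hypothetical defaults for illustration.
    """
    rng = np.random.default_rng(seed)
    V = 0.1 * rng.standard_normal((d, n))  # small random value matrix
    a = np.zeros(n)                        # uniform attention at the start
    y = rng.standard_normal(d)             # arbitrary regression target
    history = []
    for _ in range(steps):
        history.append((square_loss(V, a, y), entropy(softmax(a))))
        grad_V, grad_a = gradients(V, a, y)
        V -= lr * grad_V
        a -= lr * grad_a
    history.append((square_loss(V, a, y), entropy(softmax(a))))
    return V, a, y, history


if __name__ == "__main__":
    _, a, _, history = simulate()
    print("initial (loss, entropy):", history[0])
    print("final   (loss, entropy):", history[-1])
    print("final attention weights:", softmax(a))
```

Swapping `square_loss` and the line computing `g` for another differentiable objective, such as a logistic loss, leaves the rest of the update unchanged, since $\mathbf{V}$ and $\mathbf{a}$ only enter through $\mathbf{V}\sigma(\mathbf{a})$.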