Gradient Dynamics of Attention: How Cross-Entropy Sculpts Bayesian Manifolds

Transformers empirically perform precise probabilistic reasoning in carefully constructed ``Bayesian wind tunnels'' and in large-scale language models, yet the mechanisms by which gradient-based learning creates the required internal geometry remain opaque. We provide a complete first-order analysis of how cross-entropy training reshapes attention scores and value vectors in a transformer attention head. Our core result is an \emph{advantage-based routing law} for attention scores, \[ \frac{\partial L}{\partial s_{ij}} = \alpha_{ij}\bigl(b_{ij}-\mathbb{E}_{\alpha_i}[b]\bigr), \qquad b_{ij} := u_i^\top v_j, \] coupled with a \emph{responsibility-weighted update} for values, \[ \Delta v_j = -\eta\sum_i \alpha_{ij} u_i, \] where $u_i$ is the upstream gradient at position $i$ and $\alpha_{ij}$ are attention weights. These equations induce a positive feedback loop in which routing and content specialize together: queries route more strongly to values that are above-average for their error signal, and those values are pulled toward the queries that use them. We show that this coupled specialization behaves like a two-timescale EM procedure: attention weights implement an E-step (soft responsibilities), while values implement an M-step (responsibility-weighted prototype updates), with queries and keys adjusting the hypothesis frame. Through controlled simulations, including a sticky Markov-chain task where we compare a closed-form EM-style update to standard SGD, we demonstrate that the same gradient dynamics that minimize cross-entropy also sculpt the low-dimensional manifolds identified in our companion work as implementing Bayesian inference. This yields a unified picture in which optimization (gradient flow) gives rise to geometry (Bayesian manifolds), which in turn supports function (in-context probabilistic reasoning).
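The two gradient formulas above can be checked numerically on a toy head. The following is a minimal sketch, not the authors' experimental code: it assumes the head output is $o_i = \sum_j \alpha_{ij} v_j$ with $\alpha_i = \mathrm{softmax}(s_i)$, treats the scores $s_{ij}$ as free parameters (no query/key projections, no causal mask), and feeds $o_i$ through a hypothetical linear readout `readout` into a cross-entropy loss. The names `readout`, `labels`, and all sizes are illustrative assumptions. The analytic gradients are compared against central finite differences.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def head_output(alpha_i, values):
    # o_i = sum_j alpha_ij * v_j
    d = len(values[0])
    return [sum(a * v[k] for a, v in zip(alpha_i, values)) for k in range(d)]

def loss(scores, values, readout, labels):
    # Cross-entropy of a linear readout applied to each head output.
    total = 0.0
    for s_i, y_i in zip(scores, labels):
        o_i = head_output(softmax(s_i), values)
        probs = softmax([dot(w, o_i) for w in readout])
        total -= math.log(probs[y_i])
    return total

def analytic_grads(scores, values, readout, labels):
    n, d = len(values), len(values[0])
    grad_s = []
    grad_v = [[0.0] * d for _ in range(n)]
    for s_i, y_i in zip(scores, labels):
        alpha_i = softmax(s_i)
        o_i = head_output(alpha_i, values)
        probs = softmax([dot(w, o_i) for w in readout])
        # Upstream gradient u_i = dL/do_i = W^T (p - onehot(y_i)).
        err = [p - (1.0 if c == y_i else 0.0) for c, p in enumerate(probs)]
        u_i = [sum(e * w[k] for e, w in zip(err, readout)) for k in range(d)]
        # Routing law: dL/ds_ij = alpha_ij * (b_ij - E_{alpha_i}[b]).
        b_i = [dot(u_i, v) for v in values]
        mean_b = dot(alpha_i, b_i)
        grad_s.append([a * (b - mean_b) for a, b in zip(alpha_i, b_i)])
        # Value gradient: dL/dv_j = sum_i alpha_ij * u_i (so Delta v_j = -eta * this).
        for j in range(n):
            for k in range(d):
                grad_v[j][k] += alpha_i[j] * u_i[k]
    return grad_s, grad_v

def finite_diff(f, matrix, eps=1e-6):
    # Central differences, perturbing one entry of `matrix` in place at a time.
    grads = []
    for row in matrix:
        g_row = []
        for k in range(len(row)):
            old = row[k]
            row[k] = old + eps
            up = f()
            row[k] = old - eps
            down = f()
            row[k] = old
            g_row.append((up - down) / (2 * eps))
        grads.append(g_row)
    return grads

def max_abs_diff(a, b):
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))

random.seed(0)
T, N, D, C = 3, 4, 5, 3  # queries, values, value dimension, classes (all illustrative)
scores = [[random.gauss(0, 1) for _ in range(N)] for _ in range(T)]
values = [[random.gauss(0, 1) for _ in range(D)] for _ in range(N)]
readout = [[random.gauss(0, 1) for _ in range(D)] for _ in range(C)]
labels = [random.randrange(C) for _ in range(T)]

grad_s, grad_v = analytic_grads(scores, values, readout, labels)
f = lambda: loss(scores, values, readout, labels)
err_s = max_abs_diff(grad_s, finite_diff(f, scores))
err_v = max_abs_diff(grad_v, finite_diff(f, values))
print(f"max |analytic - numeric| for scores: {err_s:.2e}")
print(f"max |analytic - numeric| for values: {err_v:.2e}")
```

Both discrepancies should be at the level of finite-difference noise. Each row of `grad_s` also sums to zero, since the advantage $b_{ij}-\mathbb{E}_{\alpha_i}[b]$ has zero mean under $\alpha_i$: a gradient step redistributes attention within a row rather than shifting all scores together.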
