A Mean-Field Analysis of Multi-Head Self-Attention under Cross-Entropy Training

This paper develops a mean-field theory for a simplified single-layer causal multi-head self-attention model trained by cross-entropy minimization. Each attention head is treated as a particle in parameter space, and the empirical law of the heads serves as the state variable in the many-head regime. In the infinite-head limit, the averaged attention logits define a risk functional on probability measures, whose first variation generates a nonlinear Wasserstein gradient-flow equation. Unlike classical mean-field analyses of shallow networks, which often focus on square-loss regression, the present model contains the softmax residual from the cross-entropy objective and the query-key-value structure of masked self-attention. We prove a static finite-head approximation bound for the optimal risk, characterize global minimizers through a variational support condition, and establish a quantitative finite-time propagation-of-chaos estimate comparing finite-head stochastic gradient descent with the limiting PDE. We then study the long-time behavior of the PDE: energy dissipation, convergence to the stationary set under compactness, convergence to a single stationary measure under topological or Kurdyka--Łojasiewicz assumptions, and explicit convergence rates under gradient-domination conditions. Finally, we prove local exponential stability under a Wasserstein strong-monotonicity condition and give verifiable stability and instability criteria for Dirac stationary measures. The results provide a rigorous baseline mean-field framework for attention-head training and clarify the additional compactness, landscape, and curvature assumptions needed to pass from stationarity to convergence and stability.

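
For orientation, the objects named in the abstract can be written out in one common mean-field form. This is a sketch under assumed notation, not the paper's own definitions: the symbols $\theta_i$, $h$, $V$, $e_y$, and the $1/N$ normalization are illustrative choices, and the paper's scaling and regularity conditions may differ.

```latex
% Assumed notation (not from the source):
%   \theta_i \in \mathbb{R}^p   parameters (query, key, value) of head i
%   h(x;\theta) \in \mathbb{R}^V contribution of one head to the logits
%   e_y                          one-hot vector of the target token y
\begin{align*}
\mu^N &= \frac{1}{N}\sum_{i=1}^{N} \delta_{\theta_i}
  && \text{(empirical law of the heads)} \\
f(x;\mu) &= \int h(x;\theta)\,\mu(d\theta)
  && \text{(averaged logits)} \\
R(\mu) &= \mathbb{E}_{(x,y)}\bigl[-\log \operatorname{softmax}\bigl(f(x;\mu)\bigr)_y\bigr]
  && \text{(cross-entropy risk)} \\
\frac{\delta R}{\delta \mu}(\mu)(\theta)
  &= \mathbb{E}_{(x,y)}\bigl[\bigl\langle \operatorname{softmax}\bigl(f(x;\mu)\bigr) - e_y,\; h(x;\theta)\bigr\rangle\bigr]
  && \text{(first variation)} \\
\partial_t \mu_t &= \nabla_\theta \cdot \Bigl(\mu_t\, \nabla_\theta \frac{\delta R}{\delta \mu}(\mu_t)\Bigr)
  && \text{(Wasserstein gradient flow)}
\end{align*}
```

In this form the softmax residual $\operatorname{softmax}(f) - e_y$ is the gradient of the cross-entropy loss with respect to the logits. It takes the place of the prediction error $f - y$ that appears in square-loss analyses.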