We study a random model of deep multi-head self-attention in which the weights are resampled independently across layers and heads, as at initialization of training. Viewing depth as a time variable, the residual stream defines a discrete-time interacting particle system on the unit sphere. We prove that, under suitable joint scalings of the depth, the residual step size, and the number of heads, this dynamics admits a nontrivial homogenized limit. Depending on the scaling, the limit is either deterministic or stochastic with common noise; in the mean-field regime, the latter leads to a stochastic nonlinear Fokker--Planck equation for the conditional law of a representative token. In the Gaussian setting, the limiting drift vanishes, making the homogenized dynamics explicit enough to study representation collapse. This yields quantitative trade-offs between dimension, context length, and temperature, and identifies regimes in which clustering can be mitigated.
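To make the particle-system picture concrete, the following is a minimal simulation sketch, not the construction analyzed in the paper. Everything specific in it is an assumption made for illustration: the Gaussian weights with N(0, 1/d) entries, the softmax attention with inverse temperature `beta`, the plain average over heads, the step size `eta`, and the function names `attention_layer`, `simulate`, and `mean_pairwise_cosine`. The joint scalings of depth, step size, and number of heads studied in the paper are not reproduced here.

```python
import numpy as np


def normalize(x):
    """Project each row of x back onto the unit sphere."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def attention_layer(x, n_heads, beta, eta, rng):
    """One residual step with freshly sampled weights (assumed model form)."""
    n, d = x.shape
    update = np.zeros_like(x)
    for _ in range(n_heads):
        # Independent weights per head and per layer; the N(0, 1/d)
        # entry distribution is an assumption of this sketch.
        q, k, v = rng.normal(scale=d ** -0.5, size=(3, d, d))
        # scores[i, j] = beta * <Q x_i, K x_j>
        scores = beta * (x @ q.T) @ (x @ k.T).T
        scores -= scores.max(axis=1, keepdims=True)  # numerically stable softmax
        weights = np.exp(scores)
        weights /= weights.sum(axis=1, keepdims=True)
        update += weights @ (x @ v.T)
    # Residual update averaged over heads, then projected back to the sphere.
    return normalize(x + eta * update / n_heads)


def simulate(n_tokens=16, dim=8, depth=50, n_heads=4, beta=1.0, eta=0.1, seed=0):
    """Run `depth` layers starting from random points on the sphere."""
    rng = np.random.default_rng(seed)
    x = normalize(rng.normal(size=(n_tokens, dim)))
    for _ in range(depth):
        x = attention_layer(x, n_heads, beta, eta, rng)
    return x


def mean_pairwise_cosine(x):
    """Average cosine similarity over distinct token pairs (rows are unit vectors)."""
    n = x.shape[0]
    gram = x @ x.T
    return (gram.sum() - np.trace(gram)) / (n * (n - 1))
```

Evaluating `mean_pairwise_cosine(simulate(...))` for different values of `dim`, `n_tokens`, and `beta` gives a crude diagnostic of clustering: values near 1 indicate that the tokens have collapsed toward a common direction.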