Transformers have revolutionized deep learning across many domains, but a precise understanding of their token dynamics remains a theoretical challenge. Existing theories of deep Transformers with layer normalization typically predict that tokens cluster to a single point; however, these results rest on deterministic weight assumptions, which fail to capture the standard initialization scheme used in Transformers. In this work, we show that accounting for the intrinsic stochasticity of random initialization alters this picture. More precisely, we analyze deep Transformers in which noise arises from the random initialization of the value matrices. Under diffusion scaling and token-wise RMS normalization, we prove that, as the number of Transformer layers tends to infinity, the discrete token dynamics converge to an interacting-particle system on the sphere in which all tokens are driven by a \emph{common} matrix-valued Brownian motion. In this limit, we show that initialization noise prevents the collapse to a single cluster predicted by deterministic models. For two tokens, we prove a phase transition governed by the interaction strength and the token dimension: unlike in deterministic attention flows, antipodal configurations become attracting with positive probability. Numerical experiments confirm the predicted transition, show that antipodal formations persist for more than two tokens, and demonstrate that suppressing this intrinsic noise degrades accuracy.
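To make the setup concrete, here is a minimal NumPy sketch of the discrete dynamics described above, written under illustrative assumptions: single-head softmax attention, a fresh Gaussian value matrix at every layer shared across all tokens (the source of the common noise in the limit), drift scaled by 1/L and noise by 1/sqrt(L) (diffusion scaling), and token-wise RMS normalization after each layer. The function names and the parameters beta (interaction strength) and sigma (noise level) are ours, not the paper's; this is a sketch of the mechanism, not the paper's exact model.

```python
import numpy as np

def rms_norm(x):
    # Token-wise RMS normalization: each token is rescaled to unit
    # root-mean-square per coordinate, i.e. norm sqrt(d) on the sphere.
    return x / np.sqrt(np.mean(x**2, axis=-1, keepdims=True))

def simulate_tokens(n=2, d=16, L=2000, beta=1.0, sigma=1.0, seed=0):
    """Run L residual layers of softmax attention with a fresh Gaussian
    value matrix per layer, under diffusion scaling (assumed form)."""
    rng = np.random.default_rng(seed)
    x = rms_norm(rng.standard_normal((n, d)))  # n tokens on the sphere
    for _ in range(L):
        # Attention weights; beta stands in for the interaction strength.
        logits = beta * (x @ x.T) / d
        a = np.exp(logits - logits.max(axis=1, keepdims=True))
        a /= a.sum(axis=1, keepdims=True)
        # One random value matrix per layer, applied to every token:
        # this shared randomness becomes the *common* matrix-valued
        # Brownian motion in the continuum limit.
        w = rng.standard_normal((d, d)) / np.sqrt(d)
        drift = (a @ x) / L                      # O(1/L) mean-field drift
        noise = sigma * (x @ w.T) / np.sqrt(L)   # O(1/sqrt(L)) diffusion
        x = rms_norm(x + drift + noise)
    return x

if __name__ == "__main__":
    x = simulate_tokens()
    # Cosine similarity near +1 means collapse to one cluster;
    # near -1 means an antipodal pair.
    cos = float(x[0] @ x[1]) / np.sqrt((x[0] @ x[0]) * (x[1] @ x[1]))
    print("cos(x0, x1) =", cos)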
```

Setting sigma = 0 recovers the deterministic regime, where the two tokens drift toward a single cluster; with sigma > 0 the printed cosine similarity can settle near -1 with positive probability, consistent with the antipodal attraction described above.