We prove pathwise convergence of the layerwise evolution of tokens in a finite-depth, finite-width transformer model with multilayer perceptron (MLP) blocks to a continuous-time stochastic interacting particle system. We also identify the stochastic partial differential equation describing the evolution of the tokens' distribution in this limit and prove propagation of chaos when the number of tokens is large. The bounds we establish are quantitative, and the limits we consider commute. We further prove that the limiting stochastic model displays synchronization by noise and establish exponential dissipation of the interaction energy on average, provided that the common noise is sufficiently coercive relative to the deterministic self-attention drift. Finally, we characterize the activation functions satisfying this coercivity condition.