In modern theoretical analyses of neural networks, the infinite-width limit is often invoked to justify Gaussian approximations of neuron preactivations (e.g., via neural network Gaussian processes or Tensor Programs). However, these Gaussian-based asymptotic theories have so far been unable to capture the behavior of attention layers, except under special regimes such as infinitely many heads or tailored scaling schemes. In this paper, leveraging the Tensor Programs framework, we rigorously identify the infinite-width limit distribution of variables within a single attention layer under realistic architectural dimensions and the standard $1/\sqrt{n}$ scaling, where $n$ denotes the width. We derive the exact form of this limit law without resorting to infinite-head approximations or tailored scalings, and show that it departs fundamentally from Gaussianity. The non-Gaussianity arises from a hierarchical structure: the limit distribution is Gaussian only conditionally on the random similarity scores. Numerical experiments validate our theoretical predictions, confirming that the theory remains accurate at finite width and accurately describes finite-head attention. Beyond characterizing a standalone attention layer, our findings lay the groundwork for a unified theory of deep Transformer architectures in the infinite-width regime.
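As a rough illustration of the hierarchical, conditionally Gaussian structure described above, the minimal sketch below (not the code used for the paper's experiments) simulates a randomly initialized single-head attention layer with the standard $1/\sqrt{n}$ scaling and estimates the excess kurtosis of one output coordinate over repeated draws of the weights. A purely Gaussian limit would predict excess kurtosis near zero, whereas a mixture of Gaussians indexed by the random similarity scores generically does not. All dimensions, trial counts, and variable names here are illustrative assumptions.

```python
# Minimal sketch, assuming standard N(0, 1/n) weight initialization and
# softmax(Q K^T / sqrt(n)) attention; all sizes below are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, seq_len, n_trials = 256, 4, 2000   # width, sequence length, weight draws

samples = []
for _ in range(n_trials):
    X = rng.standard_normal((seq_len, n))            # token representations
    W_Q = rng.standard_normal((n, n)) / np.sqrt(n)   # 1/sqrt(n)-scaled weights
    W_K = rng.standard_normal((n, n)) / np.sqrt(n)
    W_V = rng.standard_normal((n, n)) / np.sqrt(n)
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T / np.sqrt(n)                    # random similarity scores
    scores -= scores.max(axis=1, keepdims=True)      # numerically stable softmax
    A = np.exp(scores)
    A /= A.sum(axis=1, keepdims=True)
    out = A @ V                                      # attention output
    samples.append(out[0, 0])                        # track one coordinate

samples = np.array(samples)
# Conditionally on the scores the output is Gaussian, but the conditional
# variance fluctuates with the scores, so the marginal law is a Gaussian
# mixture and its excess kurtosis need not vanish.
excess_kurtosis = np.mean((samples - samples.mean()) ** 4) / samples.var() ** 2 - 3
print(f"empirical excess kurtosis at n={n}: {excess_kurtosis:.3f}")
```

Under these assumptions, the similarity scores remain random (rather than concentrating) as $n$ grows, which is precisely why the marginal output distribution deviates from a single Gaussian even in the limit.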