Self-attention layers have become fundamental building blocks of modern deep neural networks, yet their theoretical understanding remains limited, particularly from the perspective of random matrix theory. In this work, we provide a rigorous analysis of the singular value spectrum of the attention matrix and establish the first Gaussian equivalence result for attention. In a natural regime where the inverse temperature remains of constant order, we show that the singular value distribution of the attention matrix is asymptotically characterized by a tractable linear model. We further demonstrate that the distribution of squared singular values deviates from the Marchenko-Pastur law, contrary to what previous work has assumed. Our proof relies on two key ingredients: precise control of fluctuations in the normalization term, and a refined linearization that leverages favorable Taylor expansions of the exponential. This analysis also identifies a threshold for linearization and elucidates why attention, despite not being an entrywise operation, admits a rigorous Gaussian equivalence in this regime.
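
As a concrete illustration of the object studied here, the following minimal sketch builds a softmax attention matrix from random Gaussian inputs and computes its singular values numerically. The dimensions, the Gaussian initialization, the `1/sqrt(d)` score scaling, and the helper name `attention_matrix` are all assumptions chosen for illustration; the abstract does not specify the exact model or normalization.

```python
import numpy as np


def attention_matrix(X, W_q, W_k, beta):
    """Row-wise softmax of beta * (X W_q)(X W_k)^T / sqrt(d).

    The 1/sqrt(d) scaling is the usual transformer convention and is an
    assumption here, not necessarily the normalization used in the paper.
    """
    d = W_q.shape[1]
    scores = beta * (X @ W_q) @ (X @ W_k).T / np.sqrt(d)
    # Subtract the row maximum for numerical stability; softmax is unchanged.
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    # The normalization term: each row is divided by its own sum.
    return weights / weights.sum(axis=1, keepdims=True)


rng = np.random.default_rng(0)
n, d = 256, 128   # sequence length and embedding dimension (illustrative)
beta = 1.0        # inverse temperature, kept of constant order

X = rng.standard_normal((n, d))                  # token embeddings
W_q = rng.standard_normal((d, d)) / np.sqrt(d)   # query weights
W_k = rng.standard_normal((d, d)) / np.sqrt(d)   # key weights

A = attention_matrix(X, W_q, W_k, beta)

# Singular values, returned in descending order.
singular_values = np.linalg.svd(A, compute_uv=False)
squared_singular_values = singular_values ** 2
```

Because every row of `A` sums to one, the all-ones vector is mapped to itself, so the largest singular value is at least 1. A histogram of `squared_singular_values` could then be compared against a Marchenko-Pastur density to examine the deviation described above; this sketch does not attempt that comparison.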