Self-attention layers have become fundamental building blocks of modern deep neural networks, yet their theoretical understanding remains limited, particularly from the perspective of random matrix theory. In this work, we provide a rigorous analysis of the singular value spectrum of the attention matrix and establish the first Gaussian equivalence result for attention. In the natural regime where the inverse temperature remains of constant order, we show that the singular value distribution of the attention matrix is asymptotically characterized by a tractable linear model. We further demonstrate that the distribution of the squared singular values deviates from the Marchenko-Pastur law, contrary to what has been assumed in previous work. Our proof relies on two key ingredients: precise control of the fluctuations of the softmax normalization term and a refined linearization that exploits favorable Taylor expansions of the exponential. The analysis also identifies a threshold for the validity of the linearization and explains why attention, despite not being an entrywise operation, admits a rigorous Gaussian equivalence in this regime.
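To make the claimed deviation from the Marchenko-Pastur law concrete, the following is a minimal numerical sketch, not the paper's exact construction: it samples a softmax attention matrix from i.i.d. Gaussian token embeddings and Gaussian query/key weights at constant-order inverse temperature, then compares the empirical squared singular values of a centered, rescaled version of the matrix against a variance-matched Marchenko-Pastur edge. The dimensions, the specific centering nA - J, and the variance matching are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Sketch: empirical squared singular values of a softmax attention matrix
# built from Gaussian data, compared against a variance-matched
# Marchenko-Pastur (MP) reference. All modeling choices below are
# illustrative assumptions, not the paper's exact setup.

rng = np.random.default_rng(0)
n, d = 1000, 1000        # sequence length and embedding dimension (square case)
beta = 1.0               # inverse temperature, kept of constant order

X = rng.standard_normal((n, d))                   # Gaussian token embeddings
W_Q = rng.standard_normal((d, d)) / np.sqrt(d)    # query weights (assumed Gaussian)
W_K = rng.standard_normal((d, d)) / np.sqrt(d)    # key weights (assumed Gaussian)

# Pre-softmax scores with the usual 1/sqrt(d) scaling, so entries are O(1),
# followed by the row-wise softmax normalization defining the attention matrix A.
S = beta * (X @ W_Q) @ (X @ W_K).T / np.sqrt(d)
S -= S.max(axis=1, keepdims=True)                 # numerical stabilization
A = np.exp(S)
A /= A.sum(axis=1, keepdims=True)                 # rows of A sum to one

# Center and rescale: M = n*A - J removes the rank-one row-stochastic mean and
# puts the entries on an O(1) scale, so svd(M)^2 / n lives on the MP scale.
M = n * A - 1.0
sv2 = np.linalg.svd(M, compute_uv=False) ** 2 / n

# Variance-matched MP reference for aspect ratio c = 1: support [0, 4 s^2],
# where s^2 is the per-entry variance of M.
s2 = M.var()
mp_edge = 4.0 * s2

print(f"per-entry variance s^2          : {s2:.4f}")
print(f"MP upper edge 4 s^2             : {mp_edge:.4f}")
print(f"largest squared singular value  : {sv2.max():.4f}")
print(f"fraction above the MP edge      : {(sv2 > mp_edge).mean():.4f}")
```

Note that the softmax normalization correlates the entries of each row of A, so M is not an i.i.d. matrix, and a histogram of sv2 compared against the MP density illustrates the deviation stated above. The tractable linear model that characterizes the limiting distribution in the theorem is not reproduced in this sketch.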