Large Language Models (LLMs) often exhibit slash attention patterns, where attention scores concentrate along the $\Delta$-th sub-diagonal for some offset $\Delta$. These patterns play a key role in passing information across tokens. But why do they emerge? In this paper, we demystify the emergence of these Slash-Dominant Heads (SDHs) from both empirical and theoretical perspectives. First, by analyzing open-source LLMs, we find that SDHs are intrinsic to the models and generalize to out-of-distribution prompts. To explain this intrinsic emergence, we analyze the queries, keys, and Rotary Position Embedding (RoPE), which jointly determine attention scores. Our empirical analysis reveals two characteristic conditions of SDHs: (1) queries and keys are almost rank-one, and (2) RoPE is dominated by medium- and high-frequency components. Under these conditions, queries and keys are nearly identical across tokens, and interactions between the medium- and high-frequency components of RoPE give rise to SDHs. Beyond empirical evidence, we formalize these conditions as modeling assumptions and show theoretically that they are sufficient for SDHs to emerge. In particular, we analyze the training dynamics of a shallow Transformer equipped with RoPE under these conditions, and prove that models trained via gradient descent exhibit SDHs that generalize to out-of-distribution prompts.
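The role of the first condition can be illustrated with a small numerical sketch. It is not the paper's experimental setup: the function names `rope_rotate` and `rope_logits`, the random vectors, the dimensions, and the standard RoPE base of 10000 are all assumptions made for illustration. The sketch takes the idealized limit in which every token shares the same query and the same key, applies RoPE, and checks that the pre-softmax attention logits then depend only on the offset between query and key positions, so each sub-diagonal carries a single value. It does not model the second condition (which frequency components dominate) or any training dynamics.

```python
import numpy as np


def rope_rotate(x, positions, base=10000.0):
    """Apply RoPE to x of shape (T, d) with d even, rotating pairs (2k, 2k+1)."""
    d = x.shape[1]
    # One rotation frequency per 2-D pair (assumed standard RoPE schedule).
    freqs = base ** (-np.arange(0, d, 2) / d)
    angles = positions[:, None] * freqs[None, :]  # shape (T, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out


def rope_logits(T=32, d=64, seed=0):
    """Pre-softmax logits when all tokens share one query and one key vector."""
    rng = np.random.default_rng(seed)
    q = rng.standard_normal(d)  # hypothetical shared query direction
    k = rng.standard_normal(d)  # hypothetical shared key direction
    Q = np.tile(q, (T, 1))      # idealized: identical queries at every position
    K = np.tile(k, (T, 1))      # idealized: identical keys at every position
    pos = np.arange(T, dtype=float)
    return rope_rotate(Q, pos) @ rope_rotate(K, pos).T / np.sqrt(d)


logits = rope_logits()
# Entry [i, j] depends only on i - j, so every sub-diagonal is constant.
for delta in range(logits.shape[0]):
    diag = np.diag(logits, -delta)  # entries with query index = key index + delta
    assert np.allclose(diag, diag[0])
```

Because each sub-diagonal is constant in this idealized setting, a causal softmax would favor the same offset in every row that is long enough to contain it. Which offset that is depends on the shared query and key and on the RoPE frequencies, which the sketch leaves arbitrary.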