Large Language Models (LLMs) often exhibit slash attention patterns, in which attention scores concentrate along the $Δ$-th sub-diagonal for some offset $Δ$. These patterns play a key role in passing information across tokens, but why do they emerge? In this paper, we demystify the emergence of heads that exhibit such patterns, which we call Slash-Dominant Heads (SDHs), from both empirical and theoretical perspectives. First, by analyzing open-source LLMs, we find that SDHs are intrinsic to the models and generalize to out-of-distribution prompts. To explain this intrinsic emergence, we analyze the queries, keys, and Rotary Position Embedding (RoPE), which jointly determine the attention scores. Our empirical analysis reveals two characteristic conditions of SDHs: (1) the queries and keys are almost rank-one, and (2) RoPE is dominated by medium- and high-frequency components. Under these conditions, queries and keys are nearly identical across tokens, so the attention score depends mainly on the relative position of two tokens, and the interactions between the medium- and high-frequency components of RoPE give rise to SDHs. Beyond this empirical evidence, we formalize the two conditions as modeling assumptions and show theoretically that they are sufficient for SDHs to emerge. In particular, we analyze the training dynamics of a shallow Transformer equipped with RoPE under these conditions and prove that models trained via gradient descent exhibit SDHs that generalize to out-of-distribution prompts.