Attention sinks are tokens, often the beginning-of-sequence (BOS) token, that receive disproportionately high attention despite limited semantic relevance. In this work, we identify a class of attention sinks, which we term secondary sinks, that differ fundamentally from the sinks studied in prior work, which we term primary sinks. While prior work has observed that tokens other than BOS can sometimes become sinks, those tokens exhibit properties analogous to the BOS token: they emerge at the same layer, persist throughout the network, and draw a large amount of attention mass. In contrast, we find secondary sinks that arise primarily in middle layers, persist for a variable number of layers, and draw a smaller, but still significant, amount of attention mass. Through extensive experiments across 11 model families, we analyze where these secondary sinks appear, what properties they have, how they form, and how they affect the attention mechanism. Specifically, we show that: (1) these sinks are formed by specific middle-layer MLP modules, which map token representations to vectors aligned with the direction of that layer's primary sink; (2) the $\ell_2$-norm of these vectors determines both the sink score of the secondary sink and the number of layers it persists for, leading to correspondingly different impacts on the attention mechanism; (3) the primary sink weakens in middle layers, coinciding with the emergence of secondary sinks. We also observe that in larger models, the location and lifetime of the sinks, together referred to as sink levels, appear in a more deterministic and frequent manner: we identify three sink levels in QwQ-32B and six in Qwen3-14B.
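As a rough illustration of the central quantity above, the sketch below computes a per-token sink score from an attention tensor, taken here as the average attention mass a key token receives across heads and query positions. This is a minimal, assumed operationalization for illustration, not the paper's exact definition; the synthetic attention weights and the bias toward a BOS-like token at position 0 are fabricated for the demo.

```python
import numpy as np

def sink_scores(attn):
    """attn: (heads, q_len, k_len) attention weights, rows summing to 1.
    Returns the average attention mass each key token receives
    (one illustrative notion of a 'sink score')."""
    return attn.mean(axis=(0, 1))  # shape: (k_len,)

# Synthetic attention: bias logits so the first (BOS-like) token
# draws disproportionate mass, mimicking a primary sink.
rng = np.random.default_rng(0)
heads, q_len, k_len = 4, 8, 8
logits = rng.normal(size=(heads, q_len, k_len))
logits[..., 0] += 3.0  # extra logit mass on token 0
attn = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)  # softmax

scores = sink_scores(attn)
assert scores.argmax() == 0  # the BOS-like token draws the most mass
```

Under this toy definition, a secondary sink would show up as a token whose score is elevated only over a contiguous band of middle layers, rather than from the first layers onward as for the primary sink.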