Attention sinks are tokens, often the beginning-of-sequence (BOS) token, that receive disproportionately high attention despite limited semantic relevance. In this work, we identify a class of attention sinks, which we term secondary sinks, that differs fundamentally from the sinks studied in prior work, which we term primary sinks. While prior work has observed that tokens other than BOS can sometimes become sinks, those tokens were found to exhibit properties analogous to the BOS token: they emerge at the same layer, persist throughout the network, and draw a large amount of attention mass. In contrast, we find secondary sinks that arise primarily in middle layers, persist for a variable number of layers, and draw a smaller, but still significant, amount of attention mass. Through extensive experiments across 11 model families, we analyze where these secondary sinks appear, what properties they have, how they form, and how they affect the attention mechanism. Specifically, we show that: (1) these sinks are formed by specific middle-layer MLP modules, which map token representations to vectors aligned with the direction of that layer's primary sink; (2) the $\ell_2$-norm of these vectors determines both the secondary sink's sink score and the number of layers it persists for, leading to correspondingly different impacts on the attention mechanism; (3) the primary sink weakens in middle layers, coinciding with the emergence of secondary sinks. We further observe that in larger-scale models, the locations and lifetimes of the sinks, together referred to as sink levels, appear in a more deterministic and frequent manner: we identify three sink levels in QwQ-32B and six in Qwen3-14B. We open-source our findings at github.com/JeffreyWong20/Secondary-Attention-Sinks.
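To make the measured quantities concrete, the sketch below shows one plausible way to compute a per-token sink score (the average attention mass a token receives in a layer) and the alignment of MLP outputs with the primary sink's direction, which the abstract identifies as the formation mechanism of secondary sinks. This is a minimal illustration on synthetic tensors, not the authors' released code; the tensor shapes and the helper names `sink_scores` and `mlp_alignment` are assumptions made for exposition.

```python
# Minimal sketch of the quantities described in the abstract (illustrative only).
import torch
import torch.nn.functional as F

def sink_scores(attn: torch.Tensor) -> torch.Tensor:
    """Average attention mass each key token receives in one layer.

    attn: (num_heads, seq_len, seq_len) attention weights, each query row summing to 1.
    Returns a (seq_len,) vector; tokens with unusually high values are candidate sinks.
    """
    # Mean over heads, then over query positions -> mass received per key token.
    return attn.mean(dim=0).mean(dim=0)

def mlp_alignment(mlp_out: torch.Tensor, primary_sink_state: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between each token's MLP output and the primary sink's
    hidden state at the same layer. Per the abstract, secondary sinks form when a
    middle-layer MLP maps a token onto this direction, with the output's l2-norm
    governing the resulting sink score and how many layers the sink persists."""
    return F.cosine_similarity(mlp_out, primary_sink_state.unsqueeze(0), dim=-1)

# Toy usage with random tensors standing in for one layer of a real model.
heads, seq, dim = 8, 16, 64
attn = torch.softmax(torch.randn(heads, seq, seq), dim=-1)
mlp_out = torch.randn(seq, dim)
primary_sink_state = torch.randn(dim)

scores = sink_scores(attn)                      # per-token received attention mass
align = mlp_alignment(mlp_out, primary_sink_state)
norms = mlp_out.norm(dim=-1)                    # l2-norms referenced in point (2)
print(scores.shape, align.shape, norms.shape)
```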