We study two recurring phenomena in Transformer language models: massive activations, in which a small number of tokens exhibit extreme outliers in a few channels, and attention sinks, in which certain tokens attract disproportionate attention mass regardless of semantic relevance. Prior work observes that these phenomena frequently co-occur and often involve the same tokens, but their functional roles and causal relationship remain unclear. Through systematic experiments, we show that the co-occurrence is largely an architectural artifact of modern Transformer design, and that the two phenomena serve related but distinct functions. Massive activations operate globally: they induce near-constant hidden representations that persist across layers, effectively functioning as implicit parameters of the model. Attention sinks operate locally: they modulate attention outputs across heads and bias individual heads toward short-range dependencies. We identify the pre-norm configuration as the key choice that enables the co-occurrence, and show that ablating it causes the two phenomena to decouple.
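To make the notion of a massive activation concrete, the following is a minimal illustrative sketch (not the paper's exact criterion): given a hidden-state tensor, it flags entries whose magnitude dwarfs the typical activation. The function name and the ratio threshold are hypothetical choices for illustration.

```python
import numpy as np

def find_massive_activations(hidden, ratio=100.0):
    """Return (token, channel) pairs whose magnitude exceeds
    `ratio` times the median magnitude of the whole tensor.
    `hidden` has shape (seq_len, hidden_dim)."""
    mags = np.abs(hidden)
    threshold = ratio * np.median(mags)
    rows, cols = np.where(mags > threshold)
    return list(zip(rows.tolist(), cols.tolist()))

# Toy hidden states: roughly unit-scale values, with one extreme
# outlier planted at token 0, channel 3 — the pattern described
# in the abstract (a few tokens, a few channels).
rng = np.random.default_rng(0)
h = rng.normal(size=(8, 16))
h[0, 3] = 500.0
print(find_massive_activations(h))  # → [(0, 3)]
```

A real analysis would apply this per layer to the residual stream of a pretrained model; the toy tensor here only demonstrates the outlier pattern.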