Attention is foundational to large language models (LLMs), allowing different heads to focus on different relevant input tokens. However, learned behaviors such as attention sinks, where the first token receives most of the attention despite limited semantic importance, suggest that some heads may be inactive and point to a significant source of computational redundancy. To analyze this phenomenon, we evaluate 12 score functions that measure different ways a head can be inactive. Thresholding these scores lets us analyze different sets of potentially inactive attention heads. We test whether the identified heads are genuinely inactive through model interventions, finding that more than 12% of attention heads are inactive on average and can be ablated in specific contexts while keeping MMLU accuracy within 1% of the pretrained LLM. Across 3 model families, our score functions that measure the average norm of a head's output consistently identify inactive heads that would not have been found by score functions relying solely on attention weights. We establish that relying on a score function that measures a first-token attention sink would underestimate the prevalence of inactive heads, failing to identify more than 7% of inactive heads on average. We also show how measuring score distributions can provide insights into attention behavior. For instance, we find evidence that finetuning causes little to no change in attention behavior, and that even within the same model family, larger model scales exhibit different attention behaviors.
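To make the two families of scores concrete, here is a minimal sketch of how a first-token sink score and an output-norm score could be computed for a single head (assuming PyTorch; the function names, tensor shapes, and thresholding rule are illustrative assumptions, not the paper's exact 12 score functions):

```python
import torch

def first_token_sink_score(attn_weights: torch.Tensor) -> float:
    """Illustrative sink score: mean attention mass on the first token.

    attn_weights: (seq_len, seq_len) row-stochastic attention matrix for
    one head on one sequence (rows index queries, columns index keys).
    A high value indicates the head concentrates attention on token 0.
    """
    return attn_weights[:, 0].mean().item()

def output_norm_score(head_output: torch.Tensor) -> float:
    """Illustrative norm score: average L2 norm of the head's output.

    head_output: (seq_len, d_model) per-position contribution of one
    head to the residual stream. A small value suggests the head writes
    almost nothing downstream, even if its attention weights look busy.
    """
    return head_output.norm(dim=-1).mean().item()

# Hypothetical thresholding step for flagging potentially inactive heads;
# the cutoff values are placeholders, not taken from the paper.
def flag_inactive(sink: float, out_norm: float,
                  sink_cutoff: float = 0.9, norm_cutoff: float = 1e-2) -> bool:
    return sink > sink_cutoff or out_norm < norm_cutoff
```

The contrast between the two functions reflects the abstract's point: a head's attention weights and its output norm can disagree, which is one way a norm-based score can flag inactive heads that a purely weight-based score would miss.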