Attention heads are among the basic building blocks of large language models (LLMs). Prior work on understanding their operation has mostly focused on analyzing their behavior during inference, for specific circuits or tasks. In this work, we seek a comprehensive mapping of the operations they implement in a model. We propose MAPS (Mapping Attention head ParameterS), an efficient framework that infers the functionality of attention heads from their parameters, without any model training or inference. We showcase the utility of MAPS in answering two types of questions: (a) given a predefined operation, mapping how strongly heads across the model implement it, and (b) given an attention head, inferring its salient functionality. Evaluating MAPS on 20 operations across 6 popular LLMs shows that its estimates correlate with the heads' outputs during inference and are causally linked to the model's predictions. Moreover, its mappings reveal attention heads implementing certain operations that were overlooked in previous studies, and yield valuable insights into function universality and architectural biases in LLMs. Finally, we present an automatic pipeline and analysis that leverage MAPS to characterize the salient operations of a given head. Our pipeline produces plausible operation descriptions for most heads, as assessed by human judgment, while revealing a diversity of operations.