Retrieval Head Mechanistically Explains Long-Context Factuality

Despite recent progress in long-context language models, it remains unclear how transformer-based models acquire the capability to retrieve relevant information from arbitrary locations within a long context. This paper aims to address this question. Our systematic investigation across a wide spectrum of models reveals that a special type of attention head, which we dub the retrieval head, is largely responsible for retrieving information. We identify several intriguing properties of retrieval heads: (1) universal: all the models we explored that have long-context capability contain a set of retrieval heads; (2) sparse: only a small portion (less than 5%) of the attention heads are retrieval heads; (3) intrinsic: retrieval heads already exist in models pretrained with short context, and when the context length is extended by continual pretraining, it is still the same set of heads that performs information retrieval; (4) dynamically activated: taking Llama-2 7B as an example, 12 retrieval heads always attend to the required information no matter how the context is changed, while the rest of the retrieval heads are activated in different contexts; (5) causal: completely pruning retrieval heads leads to failure in retrieving relevant information and results in hallucination, while pruning random non-retrieval heads does not affect the model's retrieval ability. We further show that retrieval heads strongly influence chain-of-thought (CoT) reasoning, where the model needs to frequently refer back to the question and previously generated context. Conversely, tasks where the model directly generates the answer using its intrinsic knowledge are less impacted by masking out retrieval heads. These observations collectively explain which internal part of the model seeks information from the input tokens. We believe our insights will foster future research on reducing hallucination, improving reasoning, and compressing the KV cache.
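To make the pruning and masking described in property (5) concrete, the following is a minimal NumPy sketch of one common way to mask out attention heads: each head's output is multiplied by 0 or 1 before the output projection. It is an illustration under stated assumptions, not the procedure used in this paper. The function `multi_head_attention`, the `head_mask` argument, and the weight names `w_q`, `w_k`, `w_v`, `w_o` are hypothetical, and the toy layer omits biases, positional encodings, and layer normalization.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax; entries set to -inf receive weight 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_attention(x, w_q, w_k, w_v, w_o, n_heads, head_mask=None):
    """Toy causal multi-head self-attention with an optional per-head mask.

    x:         (seq, d_model) input activations
    w_q..w_o:  (d_model, d_model) projection matrices (hypothetical names)
    head_mask: (n_heads,) array of 0/1; a 0 removes that head's contribution
    """
    seq, d_model = x.shape
    d_head = d_model // n_heads

    def split(t):
        # (seq, d_model) -> (n_heads, seq, d_head)
        return t.reshape(seq, n_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)

    # Scaled dot-product scores, shape (n_heads, seq, seq).
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)

    # Causal mask: a position may not attend to later positions.
    future = np.triu(np.ones((seq, seq), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)

    attn = softmax(scores, axis=-1)
    per_head = attn @ v  # (n_heads, seq, d_head)

    if head_mask is not None:
        # Assumption: "masking out" a head means zeroing its output here.
        mask = np.asarray(head_mask, dtype=per_head.dtype)
        per_head = per_head * mask[:, None, None]

    # Concatenate heads and apply the output projection.
    merged = per_head.transpose(1, 0, 2).reshape(seq, d_model)
    return merged @ w_o
```

Because the output projection is linear in the concatenated head outputs, the layer output decomposes into a sum of per-head contributions, so zeroing a head removes exactly that head's term. In an experiment of the kind described above, one would presumably set the mask entries of the identified retrieval heads to 0 and compare against masking the same number of randomly chosen non-retrieval heads.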
