Interpreting Key Mechanisms of Factual Recall in Transformer-Based Language Models

In this paper, we investigate several mechanisms that Transformer-based language models employ in factual recall tasks. We outline a pipeline consisting of three major steps: (1) Given a prompt such as ``The capital of France is,'' task-specific attention heads extract the topic token, here ``France,'' from the context and pass it to subsequent MLPs. (2) Because the attention heads' outputs are aggregated with equal weight and added to the residual stream, the subsequent MLP acts as an ``activation,'' which either erases or amplifies the information originating from individual heads. As a result, the topic token ``France'' stands out in the residual stream. (3) A deep MLP takes ``France'' and generates a component that redirects the residual stream towards the direction of the correct answer, i.e., ``Paris.'' This procedure is akin to applying an implicit function such as ``get\_capital($X$),'' where the argument $X$ is the topic token information passed by the attention heads. To support this quantitative and qualitative analysis of MLPs, we propose a novel analytic method that decomposes the outputs of an MLP into human-understandable components. Additionally, we observe a universal anti-overconfidence mechanism in the final layer of the models, which suppresses correct predictions. Leveraging our interpretation, we mitigate this suppression and improve factual recall confidence. These interpretations are evaluated across diverse tasks spanning various domains of factual knowledge, on language models from the GPT-2 family, 1.3B OPT, and up to 7B Llama-2, in both zero- and few-shot setups.
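The three-step pipeline can be illustrated with a toy residual stream. The following is a minimal sketch, not the paper's method or code: it assumes a four-token vocabulary in which every token has its own orthogonal direction, an identity unembedding, and hand-picked weights. All names (`VOCAB`, `mlp`, `run_pipeline`) and all numeric values are hypothetical and chosen only so the arithmetic is easy to follow.

```python
import numpy as np

# Toy assumption: each token owns one orthogonal direction in the residual stream.
VOCAB = ["France", "the", "Paris", "London"]
E = np.eye(len(VOCAB))
FRANCE, THE, PARIS, LONDON = E  # rows of the identity matrix


def relu(x):
    return np.maximum(x, 0.0)


def mlp(resid, w_in, w_out):
    """One-hidden-layer MLP: w_in and w_out both have shape (hidden, d_model)."""
    return relu(w_in @ resid) @ w_out


def run_pipeline():
    # Step 1: attention heads write to the residual stream with equal weight.
    # One head carries the topic token, another carries a distractor (hypothetical).
    head_outputs = [FRANCE, THE]
    resid = np.sum(head_outputs, axis=0)

    # Step 2: an MLP amplifies the topic direction and erases the distractor.
    select_in = np.stack([FRANCE, THE])
    select_out = np.stack([2.0 * FRANCE, -1.0 * THE])
    resid = resid + mlp(resid, select_in, select_out)

    # Step 3: a deeper MLP reads "France" and writes towards "Paris",
    # playing the role of an implicit get_capital(X).
    capital_in = FRANCE[None, :]
    capital_out = 2.0 * PARIS[None, :]
    resid = resid + mlp(resid, capital_in, capital_out)

    # Unembed: with an identity unembedding, logits equal the residual stream.
    logits = E @ resid
    return VOCAB[int(np.argmax(logits))], logits


if __name__ == "__main__":
    prediction, logits = run_pipeline()
    print(prediction, dict(zip(VOCAB, logits.tolist())))
```

In this sketch the distractor direction is cancelled in step 2 and the answer direction only appears in step 3, which mirrors the division of labor described above. Real models do not have orthogonal per-token directions or such clean weights, so the example should be read as an intuition aid only.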
