In this paper, we explore several mechanisms that Transformer-based language models employ in factual recall tasks. In zero-shot scenarios, given a prompt such as ``The capital of France is,'' task-specific attention heads extract the topic entity, e.g., ``France,'' from the context and pass it to subsequent MLPs, which recall the required answer, e.g., ``Paris.'' We introduce a novel analysis method that decomposes MLP outputs into human-interpretable components. With this method, we quantify the function of the MLP layer that follows these task-specific heads: in the residual stream, it either erases or amplifies the information originating from individual heads, and it generates a component that redirects the residual stream toward the direction of the expected answer. These zero-shot mechanisms are also employed in few-shot scenarios. In addition, we observe a widespread anti-overconfidence mechanism in the final layer of models, which suppresses correct predictions; leveraging our interpretation, we mitigate this suppression and improve factual recall confidence. Our interpretations are evaluated across various language models, including the GPT-2 family, 1.3B OPT, and 7B Llama-2, on diverse tasks spanning multiple domains of factual knowledge.