Successor Heads: Recurring, Interpretable Attention Heads In The Wild

In this work we present successor heads: attention heads that increment tokens with a natural ordering, such as numbers, months, and days. For example, successor heads increment 'Monday' into 'Tuesday'. We explain the successor head behavior with an approach rooted in mechanistic interpretability, the field that aims to explain how models complete tasks in human-understandable terms. Existing research in this area has found interpretable language model components in small toy models. However, results in toy models have not yet led to insights that explain the internals of frontier models, and little is currently understood about the internal operations of large language models. In this paper, we analyze the behavior of successor heads in large language models (LLMs) and find that they implement abstract representations that are common to different architectures. They form in LLMs such as GPT-2, Pythia, and Llama-2, with as few as 31 million parameters and at least as many as 12 billion parameters. We find a set of 'mod-10 features' that underlie how successor heads increment in LLMs across different architectures and sizes. We perform vector arithmetic with these features to edit head behavior and provide insights into numeric representations within LLMs. Additionally, we study the behavior of successor heads on natural language data, identifying interpretable polysemanticity in a Pythia successor head.
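To make the idea of incrementing via 'mod-10 features' and editing behavior with vector arithmetic more concrete, here is a minimal toy sketch in Python. It is not the paper's method and uses no real model: the one-hot features, the cyclic-shift "successor" map, and all function names (`mod10_feature`, `successor`, `edit`, `decode`) are hypothetical stand-ins chosen purely for illustration.

```python
# Toy illustration only: hypothetical one-hot "mod-10 features",
# not features extracted from any real language model.

def mod10_feature(d):
    """Return a one-hot vector of length 10 for the digit class d % 10."""
    return [1.0 if i == d % 10 else 0.0 for i in range(10)]

def embed(n):
    """Assumed toy representation: a number is encoded only by its last digit."""
    return mod10_feature(n)

def successor(vec):
    """Stand-in for a successor head: cyclically shift feature i to i + 1 (mod 10)."""
    return [vec[(i - 1) % 10] for i in range(10)]

def edit(vec, old_digit, new_digit):
    """Vector arithmetic: subtract one mod-10 feature and add another."""
    old = mod10_feature(old_digit)
    new = mod10_feature(new_digit)
    return [v - o + n for v, o, n in zip(vec, old, new)]

def decode(vec):
    """Read out the digit class with the largest activation."""
    return max(range(10), key=lambda i: vec[i])

if __name__ == "__main__":
    x = embed(13)
    print(decode(successor(x)))              # successor of last digit 3
    print(decode(successor(edit(x, 3, 7))))  # after swapping feature 3 for 7
    print(decode(successor(embed(9))))       # wraps around modulo 10
```

In this toy setting, swapping one feature for another changes what the successor map outputs, which mirrors, in a very simplified way, the kind of intervention the abstract describes.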

