In this work we present successor heads: attention heads that increment tokens with a natural ordering, such as numbers, months, and days of the week. For example, a successor head maps 'Monday' to 'Tuesday'. We explain this behavior with an approach rooted in mechanistic interpretability, the field that aims to explain how models complete tasks in human-understandable terms. Existing research in this area has found interpretable language model components in small toy models. However, results in toy models have not yet led to insights that explain the internals of frontier models, and little is currently understood about the internal operations of large language models (LLMs). In this paper, we analyze the behavior of successor heads in LLMs and find that they implement abstract representations that are common to different architectures. Successor heads form in LLMs ranging from as few as 31 million parameters to at least 12 billion parameters, across model families including GPT-2, Pythia, and Llama-2. We find a set of 'mod-10 features' that underlie how successor heads increment in LLMs across different architectures and sizes. We perform vector arithmetic with these features to edit head behavior and provide insights into numeric representations within LLMs. Additionally, we study the behavior of successor heads on natural language data, identifying interpretable polysemanticity in a Pythia successor head.
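To make the two central ideas concrete, the following toy sketch may help. It is not the paper's method or code: the lookup tables, the wrap-around from the last element to the first, the one-hot "embeddings", and the names `successor`, `embed`, `mod10_feature`, and `edit_ones_digit` are all assumptions made for illustration. The first function mimics the input-output behavior attributed to a successor head; the second shows what editing a number by vector arithmetic on a mod-10 feature could look like in an idealized space where the ones digit and the tens digit occupy separate directions.

```python
# Toy illustration only: hand-built tables and one-hot vectors, not a real model.
ORDERED_CLASSES = {
    "days": ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"],
    "months": ["January", "February", "March", "April", "May", "June",
               "July", "August", "September", "October", "November",
               "December"],
}


def successor(token: str) -> str:
    """Return the next element in the token's natural ordering."""
    if token.isdigit():
        return str(int(token) + 1)
    for items in ORDERED_CLASSES.values():
        if token in items:
            # Assumption: cyclic classes wrap around (Sunday -> Monday).
            return items[(items.index(token) + 1) % len(items)]
    raise ValueError(f"no known ordering for token {token!r}")


def mod10_feature(digit: int) -> list[float]:
    """Hypothetical feature direction for 'ones digit == digit'."""
    vec = [0.0] * 20
    vec[digit] = 1.0
    return vec


def embed(n: int) -> list[float]:
    """Idealized embedding of 0..99: ones-digit one-hot plus tens-digit one-hot."""
    vec = mod10_feature(n % 10)
    vec[10 + n // 10] = 1.0
    return vec


def edit_ones_digit(n: int, new_digit: int) -> int:
    """Swap the mod-10 feature by vector arithmetic, then decode to the nearest number."""
    old = mod10_feature(n % 10)
    new = mod10_feature(new_digit)
    edited = [e - o + a for e, o, a in zip(embed(n), old, new)]
    return min(
        range(100),
        key=lambda m: sum((x - y) ** 2 for x, y in zip(edited, embed(m))),
    )
```

In this idealized setting, `successor("Monday")` returns `"Tuesday"`, and `edit_ones_digit(14, 7)` decodes to `17`, because subtracting the feature for 4 and adding the feature for 7 leaves the tens component untouched. Real model representations are not one-hot, so this shows only the shape of the operation, not its empirical behavior.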