Pre-trained language models can be surprisingly adept at tasks they were not explicitly trained on, but how they implement these capabilities is poorly understood. In this paper, we investigate the basic mathematical abilities often acquired by pre-trained language models. Concretely, we use mechanistic interpretability techniques to explain the (limited) mathematical abilities of GPT-2 small. As a case study, we examine its ability to take in sentences such as "The war lasted from the year 1732 to the year 17", and predict valid two-digit end years (years > 32). We first identify a circuit, a small subset of GPT-2 small's computational graph that computes this task's output. Then, we explain the role of each circuit component, showing that GPT-2 small's final multi-layer perceptrons boost the probability of end years greater than the start year. Finally, we find related tasks that activate our circuit. Our results suggest that GPT-2 small computes greater-than using a complex but general mechanism that activates across diverse contexts.
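To make the task setup concrete, the following is a minimal sketch rather than the authors' code. It builds a prompt of the form quoted above and scores a hypothetical next-token distribution over two-digit years. The helper names (`make_prompt`, `valid_end_years`, `probability_difference`) and the scoring rule (probability mass on valid years minus mass on invalid years) are illustrative assumptions; the abstract itself only states that valid end years are those greater than the start year.

```python
from typing import Dict, List


def make_prompt(noun: str, century: int, start_yy: int) -> str:
    """Build a prompt that stops right before the two-digit end year.

    Example: make_prompt("war", 17, 32) gives
    "The war lasted from the year 1732 to the year 17".
    """
    return (
        f"The {noun} lasted from the year {century}{start_yy:02d} "
        f"to the year {century}"
    )


def valid_end_years(start_yy: int) -> List[str]:
    """Two-digit completions strictly greater than the start year."""
    return [f"{yy:02d}" for yy in range(start_yy + 1, 100)]


def probability_difference(probs: Dict[str, float], start_yy: int) -> float:
    """Assumed score: mass on years > start_yy minus mass on years <= start_yy.

    `probs` maps two-digit year strings ("00".."99") to probabilities. In a
    real experiment these would come from the model's next-token
    distribution; here they are supplied by the caller.
    """
    valid = sum(p for yy, p in probs.items() if int(yy) > start_yy)
    invalid = sum(p for yy, p in probs.items() if int(yy) <= start_yy)
    return valid - invalid


if __name__ == "__main__":
    prompt = make_prompt("war", 17, 32)
    # A uniform distribution stands in for model output (assumption).
    uniform = {f"{yy:02d}": 1 / 100 for yy in range(100)}
    print(prompt)
    print(len(valid_end_years(32)), probability_difference(uniform, 32))
```

Under this assumed score, a model with no preference among years scores near the difference in counts of valid and invalid years, while a model that reliably predicts later years scores close to 1.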