Pre-trained language models can be surprisingly adept at tasks they were not explicitly trained on, but how they implement these capabilities is poorly understood. In this paper, we investigate the basic mathematical abilities often acquired by pre-trained language models. Concretely, we use mechanistic interpretability techniques to explain the (limited) mathematical abilities of GPT-2 small. As a case study, we examine its ability to take in sentences such as "The war lasted from the year 1732 to the year 17", and predict valid two-digit end years (years > 32). We first identify a circuit, a small subset of GPT-2 small's computational graph that computes this task's output. Then, we explain the role of each circuit component, showing that GPT-2 small's final multi-layer perceptrons boost the probability of end years greater than the start year. Finally, we find related tasks that activate our circuit. Our results suggest that GPT-2 small computes greater-than using a complex but general mechanism that activates across diverse contexts.
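To make the task concrete, the sketch below builds a prompt of the form quoted above and scores a distribution over two-digit end years. It is only an illustration, not the paper's own code: the helper names (`make_prompt`, `greater_than_score`) and the scoring rule (probability mass on end years above the start year minus the mass on the rest) are assumptions, and the logits are toy values rather than GPT-2 small outputs.

```python
import math


def make_prompt(noun: str, century: int, start_yy: int) -> str:
    """Build a prompt such as 'The war lasted from the year 1732 to the year 17'."""
    return (
        f"The {noun} lasted from the year {century:02d}{start_yy:02d} "
        f"to the year {century:02d}"
    )


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def greater_than_score(year_probs, start_yy: int) -> float:
    """Hypothetical metric: mass on end years > start_yy minus mass on end years <= start_yy.

    year_probs[i] is the probability assigned to the two-digit end year i (00..99).
    """
    if len(year_probs) != 100:
        raise ValueError("expected one probability per two-digit year (00..99)")
    valid = sum(year_probs[start_yy + 1:])
    invalid = sum(year_probs[:start_yy + 1])
    return valid - invalid


prompt = make_prompt("war", 17, 32)

# Toy logits standing in for a model's output: a model with no preference,
# and one that strongly favours end years after the start year.
uniform_probs = softmax([0.0] * 100)
biased_probs = softmax([5.0 if yy > 32 else 0.0 for yy in range(100)])

uniform_score = greater_than_score(uniform_probs, 32)
biased_score = greater_than_score(biased_probs, 32)
```

Under this assumed metric, a score near 1 means almost all probability falls on valid end years, while an indifferent model scores only as high as the share of valid years allows. In practice, `year_probs` would come from the model's next-token distribution restricted to the two-digit year tokens.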