Pre-trained language models can be surprisingly adept at tasks they were not explicitly trained on, but how they implement these capabilities is poorly understood. In this paper, we investigate the basic mathematical abilities often acquired by pre-trained language models. Concretely, we use mechanistic interpretability techniques to explain the (limited) mathematical abilities of GPT-2 small. As a case study, we examine its ability to take in sentences such as "The war lasted from the year 1732 to the year 17", and predict valid two-digit end years (years > 32). We first identify a circuit, a small subset of GPT-2 small's computational graph that computes this task's output. Then, we explain the role of each circuit component, showing that GPT-2 small's final multi-layer perceptrons boost the probability of end years greater than the start year. Finally, we find related tasks that activate our circuit. Our results suggest that GPT-2 small computes greater-than using a complex but general mechanism that activates across diverse contexts.
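The task setup described in the abstract can be made concrete with a minimal sketch. It builds a prompt of the quoted form, lists the valid two-digit end years, and scores a next-token distribution by how much probability mass falls on valid versus invalid years. The function names, the `probs` mapping from two-digit strings to probabilities, and the choice of a probability-difference score are illustrative assumptions, not details stated in the abstract; no model is loaded here.

```python
from typing import List, Mapping


def build_prompt(noun: str, century: int, start_yy: int) -> str:
    """Format a prompt that stops just before the two-digit end year.

    Hypothetical helper: the template follows the example sentence
    quoted in the abstract.
    """
    if not 0 <= start_yy <= 99:
        raise ValueError("start_yy must be a two-digit year (0-99)")
    return (
        f"The {noun} lasted from the year {century:02d}{start_yy:02d} "
        f"to the year {century:02d}"
    )


def valid_end_years(start_yy: int) -> List[str]:
    """Two-digit completions strictly greater than the start year."""
    return [f"{yy:02d}" for yy in range(start_yy + 1, 100)]


def probability_difference(probs: Mapping[str, float], start_yy: int) -> float:
    """Mass on years > start_yy minus mass on years <= start_yy.

    Assumption: `probs` maps two-digit strings ("00".."99") to the
    model's next-token probabilities. This score is one possible way
    to quantify "greater-than" behaviour.
    """
    valid = sum(p for yy, p in probs.items() if int(yy) > start_yy)
    invalid = sum(p for yy, p in probs.items() if int(yy) <= start_yy)
    return valid - invalid


if __name__ == "__main__":
    print(build_prompt("war", 17, 32))
    # A uniform distribution over all 100 two-digit years, for illustration only.
    uniform = {f"{yy:02d}": 0.01 for yy in range(100)}
    print(probability_difference(uniform, 32))
```

Under this sketch, a model that performs the task well would yield a score close to 1, while a model that ignores the start year would score near the value given by its prior over two-digit years.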