Pre-trained language models can be surprisingly adept at tasks they were not explicitly trained on, but how they implement these capabilities is poorly understood. In this paper, we investigate the basic mathematical abilities often acquired by pre-trained language models. Concretely, we use mechanistic interpretability techniques to explain the (limited) mathematical abilities of GPT-2 small. As a case study, we examine its ability to take in sentences such as "The war lasted from the year 1732 to the year 17", and predict valid two-digit end years (years > 32). We first identify a circuit, a small subset of GPT-2 small's computational graph that computes this task's output. Then, we explain the role of each circuit component, showing that GPT-2 small's final multi-layer perceptrons boost the probability of end years greater than the start year. Finally, we find related tasks that activate our circuit. Our results suggest that GPT-2 small computes greater-than using a complex but general mechanism that activates across diverse contexts.
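To make the task concrete, the following minimal sketch scores a next-token distribution over two-digit year completions by comparing the probability mass on valid end years (greater than the start year) with the mass on the rest. The abstract does not name a metric, so the difference-of-masses score, the helper names `valid_end_years` and `prob_greater_minus_rest`, and the toy `uniform` distribution are assumptions made for illustration. They are not taken from the paper's code, and no model is actually run here.

```python
from typing import Dict, List


def valid_end_years(start_yy: int) -> List[str]:
    """Two-digit completions strictly greater than the start year's last two digits."""
    return [f"{yy:02d}" for yy in range(start_yy + 1, 100)]


def prob_greater_minus_rest(probs: Dict[str, float], start_yy: int) -> float:
    """Probability mass on years > start_yy minus the mass on years <= start_yy.

    `probs` maps two-digit strings ("00".."99") to probabilities. In practice
    these would come from a language model's next-token distribution; here
    they are supplied by hand (hypothetical values).
    """
    greater = sum(p for yy, p in probs.items() if int(yy) > start_yy)
    rest = sum(p for yy, p in probs.items() if int(yy) <= start_yy)
    return greater - rest


# Toy distribution, made up for illustration: every two-digit year equally likely.
uniform = {f"{yy:02d}": 1 / 100 for yy in range(100)}

# For the prompt "The war lasted from the year 1732 to the year 17", start_yy is 32.
score = prob_greater_minus_rest(uniform, 32)
print(f"{len(valid_end_years(32))} valid completions, score = {score:.2f}")
```

Under this assumed score, a model with no preference for later years would land near the uniform baseline, while a model that reliably predicts end years after the start year would score close to 1.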