A primary criticism of language models (LMs) is their inscrutability. This paper presents evidence that, despite their size and complexity, LMs sometimes exploit a simple computational mechanism to solve one-to-one relational tasks (e.g., capital_of(Poland)=Warsaw). We investigate a range of model sizes (from 124M to 176B parameters) in an in-context learning setting and find that, for a variety of tasks (involving capital cities, upper-casing, and past-tensing), a key part of the mechanism reduces to a simple linear update typically applied by the feedforward networks (FFNs). These updates also tend to promote the output of the relation in a content-independent way (e.g., encoding Poland:Warsaw::China:Beijing), revealing a predictable pattern that these models follow in solving these tasks. We further show that this mechanism is specific to tasks that require retrieval from pretraining memory, rather than retrieval from local context. Our results contribute to a growing body of work on the mechanistic interpretability of LMs and offer reason to be optimistic that, despite the massive and non-linear nature of the models, the strategies they ultimately use to solve tasks can sometimes reduce to familiar and even intuitive algorithms.
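To make the idea of a content-independent additive update concrete, the following is a minimal toy sketch in Python with NumPy. It is not the paper's method or code and involves no actual language model: it assumes a synthetic embedding space in which every capital vector equals its country vector plus one shared offset (plus small noise), and all names (`shared_offset`, `embed`, `decode`, `apply_update`) are hypothetical. Under that assumption, an update estimated from a single pair (Poland, Warsaw) can be added to a different country's vector and decoded by nearest neighbor.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 64

countries = ["Poland", "China", "France"]
capitals = ["Warsaw", "Beijing", "Paris"]

# Toy assumption (not from the paper): each capital embedding is its
# country embedding plus one shared offset vector and a little noise.
country_vecs = rng.normal(size=(len(countries), dim))
shared_offset = rng.normal(size=dim)
noise = 0.05 * rng.normal(size=(len(capitals), dim))
capital_vecs = country_vecs + shared_offset + noise

vocab = countries + capitals
embeddings = np.vstack([country_vecs, capital_vecs])
# Unit-normalized copy, used only for cosine-similarity decoding.
embeddings_unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)


def embed(token):
    """Look up the toy embedding of a token."""
    return embeddings[vocab.index(token)]


def decode(hidden):
    """Return the vocabulary token whose embedding is most cosine-similar."""
    scores = embeddings_unit @ (hidden / np.linalg.norm(hidden))
    return vocab[int(np.argmax(scores))]


# Estimate the update from a single example pair. It depends only on the
# relation (country -> capital), not on which country it is later applied to.
update = embed("Warsaw") - embed("Poland")


def apply_update(token):
    """Add the same relation vector to any argument's representation."""
    return embed(token) + update


for country in countries:
    print(country, "->", decode(apply_update(country)))
```

The same `update` vector is reused for every country, which is the sense in which the update is content-independent in this sketch. How such a vector arises inside a real model's FFN layers is the subject of the paper and is not modeled here.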