A primary criticism of language models (LMs) is their inscrutability. This paper presents evidence that, despite their size and complexity, LMs sometimes exploit a computational mechanism familiar from traditional word embeddings: simple vector arithmetic that encodes abstract relations (e.g., Poland:Warsaw::China:Beijing). We investigate a range of model sizes (from 124M to 176B parameters) in an in-context learning setting and find that, for a variety of tasks (involving capital cities, upper-casing, and past-tensing), a key part of the mechanism reduces to a simple linear update applied by the feedforward networks. We further show that this mechanism is specific to tasks that require retrieval from pretraining memory rather than retrieval from local context. Our results contribute to a growing body of work on the mechanistic interpretability of LMs and offer reason to be optimistic that, despite the massive and non-linear nature of these models, the strategies they ultimately use to solve tasks can sometimes reduce to familiar and even intuitive algorithms.
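As a point of reference for the word-embedding mechanism mentioned above, the following is a minimal sketch of analogy completion by vector arithmetic. It is not the analysis carried out in this paper: the 3-dimensional vectors are hand-picked toy values rather than embeddings from any model, and the names `EMBEDDINGS`, `nearest`, and `complete_analogy` are hypothetical. The sketch only illustrates the idea that a relation such as "capital of" can act as an approximately constant offset added to an input vector.

```python
import math

# Hypothetical toy embeddings, hand-picked for illustration only.
# Each capital is roughly its country vector plus a shared offset.
EMBEDDINGS = {
    "Poland":  [1.0, 0.0, 0.2],
    "China":   [0.0, 1.0, 0.3],
    "France":  [0.7, 0.7, 0.1],
    "Warsaw":  [1.0, 0.1, 1.2],
    "Beijing": [0.1, 1.0, 1.3],
    "Paris":   [0.7, 0.6, 1.1],
}


def add(u, v):
    """Element-wise sum of two vectors."""
    return [a + b for a, b in zip(u, v)]


def sub(u, v):
    """Element-wise difference of two vectors."""
    return [a - b for a, b in zip(u, v)]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest(query, exclude):
    """Return the vocabulary word most similar to `query`, skipping `exclude`."""
    candidates = [w for w in EMBEDDINGS if w not in exclude]
    return max(candidates, key=lambda w: cosine(query, EMBEDDINGS[w]))


def complete_analogy(a, b, c):
    """Solve a:b::c:? by adding the offset (b - a) to c."""
    offset = sub(EMBEDDINGS[b], EMBEDDINGS[a])
    query = add(EMBEDDINGS[c], offset)
    # Excluding the three input words is the usual convention for this test.
    return nearest(query, exclude={a, b, c})


print(complete_analogy("Poland", "Warsaw", "China"))
```

With these toy values, the offset estimated from the Poland:Warsaw pair carries China to the neighborhood of Beijing. Whether a similar additive update appears inside a language model is the empirical question the paper addresses.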