Language models learn a great quantity of factual information during pretraining, and recent work localizes this information to specific model weights like mid-layer MLP weights. In this paper, we find that we can change how a fact is stored in a model by editing weights that are in a different location than where existing methods suggest that the fact is stored. This is surprising because we would expect that localizing facts to specific model parameters would tell us where to manipulate knowledge in models, and this assumption has motivated past work on model editing methods. Specifically, we show that localization conclusions from representation denoising (also known as Causal Tracing) do not provide any insight into which model MLP layer would be best to edit in order to override an existing stored fact with a new one. This finding raises questions about how past work relies on Causal Tracing to select which model layers to edit. Next, we consider several variants of the editing problem, including erasing and amplifying facts. For one of our editing problems, editing performance does relate to localization results from representation denoising, but we find that which layer we edit is a far better predictor of performance. Our results suggest, counterintuitively, that better mechanistic understanding of how pretrained language models work may not always translate to insights about how to best change their behavior. Our code is available at https://github.com/google/belief-localization
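To make the representation denoising (Causal Tracing) procedure concrete, the following is a minimal sketch on a toy residual MLP stack, not the released code. The network, its dimensions, the noise scale, and the names `forward` and `tracing_effects` are all illustrative assumptions. The sketch runs a clean input and a noised input, then restores one layer's clean MLP output into the noised run and records how much of the target probability is recovered.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def forward(x, layers, readout, patch=None):
    """Run a toy residual MLP stack (hypothetical stand-in for a language model).

    patch: optional (layer_index, mlp_output) pair; that layer's MLP output is
    replaced by the given vector. Returns (output probabilities, MLP outputs).
    """
    h = x
    mlp_outs = []
    for i, (w_in, w_out) in enumerate(layers):
        m = w_out @ np.tanh(w_in @ h)        # MLP output at layer i
        if patch is not None and patch[0] == i:
            m = patch[1]                      # restore the clean MLP output
        mlp_outs.append(m)
        h = h + m                             # residual connection
    return softmax(readout @ h), mlp_outs

def tracing_effects(x_clean, x_noised, layers, readout, target):
    """Per-layer gain in target probability from restoring clean MLP outputs."""
    p_clean, clean_outs = forward(x_clean, layers, readout)
    p_noised, _ = forward(x_noised, layers, readout)
    effects = np.array([
        forward(x_noised, layers, readout, patch=(i, clean_outs[i]))[0][target]
        - p_noised[target]
        for i in range(len(layers))
    ])
    return p_clean[target], p_noised[target], effects

# Synthetic setup: random weights and inputs, chosen only for illustration.
rng = np.random.default_rng(0)
d, hidden, n_layers, n_out = 8, 16, 4, 5
layers = [
    (0.5 * rng.normal(size=(hidden, d)), 0.5 * rng.normal(size=(d, hidden)))
    for _ in range(n_layers)
]
readout = rng.normal(size=(n_out, d))
x_clean = rng.normal(size=d)
x_noised = x_clean + rng.normal(scale=1.0, size=d)   # corrupted input
target = int(np.argmax(forward(x_clean, layers, readout)[0]))

p_clean, p_noised, effects = tracing_effects(x_clean, x_noised, layers, readout, target)
print("clean p(target):", p_clean)
print("noised p(target):", p_noised)
print("tracing effect per layer:", effects)
```

In this toy setting, the layer with the largest tracing effect is the one that denoising would "localize" the prediction to. The question raised above is whether that layer is also the best one to edit, and this sketch does not address editing at all.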