Methods for knowledge editing and unlearning in large language models seek to edit or remove undesirable knowledge or capabilities without compromising general language modeling performance. This work investigates how mechanistic interpretability -- which, in part, aims to identify model components (circuits) associated with specific interpretable mechanisms that make up a model capability -- can improve the precision and effectiveness of editing and unlearning. We find stark differences in unlearning and edit robustness when training components localized by different methods. We highlight an important distinction between methods that localize components based primarily on preserving outputs and those that find high-level mechanisms with predictable intermediate states. In particular, localizing edits/unlearning to components associated with the lookup-table mechanism for factual recall 1) leads to more robust edits/unlearning across different input/output formats, and 2) resists attempts to relearn the unwanted information, while also reducing unintended side effects compared to baselines, on both a sports facts dataset and the CounterFact dataset across multiple models. We also find that certain localized edits disrupt latent knowledge in the model more than any baseline, making unlearning more robust to a variety of attacks.
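To make the idea of training only localized components concrete, below is a minimal sketch of localized unlearning, assuming a standard PyTorch/HuggingFace setup. All parameters are frozen except those of a hypothetical set of localized components (here, mid-depth MLP blocks, a commonly reported locus of lookup-table-style factual recall), which are fine-tuned with a gradient-ascent unlearning objective. The model name, layer indices, and objective are illustrative assumptions, not the paper's exact localization method or loss.

```python
# Minimal sketch (not the authors' exact procedure) of localized unlearning:
# freeze everything, then fine-tune only the components a localization
# method selected.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; the paper evaluates multiple models
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Hypothetical output of a localization method: train only these MLP blocks.
localized_layers = [4, 5, 6]  # assumed layer indices, for illustration

for param in model.parameters():
    param.requires_grad = False  # freeze everything by default
for i in localized_layers:
    for param in model.transformer.h[i].mlp.parameters():
        param.requires_grad = True  # unfreeze the localized components

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)

# One unlearning step: gradient ascent on the fact to be forgotten
# (one of several possible unlearning objectives).
prompt = "Michael Jordan plays the sport of basketball"
batch = tokenizer(prompt, return_tensors="pt")
loss = model(**batch, labels=batch["input_ids"]).loss
(-loss).backward()  # ascend the language-modeling loss on the unwanted fact
optimizer.step()
optimizer.zero_grad()
```

The design choice this sketch illustrates is the paper's central variable: which components get unfrozen. Output-preserving localization and mechanism-based localization (e.g., targeting the lookup-table components) would select different layer/module sets in place of the assumed `localized_layers`.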