As Large Language Models (LLMs) have achieved remarkable success, understanding and controlling their complex internal mechanisms has become an urgent problem. Recent research has attempted to interpret their behavior through the lens of internal representations. However, developing practical and efficient methods that apply these representations to general and flexible model editing remains challenging. In this work, we explore how representation engineering methods can guide the editing of LLMs by deploying a representation sensor as an oracle. We first identify the importance of a robust and reliable sensor during editing, and then propose an Adversarial Representation Engineering (ARE) framework that provides a unified and interpretable approach to conceptual model editing without compromising baseline performance. Experiments on multiple model editing paradigms demonstrate the effectiveness of ARE in various settings. Code and data are available at https://github.com/Zhang-Yihao/Adversarial-Representation-Engineering.