As the rapid development of Large Language Models (LLMs) has achieved remarkable success, understanding and rectifying their complex internal mechanisms has become an urgent problem. Recent research has attempted to interpret their behaviors through the lens of inner representations. However, developing practical and efficient methods for applying these representations to general and flexible model editing remains challenging. In this work, we explore how to leverage insights from representation engineering to guide the editing of LLMs by deploying a representation sensor as an editing oracle. We first identify the importance of a robust and reliable sensor during editing, and then propose an Adversarial Representation Engineering (ARE) framework that provides a unified and interpretable approach to conceptual model editing without compromising baseline performance. Experiments on multiple tasks demonstrate the effectiveness of ARE in various model editing scenarios. Our code and data are available at https://github.com/Zhang-Yihao/Adversarial-Representation-Engineering.
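To make the adversarial dynamic concrete, the sketch below illustrates one plausible reading of the alternating loop in PyTorch on a toy stand-in for an LLM: a small sensor (discriminator) is trained to separate hidden representations that do and do not express a target concept, and the model is then fine-tuned against the effectively frozen sensor so its representations are judged as expressing the edited concept. All names here (TinyModel, HIDDEN, the toy data) are hypothetical placeholders for illustration, not the paper's actual architecture or training recipe.

```python
# Minimal sketch of an adversarial representation-editing loop
# (toy stand-ins; assumes only PyTorch).
import torch
import torch.nn as nn

torch.manual_seed(0)
HIDDEN = 32

class TinyModel(nn.Module):
    """Toy stand-in for an LLM; exposes an intermediate representation."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(16, HIDDEN)
        self.head = nn.Linear(HIDDEN, 2)

    def forward(self, x):
        rep = torch.tanh(self.encoder(x))  # "hidden representation"
        return self.head(rep), rep

# The sensor acts as a discriminator over representations: does this
# representation express the target concept (label 1) or not (label 0)?
model = TinyModel()
sensor = nn.Sequential(nn.Linear(HIDDEN, 16), nn.ReLU(), nn.Linear(16, 1))

opt_sensor = torch.optim.Adam(sensor.parameters(), lr=1e-3)
opt_model = torch.optim.Adam(model.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

# Toy data standing in for concept-expressing vs. contrast prompts.
pos_x = torch.randn(64, 16) + 1.0
neg_x = torch.randn(64, 16) - 1.0

for step in range(100):
    # 1) Sensor step: train the discriminator to separate the two
    #    representation distributions, yielding a robust editing oracle.
    with torch.no_grad():
        _, pos_rep = model(pos_x)
        _, neg_rep = model(neg_x)
    reps = torch.cat([pos_rep, neg_rep])
    labels = torch.cat([torch.ones(64, 1), torch.zeros(64, 1)])
    loss_s = bce(sensor(reps), labels)
    opt_sensor.zero_grad()
    loss_s.backward()
    opt_sensor.step()

    # 2) Editing step: fine-tune only the model so that representations of
    #    the contrast inputs are judged by the sensor as expressing the
    #    target concept. Gradients flow through the sensor, but only the
    #    model's optimizer steps, so the sensor is effectively frozen here.
    _, neg_rep = model(neg_x)
    loss_m = bce(sensor(neg_rep), torch.ones(64, 1))
    opt_model.zero_grad()
    loss_m.backward()
    opt_model.step()
```

The alternation mirrors a GAN-style setup: the sensor is repeatedly re-fit so it remains a reliable oracle as the model's representations shift, which is the robustness requirement the abstract identifies as essential for dependable editing.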