We consider representation misdirection (RM), a class of LLM unlearning methods that achieve forgetting by manipulating forget-representations, that is, the latent representations of forget samples. Despite their importance, the roles of the target vectors used in RM remain underexplored. Here, we revisit RM through the lens of the linear representation hypothesis. Specifically, if one can identify a one-dimensional representation corresponding to a high-level concept, the linear representation hypothesis permits linear operations on this concept vector within the forget-representation space. Under this view, we hypothesize that, beyond forgetting, machine unlearning elicits controllable side behaviors and stronger side capabilities corresponding to the high-level concept. We empirically validate this hypothesis across a wide range of tasks, including behavioral control (e.g., controlling an unlearned model's truthfulness, sentiment, and refusal) and capability enhancement (e.g., improving an unlearned model's in-context learning). Our findings reveal that this striking phenomenon is either a hidden risk if misused or a mechanism that can be harnessed to develop models that require stronger capabilities and controllable behaviors.
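To make the linear-operation view concrete, here is a minimal, self-contained sketch (not the paper's implementation) of what the linear representation hypothesis permits: estimate a one-dimensional concept direction as the difference of mean activations between two contrastive sets of samples, then shift a representation along that direction by a scalar amount. All data, dimensions, and the `steer` helper are hypothetical, chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # hypothetical hidden dimension

# Synthetic activations: one set where the high-level concept is present,
# one where it is absent (stand-ins for real model representations).
true_dir = rng.normal(size=d)
true_dir /= np.linalg.norm(true_dir)
pos = rng.normal(size=(200, d)) + 3.0 * true_dir  # concept present
neg = rng.normal(size=(200, d))                   # concept absent

# Difference-of-means estimate of the one-dimensional concept vector.
v = pos.mean(axis=0) - neg.mean(axis=0)
v /= np.linalg.norm(v)

def steer(h, v, alpha):
    """Linearly shift a representation h along the concept direction v."""
    return h + alpha * v

h = rng.normal(size=d)
h_more = steer(h, v, +5.0)  # amplify the concept
h_less = steer(h, v, -5.0)  # suppress the concept

# The projection onto v moves by exactly alpha (here, +5 and -5).
print(round(float((h_more - h) @ v), 3))
print(round(float((h_less - h) @ v), 3))
```

Because the concept occupies a single direction, adding or subtracting a scaled multiple of `v` changes only the concept component of the representation, which is the sense in which unlearning-time manipulations could also induce controllable side behaviors.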