Representation Misdirection for Unlearning (RMU), which steers model representations at an intermediate layer toward a target random representation, is an effective method for large language model (LLM) unlearning. Despite its strong performance, the underlying causes of its effectiveness remain underexplored. In this paper, we first theoretically demonstrate that steering forget-sample representations at an intermediate layer reduces token confidence, causing LLMs to generate wrong or nonsensical responses. Second, we investigate how the steering coefficient influences the alignment of forget-sample representations with the random direction and suggest optimal coefficient values for effective unlearning across different network layers. Third, we show that RMU-unlearned models are robust against adversarial jailbreak attacks. Finally, our empirical analysis shows that RMU is less effective when applied to the middle and later layers of LLMs. To address this drawback, we propose Adaptive RMU, a simple yet effective alternative method that makes unlearning effective across most layers. Extensive experiments demonstrate that Adaptive RMU significantly improves unlearning performance compared to prior methods while incurring no additional computational cost.
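For concreteness, below is a minimal sketch of the RMU objective with the layer-adaptive coefficient described above. It assumes HuggingFace-style models that expose `hidden_states`; the function name `rmu_loss`, the hyperparameter names `beta` and `alpha`, and the batch format are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def rmu_loss(updated_model, frozen_model, forget_batch, retain_batch,
             layer_idx, u, beta=6.5, alpha=1.0):
    """One step of an RMU-style objective (illustrative sketch only).

    updated_model / frozen_model: same architecture; only the updated
    model is trained. `u` is a fixed random unit vector with the hidden
    size of the model; `layer_idx` selects the intermediate layer whose
    forget-sample representations are steered toward the random target.
    """
    # Updated-model activations on forget samples at the chosen layer.
    h_forget = updated_model(
        **forget_batch, output_hidden_states=True
    ).hidden_states[layer_idx]

    with torch.no_grad():
        # Frozen-model activations: the retain-loss target, and the norm
        # used by the adaptive coefficient.
        h_forget_frozen = frozen_model(
            **forget_batch, output_hidden_states=True
        ).hidden_states[layer_idx]
        h_retain_frozen = frozen_model(
            **retain_batch, output_hidden_states=True
        ).hidden_states[layer_idx]

    # Vanilla RMU uses a fixed coefficient c for the target c * u.
    # The adaptive variant scales it per sample as beta * ||h_frozen||,
    # so the target magnitude tracks each layer's activation scale.
    c = beta * h_forget_frozen.norm(dim=-1, keepdim=True)
    forget_loss = F.mse_loss(h_forget, c * u)

    # Retain loss: keep retain-sample activations close to the frozen model.
    h_retain = updated_model(
        **retain_batch, output_hidden_states=True
    ).hidden_states[layer_idx]
    retain_loss = F.mse_loss(h_retain, h_retain_frozen)

    return forget_loss + alpha * retain_loss
```

In this reading, the fixed coefficient of vanilla RMU must be re-tuned per layer, whereas scaling the target by the frozen model's activation norm adapts it automatically, which is one way to interpret why the adaptive variant stays effective across most layers.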