Large language models may encode sensitive information or outdated knowledge that must be removed to ensure responsible and compliant model responses. Unlearning has emerged as an efficient alternative to full retraining, aiming to remove specific knowledge while preserving overall model utility. Existing evaluations of unlearning methods focus on (1) how thoroughly the target knowledge (the forget set) is forgotten and (2) how well performance is maintained on the retain set (i.e., utility). However, these evaluations overlook an important usability aspect: users may still want the model to leverage the removed information if it is re-introduced in the prompt. In a systematic evaluation of six state-of-the-art unlearning methods, we find that they consistently impair such contextual utility. To address this, we augment unlearning objectives with a plug-in term that preserves the model's ability to use forgotten knowledge when it is present in context. Extensive experiments demonstrate that our approach restores contextual utility to near original levels while still maintaining effective forgetting and retain-set utility.
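The plug-in formulation described above can be illustrated with a minimal sketch. The paper does not specify the exact objective here, so all names (`forget_loss`, `retain_loss`, `contextual_loss`, the weight `lam`) and the additive combination are illustrative assumptions: a forgetting term (negated, as in gradient-ascent-style unlearning), a retain-set term, and a weighted contextual-utility term that penalizes degraded use of forgotten facts when they reappear in the prompt.

```python
# Hedged sketch of a plug-in contextual-utility term added to an
# unlearning objective. All function and parameter names are
# hypothetical illustrations, not the paper's actual API.

def unlearning_objective(forget_loss: float,
                         retain_loss: float,
                         contextual_loss: float,
                         lam: float = 1.0) -> float:
    """Combined training objective (lower is better):

    - ``-forget_loss``: encourage forgetting by pushing the model's
      loss on the forget set UP (gradient-ascent-style term).
    - ``retain_loss``: keep standard performance on the retain set.
    - ``lam * contextual_loss``: the plug-in term; penalizes failure
      to answer correctly when the forgotten knowledge is supplied
      in the prompt context.
    """
    return -forget_loss + retain_loss + lam * contextual_loss


# Toy usage with made-up per-term loss values:
total = unlearning_objective(forget_loss=2.0,
                             retain_loss=1.0,
                             contextual_loss=0.5,
                             lam=2.0)
print(total)  # -2.0 + 1.0 + 2.0 * 0.5 = 0.0
```

In practice each term would be a cross-entropy loss computed on the corresponding data split (forget set, retain set, and context-augmented forget prompts), but the scalar combination above captures the structure of the augmented objective.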