Machine unlearning aims to efficiently eliminate the influence of specific training data, known as the forget set, from the model. However, existing unlearning methods for Large Language Models (LLMs) face a critical challenge: they rely solely on negative feedback to suppress responses related to the forget set, which often results in nonsensical or inconsistent outputs, diminishing model utility and posing potential privacy risks. To address this limitation, we propose a novel approach called Alternate Preference Optimization (AltPO), which combines negative feedback with in-domain positive feedback on the forget set. Additionally, we introduce new evaluation metrics to assess the quality of responses related to the forget set. Extensive experiments show that our approach not only enables effective unlearning but also avoids undesirable model behaviors while maintaining overall model performance. Our implementation can be found at https://github.com/molereddy/Alternate-Preference-Optimization.
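The core idea of combining negative feedback on the original forget-set response with positive feedback on an alternate in-domain response can be illustrated with a DPO-style pairwise objective. The sketch below is a minimal illustration, not the paper's exact AltPO loss: it assumes the alternate (plausible, in-domain) response is treated as preferred and the original forget-set response as dispreferred, with log-probabilities under the current policy and a frozen reference model; the function name and `beta` scaling follow standard DPO conventions.

```python
import math

def pairwise_unlearning_loss(logp_alt, logp_forget,
                             ref_logp_alt, ref_logp_forget,
                             beta=0.1):
    """Illustrative DPO-style loss for unlearning with positive feedback.

    logp_alt / logp_forget: policy log-probs of the alternate (preferred)
        and original forget-set (dispreferred) responses.
    ref_logp_*: the same quantities under a frozen reference model.
    The loss shrinks as the policy raises the alternate response's
    probability relative to the forget response (both measured against
    the reference), so suppression is paired with a coherent replacement
    rather than pure negation.
    """
    margin = (logp_alt - ref_logp_alt) - (logp_forget - ref_logp_forget)
    # negative log-sigmoid of the scaled margin
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# At initialization (policy == reference) the margin is zero and the
# loss is log(2); widening the margin in favor of the alternate
# response drives the loss down.
baseline = pairwise_unlearning_loss(-5.0, -5.0, -5.0, -5.0)
improved = pairwise_unlearning_loss(-2.0, -8.0, -5.0, -5.0)
```

In this framing, the "positive feedback" is simply the preferred side of the pair: the model is rewarded for producing a fluent in-domain alternative instead of merely being penalized on the forget response.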