Mitigating the retention of sensitive or private information in large language models is essential for enhancing privacy and safety. Existing unlearning methods, such as Gradient Ascent and Negative Preference Optimization, directly tune models to remove unwanted information. However, these methods often become unstable because they fine-tune by maximizing cross-entropy loss, reversing the loss minimization used in standard training. This reversal creates instability, especially on larger datasets, as the model struggles to balance unlearning with preserving its language capabilities, leading to over-unlearning. In this paper, we introduce UnDIAL (Unlearning via Self-Distillation on Adjusted Logits), a novel and robust unlearning method. Our approach leverages self-distillation to adjust logits and selectively reduce the influence of targeted tokens. This technique ensures smooth convergence and avoids catastrophic forgetting, even in challenging unlearning tasks with large datasets and sequential unlearning requests. Extensive experiments show that UnDIAL achieves both robust unlearning and scalability while maintaining stable training dynamics and resilience to hyperparameter choices.
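As a rough illustration of the core idea (not a verbatim reproduction of the paper's objective), the PyTorch-style sketch below builds a teacher distribution from the model's own logits with the targeted token's logit lowered by an assumed margin `gamma`, then distills the student toward it; the function name, `gamma`, and tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def self_distill_adjusted_logits_loss(logits, target_ids, gamma=3.0):
    """Sketch: distill the model toward its own logits after lowering the
    targeted token's logit by `gamma` (an assumed reduction strength).

    logits:     (batch, seq_len, vocab_size) student outputs on the forget set
    target_ids: (batch, seq_len) tokens whose influence should be reduced
    """
    vocab_size = logits.size(-1)
    with torch.no_grad():
        # Teacher: the same logits, but with the targeted token pushed down,
        # shifting probability mass away from the memorized continuation.
        one_hot = F.one_hot(target_ids, vocab_size).to(logits.dtype)
        teacher_probs = F.softmax(logits.detach() - gamma * one_hot, dim=-1)
    # Student: soft-label cross-entropy against the adjusted self-teacher,
    # keeping the objective a bounded loss minimization rather than the
    # unbounded maximization used by gradient-ascent-style unlearning.
    log_probs = F.log_softmax(logits, dim=-1)
    return -(teacher_probs * log_probs).sum(dim=-1).mean()
```

In practice one would also mask padding positions and average only over forget-set tokens; both are omitted here for brevity.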