During pretraining, LLMs inadvertently memorize sensitive or copyrighted data, posing significant compliance challenges under legal frameworks such as the GDPR and the EU AI Act. Meeting these mandates demands techniques that can remove information from a deployed model without retraining it from scratch. Existing unlearning approaches attempt to address this need, but they often leak the very data they aim to erase, sacrifice fluency and robustness, or depend on costly external reward models. We introduce PURGE (Policy Unlearning through Relative Group Erasure), a novel method, grounded in the Group Relative Policy Optimization (GRPO) framework, that formulates unlearning as a verifiable problem. PURGE uses an intrinsic reward signal that penalizes any mention of forbidden concepts, enabling safe and consistent unlearning. Our approach reduces token usage per target by up to a factor of 46 compared with state-of-the-art methods, while improving fluency by 5.48 percent and adversarial robustness by 12.02 percent over the base model. On the Real World Knowledge Unlearning (RWKU) benchmark, PURGE achieves 11 percent unlearning effectiveness while preserving 98 percent of original utility. PURGE shows that framing LLM unlearning as a verifiable task enables more reliable, efficient, and scalable forgetting, suggesting a promising new direction for unlearning research that combines theoretical guarantees, improved safety, and practical deployment efficiency.
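The core idea above can be sketched in a few lines. This is a hedged illustration, not the paper's implementation: an intrinsic, verifiable reward that penalizes any mention of a forbidden concept, combined with the group-relative advantage normalization used in GRPO. All names (`intrinsic_reward`, `forbidden`, the toy completions) are illustrative assumptions.

```python
# Hedged sketch of a PURGE-style verifiable reward with GRPO-style
# group-relative advantages. Illustrative only; details are assumptions.
from statistics import mean, pstdev

def intrinsic_reward(completion: str, forbidden_concepts: list[str]) -> float:
    """Verifiable check: -1.0 if any forbidden concept is mentioned, else +1.0."""
    text = completion.lower()
    return -1.0 if any(c.lower() in text for c in forbidden_concepts) else 1.0

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO normalizes each reward against its sampling group:
    A_i = (r_i - mean(group)) / std(group)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    if sigma == 0.0:  # all completions scored alike: no learning signal
        return [0.0] * len(rewards)
    return [(r - mu) / sigma for r in rewards]

# Toy sampling group for one unlearning target (hypothetical example).
forbidden = ["Harry Potter"]
completions = [
    "Harry Potter is a fictional wizard.",   # mentions the target -> penalized
    "I don't have information about that.",  # clean -> rewarded
    "That name is unfamiliar to me.",        # clean -> rewarded
]
rewards = [intrinsic_reward(c, forbidden) for c in completions]
advantages = group_relative_advantages(rewards)
```

Because the reward is a deterministic check against the forbidden-concept list rather than the output of a learned judge, it is cheap to compute and verifiable, which is what avoids the external reward models the abstract criticizes.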