Large language models are finetuned to refuse questions about hazardous knowledge, but these protections can often be bypassed. Unlearning methods aim to remove hazardous capabilities from models entirely, making them inaccessible to adversaries. This work challenges the fundamental differences between unlearning and traditional safety post-training from an adversarial perspective. We demonstrate that existing jailbreak methods, previously reported as ineffective against unlearning, can succeed when applied carefully. Furthermore, we develop a variety of adaptive methods that recover most supposedly unlearned capabilities. For instance, we show that finetuning on 10 unrelated examples or removing specific directions in the activation space can recover most hazardous capabilities for models edited with RMU, a state-of-the-art unlearning method. Our findings challenge the robustness of current unlearning approaches and question their advantages over safety training.
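To illustrate the activation-space intervention mentioned above, the sketch below shows a generic directional-ablation step: projecting hidden states onto the orthogonal complement of a single direction at one transformer layer. This is a minimal sketch of the general technique, not the paper's exact procedure; the layer index and the vector `unlearning_dir` are hypothetical placeholders, and how the direction is found is not reproduced here.

```python
import torch

def remove_direction(activations: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Project activations onto the orthogonal complement of `direction`.

    activations: (..., d) hidden states from a transformer layer
    direction:   (d,) vector to ablate (normalized below)
    """
    direction = direction / direction.norm()
    # Subtract each activation's component along `direction`.
    return activations - (activations @ direction).unsqueeze(-1) * direction

# Hypothetical usage: ablate the direction at one layer of a Hugging Face model
# via a forward hook (layer index and `unlearning_dir` are illustrative only).
# handle = model.model.layers[7].register_forward_hook(
#     lambda mod, inp, out: (remove_direction(out[0], unlearning_dir),) + out[1:]
# )
```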