Defensive Unlearning with Adversarial Training for Robust Concept Erasure in Diffusion Models

Diffusion models (DMs) have achieved remarkable success in text-to-image generation, but they also pose safety risks, such as the generation of harmful content and copyright violations. Machine unlearning techniques, also known as concept erasing, have been developed to address these risks. However, these techniques remain vulnerable to adversarial prompt attacks, which can induce unlearned DMs to regenerate undesired images containing concepts (such as nudity) that were meant to be erased. This work aims to enhance the robustness of concept erasing by integrating the principle of adversarial training (AT) into machine unlearning, resulting in a robust unlearning framework that we call AdvUnlearn. Achieving this effectively and efficiently is highly nontrivial. First, we find that a straightforward implementation of AT compromises DMs' image generation quality after unlearning. To address this, we develop a utility-retaining regularization on an additional retain set, which improves the trade-off between concept erasure robustness and model utility in AdvUnlearn. Second, we identify the text encoder as a more suitable module for robustification than the UNet, which ensures unlearning effectiveness. The resulting text encoder can also serve as a plug-and-play robust unlearner for various DM types. Empirically, we perform extensive experiments demonstrating the robustness advantage of AdvUnlearn across various DM unlearning scenarios, including the erasure of nudity, objects, and style concepts. In addition to robustness, AdvUnlearn achieves a balanced trade-off with model utility. To our knowledge, this is the first work to systematically explore robust DM unlearning through AT, setting it apart from existing methods that overlook robustness in concept erasing. Code is available at: https://github.com/OPTML-Group/AdvUnlearn
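
As a reading aid, the combination of adversarial training and utility-retaining regularization described above can be sketched as a bi-level problem. The notation here is an assumption introduced for illustration and is not taken from the abstract: \theta denotes the parameters being updated (the text encoder, per the abstract), c is a prompt for the concept to erase, \mathcal{A}(c) is a hypothetical set of allowed adversarial perturbations of c, \ell_{\mathrm{atk}} is an attack loss that is small when the model regenerates the erased concept, \ell_{\mathrm{u}} is an unlearning loss, \ell_{\mathrm{r}} is a regularizer evaluated on a retain set \mathcal{C}_{\mathrm{retain}}, and \gamma > 0 weights the regularizer.

```latex
\begin{aligned}
\min_{\theta}\quad & \ell_{\mathrm{u}}(\theta, c^{*}) \;+\; \gamma\, \ell_{\mathrm{r}}(\theta, \mathcal{C}_{\mathrm{retain}}) \\
\text{s.t.}\quad & c^{*} \in \arg\min_{c' \in \mathcal{A}(c)} \ell_{\mathrm{atk}}(\theta, c')
\end{aligned}
```

Under these assumptions, the inner problem finds the adversarial prompt most likely to revive the concept, and the outer problem unlearns against that prompt while the \gamma-weighted term discourages loss of generation quality on retained prompts. Setting \gamma = 0 would correspond to the straightforward AT variant that the abstract reports as harmful to image quality.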
