With the advance of generative AI, text-to-image (T2I) models can generate a wide variety of content. However, T2I models can still generate unsafe content. To alleviate this issue, various concept erasing methods have been proposed. However, existing methods tend to erase unsafe concepts excessively and to suppress benign concepts contained in harmful prompts, which can degrade model utility. In this paper, we focus on eliminating unsafe content while preserving the model's ability to interpret safe semantics, by optimizing the concept erasing reward (CER) with reinforcement learning. To avoid excessive content erasure, we introduce the Safe Adapter, which projects part of the text embedding for efficient concept regulation in cross-attention layers. Extensive experiments on different datasets demonstrate that, compared with existing state-of-the-art (SOTA) concept erasing methods, the proposed method reduces unsafe content generation while preserving the high fidelity of benign images. In terms of robustness, our method outperforms its counterparts against red-teaming tools. Moreover, we show that the proposed approach is more effective than others in emerging image-to-image (I2I) scenarios. Lastly, we extend our method to erase general concepts, such as artistic styles and objects. Disclaimer: This paper includes discussions of sexually explicit content that may be offensive to some readers. All images used in this work are synthesized or taken from public datasets.