Text-to-image diffusion models have been shown to produce unsafe content, such as violent, sexual, and shocking images, due to unfiltered large-scale training data, which necessitates erasing unsafe concepts from these models. Most existing methods modify the generation probabilities conditioned on texts that contain unsafe descriptions. However, they cannot guarantee safe generation for texts unseen during training, especially prompts crafted by adversarial attacks. In this paper, we re-analyze the erasure task and point out that existing methods cannot guarantee minimization of the total probability of unsafe generation. To tackle this problem, we propose Dark Miner, a recurring three-stage process of mining, verifying, and circumventing: it greedily mines the embeddings that maximize the generation probability of unsafe concepts, and thereby reduces unsafe generation more effectively. In the experiments, we evaluate its performance on two inappropriate concepts, two objects, and two styles. Compared with 6 previous state-of-the-art methods, our method achieves better erasure and defense results in most cases, especially under 4 state-of-the-art attacks, while preserving the model's native generation capability. Our code will be available on GitHub.
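The recurring mine-verify-circumvent loop described above can be sketched in miniature. This is only an illustrative toy, not the paper's implementation: a linear scorer over embeddings stands in for the diffusion model's tendency to generate an unsafe concept, and the helper names (`unsafe_score`, `mine`, `verify`, `circumvent`) are hypothetical.

```python
# Toy sketch of a mine-verify-circumvent loop. Assumption: the "model" is a
# linear scorer over embeddings; the real method operates on a text-to-image
# diffusion model, and all names here are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
DIM = 16
# Hypothetical stand-in for the model's tendency to generate the unsafe
# concept: a weight vector w, with unsafe_score(e) = sigmoid(w . e).
w0 = rng.normal(size=DIM)

def unsafe_score(w, e):
    """Probability-like score that embedding e triggers unsafe output."""
    return 1.0 / (1.0 + np.exp(-(w @ e)))

def mine(w, steps=200, lr=0.5):
    """Greedily search for an embedding that maximizes the unsafe score."""
    e = rng.normal(size=DIM)
    for _ in range(steps):
        s = unsafe_score(w, e)
        e = e + lr * s * (1.0 - s) * w        # gradient ascent on the score
        e = e / max(np.linalg.norm(e), 1.0)   # keep the embedding bounded
    return e

def verify(w, e, threshold=0.6):
    """Does the mined embedding still trigger unsafe generation?"""
    return unsafe_score(w, e) > threshold

def circumvent(w, e):
    """Suppress the model's response to the mined embedding by removing the
    weight component aligned with it (a crude stand-in for fine-tuning)."""
    return w - (w @ e) / (e @ e) * e

w = w0.copy()
before = unsafe_score(w, mine(w))
for _ in range(50):                           # recurring three-stage process
    e = mine(w)
    if not verify(w, e):                      # no embedding exceeds threshold
        break
    w = circumvent(w, e)
after = unsafe_score(w, mine(w))
print(f"worst-case unsafe score: {before:.3f} -> {after:.3f}")
```

The loop terminates when even the greedily mined worst-case embedding no longer exceeds the unsafety threshold, which is the intuition behind minimizing the total probability of unsafe generation rather than only the probability under seen texts.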