Text-to-image diffusion models such as Stable Diffusion (SD) have recently shown a remarkable ability to generate high-quality content and have become emblematic of the recent wave of transformative AI. This progress, however, comes with growing concern about misuse of the technology, especially for producing copyrighted or NSFW (not safe for work) images. Although efforts have been made to filter inappropriate images and prompts, or to remove undesirable concepts and styles via model fine-tuning, the reliability of these safety mechanisms against diverse problematic prompts remains largely unexplored. In this work, we propose Prompting4Debugging (P4D), a debugging and red-teaming tool that automatically finds problematic prompts for diffusion models in order to test the reliability of a deployed safety mechanism. We demonstrate the efficacy of P4D in uncovering new vulnerabilities of SD models equipped with safety mechanisms. In particular, our results show that around half of the prompts in existing safe prompting benchmarks that were originally considered "safe" can in fact be manipulated to bypass many deployed safety mechanisms, including concept removal, negative prompts, and safety guidance. Our findings suggest that, without comprehensive testing, evaluations on limited safe prompting benchmarks can lead to a false sense of safety for text-to-image models.