Text-to-image diffusion models, such as Stable Diffusion (SD), have recently shown remarkable ability in high-quality content generation and have become representative of the recent wave of transformative AI. Nevertheless, this advance comes with intensifying concern about the misuse of such generative technology, especially for producing copyrighted or NSFW (i.e., not-safe-for-work) images. Although efforts have been made to filter inappropriate images/prompts or to remove undesirable concepts/styles via model fine-tuning, the reliability of these safety mechanisms against diverse problematic prompts remains largely unexplored. In this work, we propose Prompting4Debugging (P4D) as a debugging and red-teaming tool that automatically finds problematic prompts for diffusion models to test the reliability of a deployed safety mechanism. We demonstrate the efficacy of our P4D tool in uncovering new vulnerabilities of SD models with safety mechanisms. In particular, our results show that around half of the prompts in existing safe prompting benchmarks that were originally considered "safe" can actually be manipulated to bypass many deployed safety mechanisms, including concept removal, negative prompts, and safety guidance. Our findings suggest that, without comprehensive testing, evaluations on limited safe prompting benchmarks can lead to a false sense of safety for text-to-image models.
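As a rough illustration of the kind of automated prompt search described above, the minimal sketch below optimizes a continuous prompt embedding so that a safety-enabled noise predictor reproduces an unconstrained model's prediction for a target concept, projecting onto discrete vocabulary tokens at each step. The two "models" are toy stand-ins (random linear layers), and names such as `eps_unconstrained`, `eps_safe`, and `vocab_embeddings` are illustrative assumptions for exposition, not the paper's actual implementation.

```python
# Toy sketch of a P4D-style prompt search loop (assumed setup, not the authors' code).
import torch
import torch.nn.functional as F

torch.manual_seed(0)
VOCAB, DIM, TOKENS = 1000, 64, 8

# Stand-ins for the frozen noise predictors of an unconstrained SD model
# and a safety-enabled one; both are kept fixed during the search.
eps_unconstrained = torch.nn.Linear(DIM, DIM).requires_grad_(False)
eps_safe = torch.nn.Linear(DIM, DIM).requires_grad_(False)

vocab_embeddings = torch.randn(VOCAB, DIM)  # frozen token-embedding table (assumed)
target = torch.randn(TOKENS, DIM)           # embedding of the target (unsafe) prompt

# Continuous prompt embedding to be optimized.
soft_prompt = torch.randn(TOKENS, DIM, requires_grad=True)
opt = torch.optim.Adam([soft_prompt], lr=1e-2)

def project(emb):
    """Snap each soft token to its nearest neighbor in the vocabulary."""
    idx = torch.cdist(emb, vocab_embeddings).argmin(dim=-1)
    return vocab_embeddings[idx]

for step in range(500):
    # Straight-through projection: forward with discrete tokens,
    # backpropagate to the soft embedding.
    hard = soft_prompt + (project(soft_prompt) - soft_prompt).detach()
    # Make the safe model's prediction under our prompt match the
    # unconstrained model's prediction under the target prompt.
    loss = F.mse_loss(eps_safe(hard), eps_unconstrained(target))
    opt.zero_grad()
    loss.backward()
    opt.step()

# The recovered discrete token ids form the candidate problematic prompt.
problem_tokens = torch.cdist(soft_prompt, vocab_embeddings).argmin(dim=-1)
```

In the real setting, the frozen predictors would be the U-Nets of the original and safety-modified diffusion models, and the recovered token ids would be decoded back into text and verified by actually generating images.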