Diffusion models have recently achieved remarkable advances in image quality and fidelity to textual prompts. Concurrently, the safety of such generative models has become an area of growing concern. This work introduces a novel type of jailbreak that triggers text-to-image (T2I) models to generate images containing visual text, where the image and the text, although safe in isolation, combine to form unsafe content. To systematically explore this phenomenon, we propose a dataset for evaluating current diffusion-based T2I models under this jailbreak. We benchmark nine representative T2I models, including two closed-source commercial models. Experimental results reveal a concerning tendency to produce unsafe content: all tested models are vulnerable to this type of jailbreak, with unsafe generation rates ranging from 8\% to 74\%. In real-world deployments, various filters, such as keyword blocklists, customized prompt filters, and NSFW image filters, are commonly employed to mitigate these risks. We evaluate the effectiveness of such filters against our jailbreak and find that, while current classifiers may be effective for single-modality detection, they fail against our jailbreak. Our work provides a foundation for further development of more secure and reliable T2I models.
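To make the failure mode concrete, the following is a minimal illustrative sketch (not the paper's implementation) of the kind of per-modality moderation pipeline described above. The blocklist, the threshold, and the `nsfw_image_score` callable are all hypothetical stand-ins for a deployed keyword filter and an off-the-shelf NSFW image classifier; the point is only that each check inspects one modality in isolation.

```python
# Hedged sketch of a per-modality moderation pipeline.
# BLOCKLIST, threshold, and nsfw_image_score are hypothetical placeholders.
from typing import Callable

BLOCKLIST = {"nude", "gore", "bomb"}  # hypothetical keyword blocklist


def prompt_is_safe(prompt: str) -> bool:
    # Keyword blocklist: flags the prompt only if a blocked token appears.
    tokens = prompt.lower().split()
    return not any(tok in BLOCKLIST for tok in tokens)


def image_is_safe(image: object,
                  nsfw_image_score: Callable[[object], float],
                  threshold: float = 0.5) -> bool:
    # NSFW image filter: scores the visual content alone, with no access
    # to the meaning of any text the model has rendered inside the image.
    return nsfw_image_score(image) < threshold


def moderate(prompt: str, image: object,
             nsfw_image_score: Callable[[object], float]) -> bool:
    # Each modality is checked in isolation. A benign-looking prompt that
    # yields a benign-looking image with innocuous visual text passes both
    # checks, even when image and embedded text combine into unsafe content.
    return prompt_is_safe(prompt) and image_is_safe(image, nsfw_image_score)
```

Under this design, neither check ever sees the combined (image + rendered text) semantics, which is precisely the gap the proposed jailbreak exploits.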