Diffusion models have recently achieved remarkable advances in image quality and fidelity to textual prompts. Concurrently, the safety of such generative models has become an area of growing concern. This work introduces a novel type of jailbreak that triggers text-to-image (T2I) models to generate images containing visual text, where the image and the text, although safe in isolation, combine to form unsafe content. To systematically explore this phenomenon, we propose a dataset for evaluating current diffusion-based T2I models under this jailbreak. We benchmark nine representative T2I models, including two closed-source commercial models. Experimental results reveal a concerning tendency to produce unsafe content: all tested models are vulnerable to this type of jailbreak, with unsafe generation rates ranging from 8\% to 74\%. In real-world deployments, various filters, such as keyword blocklists, customized prompt filters, and NSFW image filters, are commonly employed to mitigate these risks. We evaluate the effectiveness of such filters against our jailbreak and find that, while current classifiers may be effective for single-modality detection, they fail against our jailbreak. Our work provides a foundation for further development of more secure and reliable T2I models.
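To make the failure mode concrete, the following is a minimal illustrative sketch (not the paper's implementation) of the kind of per-modality moderation pipeline described above. The blocklist, the threshold, and the `nsfw_image_score` callable are all hypothetical stand-ins for a deployed keyword filter and an off-the-shelf NSFW image classifier; the point is only that each check inspects one modality in isolation.

```python
# Hedged sketch of a per-modality moderation pipeline.
# BLOCKLIST, threshold, and nsfw_image_score are hypothetical placeholders.
from typing import Callable

BLOCKLIST = {"nude", "gore", "bomb"}  # hypothetical keyword blocklist


def prompt_is_safe(prompt: str) -> bool:
    # Keyword blocklist: flags the prompt only if a blocked token appears.
    tokens = prompt.lower().split()
    return not any(tok in BLOCKLIST for tok in tokens)


def image_is_safe(image: object,
                  nsfw_image_score: Callable[[object], float],
                  threshold: float = 0.5) -> bool:
    # NSFW image filter: scores the visual content alone, with no access
    # to the meaning of any text the model has rendered inside the image.
    return nsfw_image_score(image) < threshold


def moderate(prompt: str, image: object,
             nsfw_image_score: Callable[[object], float]) -> bool:
    # Each modality is checked in isolation. A benign-looking prompt that
    # yields a benign-looking image with innocuous visual text passes both
    # checks, even when image and embedded text combine into unsafe content.
    return prompt_is_safe(prompt) and image_is_safe(image, nsfw_image_score)
```

Under this design, neither check ever sees the combined (image + rendered text) semantics, which is precisely the gap the proposed jailbreak exploits.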