Metaphor-based Jailbreak Attacks on Text-to-Image Models

Text-to-image (T2I) models commonly incorporate defense mechanisms to prevent the generation of sensitive images. Unfortunately, recent jailbreak attacks have shown that adversarial prompts can effectively bypass these mechanisms and induce T2I models to produce sensitive content, revealing critical safety vulnerabilities. However, existing attack methods implicitly assume that the attacker knows the type of deployed defenses, which limits their effectiveness against unknown or diverse defense mechanisms. In this work, we reveal an underexplored vulnerability of T2I models to the metaphor-based jailbreak attack (MJA), which targets diverse defense mechanisms without prior knowledge of their type by generating metaphor-based adversarial prompts. Specifically, MJA consists of two modules: an LLM-based multi-agent generation module (LMAG) and an adversarial prompt optimization module (APO). LMAG decomposes the generation of metaphor-based adversarial prompts into three subtasks: metaphor retrieval, context matching, and adversarial prompt generation. It then coordinates three LLM-based agents, one per subtask, to generate diverse adversarial prompts by exploring various metaphors and contexts. To improve attack efficiency, APO first trains a surrogate model to predict the attack results of adversarial prompts and then designs an acquisition strategy to adaptively identify optimal adversarial prompts. Extensive experiments on T2I models with various external and internal defense mechanisms demonstrate that MJA achieves stronger attack performance with fewer queries than six baseline methods. Additionally, we provide an in-depth vulnerability analysis suggesting that metaphor-based adversarial prompts evade safety mechanisms by inducing semantic ambiguity, while sensitive images arise from the model's probabilistic interpretation of the concealed semantics.

