Text-to-image (T2I) models commonly incorporate defense mechanisms to prevent the generation of sensitive images. Unfortunately, recent jailbreak attacks have shown that adversarial prompts can effectively bypass these mechanisms and induce T2I models to produce sensitive content, revealing critical safety vulnerabilities. However, existing attack methods implicitly assume that the attacker knows the type of deployed defenses, which limits their effectiveness against unknown or diverse defense mechanisms. In this work, we reveal an underexplored vulnerability of T2I models to a metaphor-based jailbreak attack (MJA), which targets diverse defense mechanisms without prior knowledge of their type by generating metaphor-based adversarial prompts. Specifically, MJA consists of two modules: an LLM-based multi-agent generation module (LMAG) and an adversarial prompt optimization module (APO). LMAG decomposes the generation of metaphor-based adversarial prompts into three subtasks: metaphor retrieval, context matching, and adversarial prompt generation. It then coordinates three LLM-based agents, one per subtask, to generate diverse adversarial prompts by exploring various metaphors and contexts. To enhance attack efficiency, APO first trains a surrogate model to predict the attack results of adversarial prompts and then designs an acquisition strategy to adaptively identify optimal adversarial prompts. Extensive experiments on T2I models with various external and internal defense mechanisms demonstrate that MJA achieves stronger attack performance with fewer queries than six baseline methods. Additionally, we provide an in-depth vulnerability analysis suggesting that metaphor-based adversarial prompts evade safety mechanisms by inducing semantic ambiguity, while sensitive images arise from the model's probabilistic interpretation of concealed semantics.