Text-to-image models may generate harmful content, such as pornographic images, particularly when unsafe prompts are submitted. To mitigate this, safety filters are often layered on top of text-to-image models, or the models themselves are aligned to suppress harmful outputs. However, these defenses remain vulnerable to attackers who strategically craft adversarial prompts to bypass the safety guardrails. In this work, we propose \alg, a method for jailbreaking text-to-image models protected by safety guardrails using a fine-tuned large language model. Unlike query-based jailbreak attacks, which require repeated queries to the target model, our attack generates adversarial prompts efficiently once our AttackLLM has been fine-tuned. We evaluate our method on three datasets of unsafe prompts and against five safety guardrails. Our results demonstrate that our approach effectively bypasses safety guardrails, outperforms existing no-box attacks, and also facilitates other query-based attacks.
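To make the no-box attack flow concrete, the following is a minimal Python sketch, not the paper's implementation: the checkpoint name \texttt{attack-llm-checkpoint} is hypothetical, and a Stable Diffusion pipeline with its built-in safety checker stands in for an arbitrary guarded text-to-image target. The key property it illustrates is that the adversarial prompt is produced by a single forward pass of the fine-tuned attacker model, with no feedback loop against the target.

\begin{verbatim}
import torch
from transformers import pipeline
from diffusers import StableDiffusionPipeline

# Hypothetical fine-tuned AttackLLM that rewrites an unsafe prompt
# into an adversarial prompt offline (no queries to the target).
attacker = pipeline("text-generation", model="attack-llm-checkpoint")

# Target text-to-image model; its default safety checker serves as
# the safety guardrail in this sketch.
target = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

unsafe_prompt = "..."  # an unsafe prompt from an evaluation dataset

# One generation step yields the adversarial prompt; unlike
# query-based attacks, no repeated querying of `target` is needed.
adv_prompt = attacker(
    unsafe_prompt, max_new_tokens=77
)[0]["generated_text"]

# A single query to the guarded target with the adversarial prompt.
image = target(adv_prompt).images[0]
image.save("output.png")
\end{verbatim}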