Text-to-image generative models enable many innovative services but also raise ethical concerns because they can generate unethical images. Most publicly available text-to-image models employ safety filters to prevent the generation of unintended content. In this work, we introduce the Divide-and-Conquer Attack to circumvent the safety filters of state-of-the-art text-to-image models. Our attack uses LLMs as text-transformation agents that create adversarial prompts from sensitive ones. We have developed effective helper prompts that enable LLMs to break down a sensitive drawing prompt into multiple harmless descriptions, which bypass the safety filters while still yielding sensitive images. The latent harmful meaning becomes apparent only when all individual elements are drawn together. Our evaluation demonstrates that the attack successfully circumvents the closed-box safety filter of the SOTA DALLE-3, integrated natively into ChatGPT, to generate unethical images. This approach, which essentially uses LLM-generated adversarial prompts against GPT-4-assisted DALLE-3, is akin to using one's own spear to breach one's own shield. It could have more severe security implications than previous methods based on manual crafting or iterative model querying, and we hope it stimulates more attention to similar efforts. Our code and data are available at: https://github.com/researchcode001/Divide-and-Conquer-Attack