Divide-and-Conquer Attack: Harnessing the Power of LLM to Bypass the Censorship of Text-to-Image Generation Model

Text-to-image generative models offer many innovative services but also raise ethical concerns because they can generate unethical images. Most publicly available text-to-image models employ safety filters to block such unintended generations. In this work, we introduce the Divide-and-Conquer Attack to circumvent the safety filters of state-of-the-art text-to-image models. Our attack leverages LLMs as agents for text transformation, creating adversarial prompts from sensitive ones. We have developed effective helper prompts that enable LLMs to break down a sensitive drawing prompt into multiple harmless descriptions, which bypass safety filters while still yielding a sensitive image. The latent harmful meaning becomes apparent only when all the individual elements are drawn together. Our evaluation demonstrates that the attack successfully circumvents the closed-box safety filter of the SOTA DALLE-3, integrated natively into ChatGPT, to generate unethical images. This approach, which essentially uses LLM-generated adversarial prompts against GPT-4-assisted DALLE-3, is akin to using one's own spear to breach one's own shield. It could have more severe security implications than previous methods based on manual crafting or iterative model querying, and we hope it stimulates more attention to similar efforts. Our code and data are available at: https://github.com/researchcode001/Divide-and-Conquer-Attack

