Automatic Jailbreaking of the Text-to-Image Generative AI Systems

Recent AI systems built on large language models (LLMs) have shown strong performance, in some cases surpassing humans, on tasks such as information retrieval, language generation, and image generation. At the same time, these systems carry diverse safety risks: circumventing the alignment of an LLM so that it generates malicious content is often referred to as jailbreaking. However, most previous work has focused only on text-based jailbreaking of LLMs, and jailbreaking of text-to-image (T2I) generation systems has been relatively overlooked. In this paper, we first evaluate the safety of commercial T2I generation systems, such as ChatGPT, Copilot, and Gemini, against copyright infringement using naive prompts. In this empirical study, we find that Copilot and Gemini block only 12% and 17% of attacks with naive prompts, respectively, while ChatGPT blocks 84% of them. We then propose a stronger automated jailbreaking pipeline for T2I generation systems, which produces prompts that bypass their safety guards. Our framework uses an LLM optimizer to generate prompts that maximize the degree of violation in the generated images, without any weight updates or gradient computation. Surprisingly, this simple yet effective approach jailbreaks ChatGPT, reducing its block rate to 11.0% and making it generate copyrighted content 76% of the time. Finally, we explore various defense strategies, such as post-generation filtering and machine unlearning techniques, but find them inadequate, which suggests the need for stronger defense mechanisms.

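The abstract reports two per-system figures: a block rate (the fraction of attack prompts the system refuses) and the fraction of attempts that end in copyrighted content. As a minimal sketch of how such rates could be tallied from per-prompt outcomes, the following assumes a three-way labeling (`blocked`, `violation`, `benign`) that is hypothetical and not taken from the paper; the toy outcomes are synthetic and do not reproduce the paper's data.

```python
from collections import Counter
from typing import Dict, Iterable

# Hypothetical outcome labels (an assumption; the paper's annotation scheme may differ).
BLOCKED = "blocked"      # the system refused to generate an image
VIOLATION = "violation"  # an image was generated and judged to infringe
BENIGN = "benign"        # an image was generated and judged not to infringe
VALID_LABELS = {BLOCKED, VIOLATION, BENIGN}


def summarize(outcomes: Iterable[str]) -> Dict[str, float]:
    """Return the block rate and violation rate over a list of per-prompt outcomes."""
    counts = Counter(outcomes)
    unknown = set(counts) - VALID_LABELS
    if unknown:
        raise ValueError(f"unknown outcome labels: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no outcomes to summarize")
    return {
        "block_rate": counts[BLOCKED] / total,
        "violation_rate": counts[VIOLATION] / total,
    }


# Synthetic toy data: 10 attempts, not the paper's measurements.
toy_outcomes = [BLOCKED] * 3 + [VIOLATION] * 5 + [BENIGN] * 2
rates = summarize(toy_outcomes)
print(f"block rate: {rates['block_rate']:.1%}")
print(f"violation rate: {rates['violation_rate']:.1%}")
```

Both rates share the same denominator (all attempts), so under this assumed labeling they sum to at most 100%, with the remainder being generations judged non-infringing.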