Recent AI systems built on large language models (LLMs) have shown remarkably strong, even superhuman, performance on tasks such as information retrieval, language generation, and image generation. At the same time, these systems carry diverse safety risks: malicious content can be generated by circumventing the alignment of LLMs, which is often referred to as jailbreaking. However, most previous work has focused on text-based jailbreaking of LLMs, while jailbreaking of text-to-image (T2I) generation systems has been relatively overlooked. In this paper, we first evaluate the safety of commercial T2I generation systems, such as ChatGPT, Copilot, and Gemini, against copyright infringement with naive prompts. In this empirical study, we find that Copilot and Gemini block only 12\% and 17\% of naive-prompt attacks, respectively, while ChatGPT blocks 84\% of them. We then propose a stronger automated jailbreaking pipeline for T2I generation systems that produces prompts bypassing their safety guards. Our framework leverages an LLM optimizer to generate prompts that maximize the degree of violation in the generated images, without any weight updates or gradient computation. Surprisingly, this simple yet effective approach successfully jailbreaks ChatGPT, reducing its block rate to 11.0\% and making it generate copyrighted content 76\% of the time. Finally, we explore various defense strategies, such as post-generation filtering and machine unlearning, but find them inadequate, which suggests the necessity of stronger defense mechanisms.
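The automated pipeline described above can be sketched as a black-box optimization loop: an optimizer LLM iteratively rewrites the prompt, a judge scores how strongly the resulting image violates copyright, and the best-scoring prompt is kept, all without weight updates or gradients. The sketch below is a minimal illustration of this loop structure, not the paper's actual implementation; `optimizer_llm` and `violation_score` are hypothetical stubs standing in for real LLM and T2I/judge calls.

```python
import random

def optimizer_llm(prompt, feedback_score):
    """Stub for an LLM that rewrites a prompt given score feedback.
    Here it just appends a random style modifier; a real system would
    query an LLM with the prompt history and scores."""
    return prompt + " " + random.choice(["detailed", "cinematic", "iconic", "stylized"])

def violation_score(prompt):
    """Stub judge: returns a score in [0, 1] for the degree of violation
    of the image a T2I system would generate from this prompt.
    Here it is a toy heuristic on prompt length."""
    return min(len(prompt) / 100.0, 1.0)

def jailbreak_search(seed_prompt, iters=10):
    """Gradient-free search: iterate rewrite -> score, keep the best prompt."""
    best_prompt = seed_prompt
    best_score = violation_score(seed_prompt)
    prompt = seed_prompt
    for _ in range(iters):
        prompt = optimizer_llm(prompt, best_score)
        score = violation_score(prompt)
        if score > best_score:
            best_prompt, best_score = prompt, score
    return best_prompt, best_score

if __name__ == "__main__":
    best, score = jailbreak_search("a famous cartoon mouse")
    print(best, score)
```

Because both the candidate generation and the scoring are treated as black boxes, the same loop applies to any closed commercial T2I system that only exposes prompt-in, image-out access.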