When Memory Becomes a Vulnerability: Towards Multi-turn Jailbreak Attacks against Text-to-Image Generation Systems

From arXiv. This work proposes a multi-turn jailbreak attack against real-world chat-based T2I generation systems that integrate a memory mechanism. It also constructs a simulation system that incorporates three industrial-grade memory mechanisms and seven kinds of safety filters (covering both input and output). It is set to appear at USENIX 2026.

Modern text-to-image (T2I) generation systems (e.g., DALL·E 3) employ a memory mechanism, which captures key information across multi-turn interactions for faithful generation. Despite its practicality, security analysis of this mechanism has lagged far behind. In this paper, we reveal that it can exacerbate the risk of jailbreak attacks. Previous attacks fuse the unsafe target prompt into a single final adversarial prompt, which is either easily detected or, because of under- or over-detoxification, ends up producing images that are not unsafe. In contrast, we propose embedding the malice in memory at the inception of the chat session, which addresses these limitations. Specifically, we propose Inception, the first multi-turn jailbreak attack against real-world text-to-image generation systems that explicitly exploits their memory mechanisms. Inception is composed of two key modules: segmentation and recursion. Segmentation is a semantic-preserving method that generates multi-round prompts. By leveraging NLP analysis techniques, we design policies that decompose a prompt, together with its malicious intent, according to sentence structure, thereby evading safety filters. Recursion addresses the challenge posed by unsafe sub-prompts that cannot be separated through simple segmentation: it first expands the sub-prompt, then invokes segmentation recursively. To facilitate the crafting of multi-turn adversarial prompts, we build VisionFlow, an emulated T2I system that integrates two-stage safety filters and industrial-grade memory mechanisms. The experimental results show that Inception successfully induces unsafe image generation, surpassing the SOTA by a 20.0% margin in attack success rate. We also conduct experiments on real-world commercial T2I generation platforms, further validating the threat that Inception poses in practice.
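To make the core observation concrete, here is a minimal toy sketch of why a filter that inspects each turn in isolation can miss content that only becomes problematic once memory joins the turns together, and how checking the accumulated context changes the outcome. Everything in it is an assumption made for illustration: `ToyChatSession`, `is_flagged`, and the deliberately benign `BLOCKED_COMBINATION` are hypothetical names, and none of this is the paper's VisionFlow, its memory mechanisms, or its safety filters.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical stand-in for a content policy: text is flagged only when
# every word of this combination co-occurs. The words are benign on purpose.
BLOCKED_COMBINATION = frozenset({"purple", "elephant"})


def is_flagged(text: str) -> bool:
    """Toy input filter: flag text that contains every blocked word."""
    words = set(text.lower().split())
    return BLOCKED_COMBINATION <= words


@dataclass
class ToyChatSession:
    """Toy memory: stores each accepted turn and joins them into one context."""
    check_memory: bool = False  # if True, filter the accumulated context
    memory: List[str] = field(default_factory=list)

    def submit(self, turn: str) -> bool:
        """Return True if the turn is accepted, False if the filter blocks it."""
        if self.check_memory:
            candidate = " ".join(self.memory + [turn])
        else:
            candidate = turn
        if is_flagged(candidate):
            return False
        self.memory.append(turn)
        return True

    def effective_prompt(self) -> str:
        """What a generator conditioned on the memory would effectively see."""
        return " ".join(self.memory)


turns = ["draw an elephant", "make it purple"]

# A single-turn prompt with both words is caught by the toy filter.
single_turn_blocked = is_flagged("draw a purple elephant")

# Per-turn filtering: each turn passes alone, yet the memory holds both words.
per_turn = ToyChatSession(check_memory=False)
per_turn_results = [per_turn.submit(t) for t in turns]
leaked = is_flagged(per_turn.effective_prompt())

# Memory-aware filtering: the second turn is blocked once context is checked.
memory_aware = ToyChatSession(check_memory=True)
memory_aware_results = [memory_aware.submit(t) for t in turns]
```

In this toy setting the per-turn session accepts both turns while the memory-aware session rejects the second one. It only illustrates the gap between per-turn and context-level filtering; it does not model the segmentation policies or the recursion step described in the abstract.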
