Every Picture Tells a Dangerous Story: Memory-Augmented Multi-Agent Jailbreak Attacks on VLMs

The rapid evolution of Vision-Language Models (VLMs) has brought substantial new capabilities, but each added modality also widens the adversarial attack surface. Current multimodal jailbreak strategies focus mainly on surface-level pixel perturbations, typographic attacks, or harmful images, and do not engage with the complex semantic structures intrinsic to visual data. As a result, the semantic attack surface of original, natural images remains largely unscrutinized. To expose these semantic vulnerabilities, we introduce \textbf{MemJack}, a \textbf{MEM}ory-augmented multi-agent \textbf{JA}ilbreak atta\textbf{CK} framework that explicitly leverages visual semantics to orchestrate automated jailbreak attacks. MemJack uses coordinated multi-agent cooperation to dynamically map visual entities to malicious intents, generate adversarial prompts via multi-angle visual-semantic camouflage, and apply an Iterative Nullspace Projection (INLP) geometric filter to bypass premature latent-space refusals. By accumulating successful strategies in a persistent Multimodal Experience Memory and transferring them across images, MemJack sustains coherent, extended multi-turn jailbreak interactions and improves the attack success rate (ASR) on new images. Extensive evaluations on full, unmodified COCO val2017 images show that MemJack achieves a 71.48\% ASR against Qwen3-VL-Plus, rising to 90\% under extended budgets. To support future defensive alignment research, we will release \textbf{MemJack-Bench}, a dataset of over 113,000 interactive multimodal jailbreak attack trajectories, as a foundation for developing inherently robust VLMs.
