Multimodal large language models (MLLMs) have become integral to a wide range of real-world applications by jointly reasoning over text and visual inputs. However, despite recent advances in safety alignment, MLLMs remain vulnerable to jailbreak attacks, in which carefully crafted inputs bypass safety mechanisms and elicit harmful responses. In this work, we investigate the security vulnerabilities of MLLMs in text-vision scenarios and propose a novel black-box jailbreak framework, PolyJailbreak. We first identify a phenomenon, termed multimodal safety asymmetry, in which visual alignment imposes uneven safety constraints across modalities and weakens overall robustness. Analyzing attention dynamics and latent representations in MLLMs, we show that visual inputs can disrupt cross-modal information flow and reduce the model's ability to separate benign from malicious intents. Building on these findings, PolyJailbreak organizes the discovered vulnerabilities into a structured library of reusable Atomic Strategy Primitives that enable step-wise transformations from harmful intents into effective jailbreak inputs. Guided by these primitives, a reinforcement learning-based multi-agent optimization process automatically adapts attacks to the target model without access to its internal parameters. Extensive experiments across a wide range of MLLMs demonstrate that PolyJailbreak consistently outperforms state-of-the-art jailbreak baselines, improving the attack success rate by 18.15% on average and exceeding 95% on commercial black-box models, including GPT-4o and Gemini.