Multimodal large language models (MLLMs) have become integral to a wide range of real-world applications by jointly reasoning over text and visual inputs. However, despite recent advances in safety alignment, MLLMs remain vulnerable to jailbreak attacks, where carefully crafted inputs can bypass safety mechanisms and elicit harmful responses. In this work, we investigate the security vulnerabilities of MLLMs in text-vision scenarios and propose a novel black-box jailbreak framework, named PolyJailbreak. We first identify a phenomenon, termed multimodal safety asymmetry, where visual alignment introduces uneven safety constraints across modalities and weakens overall robustness. We analyze attention dynamics and latent representations in MLLMs, revealing that visual inputs can disrupt cross-modal information flow and reduce the model's ability to separate benign and malicious intents. Motivated by these findings, PolyJailbreak organizes the discovered vulnerabilities into a structured library of reusable Atomic Strategy Primitives that enable step-wise transformations from harmful intents to effective jailbreak inputs. Guided by these primitives, a reinforcement learning-based multi-agent optimization process automatically adapts attacks to the target model without access to its internal parameters. Extensive experiments on a wide range of MLLMs demonstrate that PolyJailbreak consistently outperforms state-of-the-art jailbreak baselines, improving the average attack success rate by 18.15% and exceeding 95% on commercial black-box models, including GPT-4o and Gemini.