Multimodal Large Language Models (MLLMs) achieve strong reasoning and perception capabilities but are increasingly vulnerable to jailbreak attacks. While existing work focuses on explicit attacks, where malicious content resides in a single modality, recent studies reveal implicit attacks, in which individually benign text and image inputs jointly express unsafe intent. Such joint-modal threats are difficult to detect and remain underexplored, largely due to the scarcity of high-quality implicit data. We propose ImpForge, an automated red-teaming pipeline that leverages reinforcement learning with tailored reward modules to generate diverse implicit samples across 14 domains. Building on this dataset, we further develop CrossGuard, an intent-aware safeguard providing robust and comprehensive defense against both explicit and implicit threats. Extensive experiments across safe and unsafe benchmarks, implicit and explicit attacks, and multiple out-of-domain settings demonstrate that CrossGuard significantly outperforms existing defenses, including advanced MLLMs and guardrails, achieving stronger security while maintaining high utility. CrossGuard thus offers a balanced and practical solution for enhancing MLLM robustness against real-world multimodal threats. Our code is available at https://github.com/ZhangXu0963/CrossGuard.