Multimodal Large Language Models (MLLMs) achieve strong reasoning and perception capabilities but are increasingly vulnerable to jailbreak attacks. While existing work focuses on explicit attacks, where malicious content resides in a single modality, recent studies reveal implicit attacks, in which benign text and image inputs jointly express unsafe intent. Such joint-modal threats are difficult to detect and remain underexplored, largely due to the scarcity of high-quality implicit data. We propose ImpForge, an automated red-teaming pipeline that leverages reinforcement learning with tailored reward modules to generate diverse implicit samples across 14 domains. Building on this dataset, we further develop CrossGuard, an intent-aware safeguard that provides robust and comprehensive defense against both explicit and implicit threats. Extensive experiments across safe and unsafe benchmarks, implicit and explicit attacks, and multiple out-of-domain settings demonstrate that CrossGuard significantly outperforms existing defenses, including advanced MLLMs and guardrails, achieving stronger security while maintaining high utility. This offers a balanced and practical solution for enhancing MLLM robustness against real-world multimodal threats.