Multimodal large language models (MLLMs) have revolutionized vision-language understanding but remain vulnerable to multimodal jailbreak attacks, where adversaries meticulously craft inputs to elicit harmful or inappropriate responses. We propose UniGuard, a novel multimodal safety guardrail that jointly considers unimodal and cross-modal harmful signals. UniGuard is trained to minimize the likelihood of generating harmful responses on a toxic corpus and can be seamlessly applied to any input prompt during inference with minimal computational cost. Extensive experiments demonstrate the generalizability of UniGuard across multiple modalities, attack strategies, and state-of-the-art MLLMs, including LLaVA, Gemini Pro, GPT-4, MiniGPT-4, and InstructBLIP, thereby broadening the scope of our solution.
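A minimal sketch of this training objective, assuming a toxic corpus $\mathcal{D}$ of image-prompt-harmful-response triples and learnable guardrail parameters $\delta$ (for instance, an additive image perturbation $\delta_{\text{img}}$ and a text suffix $\delta_{\text{txt}}$; these specific forms are illustrative assumptions rather than details stated above), could be written as

\[
\min_{\delta}\;\sum_{(x_{\text{img}},\,x_{\text{txt}},\,y_{\text{harm}})\in\mathcal{D}} \log p_{\theta}\!\left(y_{\text{harm}} \mid x_{\text{img}}\oplus\delta_{\text{img}},\; x_{\text{txt}}\oplus\delta_{\text{txt}}\right),
\]

where $p_{\theta}$ denotes the frozen MLLM and $\oplus$ denotes applying the guardrail to the corresponding modality. Under this reading, inference only attaches the pre-computed $\delta$ to each incoming prompt, which is consistent with the minimal computational cost claimed above.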