Despite their strong multimodal capabilities, Vision-Language Models (VLMs) have been shown to be vulnerable to jailbreak attacks: inference-time attacks that use carefully crafted prompts to induce the model to output harmful responses. Defending VLMs against jailbreaks is therefore essential for their trustworthy deployment in real-world applications. In this work, we focus on black-box defense for VLMs against jailbreak attacks. Existing black-box defense methods are either unimodal or bimodal. Unimodal methods enhance either the vision or the language module of the VLM, while bimodal methods robustify the model through text-image representation realignment. However, these methods suffer from two limitations: 1) they fail to fully exploit cross-modal information, or 2) they degrade model performance on benign inputs. To address these limitations, we propose BlueSuffix, a novel blue-team method that defends a black-box target VLM against jailbreak attacks without compromising its performance. BlueSuffix includes three key components: 1) a visual purifier against jailbreak images, 2) a textual purifier against jailbreak texts, and 3) a blue-team suffix generator fine-tuned via reinforcement learning to enhance cross-modal robustness. We empirically show on three VLMs (LLaVA, MiniGPT-4, and Gemini) and two safety benchmarks (MM-SafetyBench and RedTeam-2K) that BlueSuffix outperforms the baseline defenses by a significant margin. BlueSuffix opens up a promising direction for defending VLMs against jailbreak attacks.
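To make the three-component design concrete, the following is a minimal sketch of how such a defense could wrap a black-box VLM. It is not the authors' implementation: the class `SuffixDefense`, its method names, and the toy stand-in functions are all hypothetical, and the choice to condition the suffix on the purified prompt is an assumption, since the abstract does not specify the generator's inputs.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class SuffixDefense:
    """Hypothetical wrapper; each callable stands in for one component."""

    purify_image: Callable[[bytes], bytes]  # visual purifier (placeholder)
    purify_text: Callable[[str], str]       # textual purifier (placeholder)
    generate_suffix: Callable[[str], str]   # blue-team suffix generator (placeholder)

    def prepare(self, image: bytes, prompt: str) -> Tuple[bytes, str]:
        """Purify both modalities, then append a defensive suffix to the text."""
        clean_image = self.purify_image(image)
        clean_prompt = self.purify_text(prompt)
        # Assumption: the suffix is generated from the purified prompt.
        suffix = self.generate_suffix(clean_prompt)
        return clean_image, f"{clean_prompt} {suffix}"

    def query(self, target_vlm: Callable[[bytes, str], str],
              image: bytes, prompt: str) -> str:
        """Call the target model only through its input/output interface."""
        return target_vlm(*self.prepare(image, prompt))


# Toy stand-ins so the sketch runs; real components would be learned models.
defense = SuffixDefense(
    purify_image=lambda img: img,                     # identity "purifier"
    purify_text=lambda text: " ".join(text.split()),  # whitespace cleanup only
    generate_suffix=lambda text: "Please answer safely.",
)


def echo_vlm(image: bytes, prompt: str) -> str:
    """Fake black-box VLM that reports what it received."""
    return f"[{len(image)} bytes] {prompt}"


response = defense.query(echo_vlm, b"\x00\x01", "  Describe   this image. ")
print(response)
```

Because the wrapper only transforms the inputs and never touches the target model's weights or gradients, it matches the black-box setting described above.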