BlueSuffix: Reinforced Blue Teaming for Vision-Language Models Against Jailbreak Attacks

Despite their superb multimodal capabilities, Vision-Language Models (VLMs) have been shown to be vulnerable to jailbreak attacks, which are inference-time attacks that induce the model to output harmful responses with tricky prompts. It is thus essential to defend VLMs against potential jailbreaks for their trustworthy deployment in real-world applications. In this work, we focus on black-box defense for VLMs against jailbreak attacks. Existing black-box defense methods are either unimodal or bimodal. Unimodal methods enhance either the vision or language module of the VLM, while bimodal methods robustify the model through text-image representation realignment. However, these methods suffer from two limitations: 1) they fail to fully exploit the cross-modal information, or 2) they degrade the model performance on benign inputs. To address these limitations, we propose a novel blue-team method BlueSuffix that defends the black-box target VLM against jailbreak attacks without compromising its performance. BlueSuffix includes three key components: 1) a visual purifier against jailbreak images, 2) a textual purifier against jailbreak texts, and 3) a blue-team suffix generator fine-tuned via reinforcement learning for enhancing cross-modal robustness. We empirically show on three VLMs (LLaVA, MiniGPT-4, and Gemini) and two safety benchmarks (MM-SafetyBench and RedTeam-2K) that BlueSuffix outperforms the baseline defenses by a significant margin. Our BlueSuffix opens up a promising direction for defending VLMs against jailbreak attacks.
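The three components can be pictured as a wrapper around a black-box model. The sketch below is only an illustration under stated assumptions, not the authors' implementation: the class name `BlueSuffixPipeline`, its field names, and the order of operations are hypothetical. In particular, the abstract does not say what the suffix generator takes as input or where the suffix is placed, so generating it from the purified text and appending it to that text is an assumption.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class BlueSuffixPipeline:
    """Hypothetical wrapper; every field is a placeholder callable."""

    purify_image: Callable[[bytes], bytes]    # stand-in for the visual purifier
    purify_text: Callable[[str], str]         # stand-in for the textual purifier
    generate_suffix: Callable[[str], str]     # stand-in for the RL-fine-tuned suffix generator
    target_vlm: Callable[[bytes, str], str]   # black-box model, queried only through its inputs

    def defend(self, image: bytes, prompt: str) -> str:
        # Purify each modality independently.
        clean_image = self.purify_image(image)
        clean_prompt = self.purify_text(prompt)
        # Assumption: the suffix is conditioned on the purified prompt
        # and appended to it before querying the target model.
        suffix = self.generate_suffix(clean_prompt)
        return self.target_vlm(clean_image, f"{clean_prompt} {suffix}")
```

Because the target model is reached only through `target_vlm`, the wrapper needs no access to its weights or gradients, which is what the black-box setting requires.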

