The integration of new modalities into frontier AI systems offers exciting capabilities, but also increases the possibility that such systems can be adversarially manipulated in undesirable ways. In this work, we focus on a popular class of vision-language models (VLMs) that generate text outputs conditioned on visual and textual inputs. We conducted a large-scale empirical study to assess the transferability of gradient-based universal image ``jailbreaks'' using a diverse set of over 40 open-parameter VLMs, including 18 new VLMs that we publicly release. Overall, we find that transferable gradient-based image jailbreaks are extremely difficult to obtain. When an image jailbreak is optimized against a single VLM or against an ensemble of VLMs, it reliably jailbreaks the attacked VLM(s), but exhibits little-to-no transfer to any other VLMs; transfer is unaffected by whether the attacked and target VLMs share vision backbones or language models, by whether the language model underwent instruction-following and/or safety-alignment training, or by many other factors. Only two settings display partially successful transfer: between identically-pretrained and identically-initialized VLMs with slightly different VLM training data, and between different training checkpoints of a single VLM. Leveraging these results, we then demonstrate that transfer can be significantly improved against a specific target VLM by attacking larger ensembles of ``highly-similar'' VLMs. These results stand in stark contrast to existing evidence of universal and transferable text jailbreaks against language models and transferable adversarial attacks against image classifiers, suggesting that VLMs may be more robust to gradient-based transfer attacks.
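To make the attack setting concrete, the loop below is a minimal PyTorch sketch of the gradient-based universal image-jailbreak objective described above: a single image is optimized so that, for every attacked VLM and every harmful prompt, the model's likelihood of a target compliant response is maximized. This is an illustrative sketch, not the paper's implementation; the `VLMLoss` interface, the image shape, the step count, and the Adam hyperparameters are all assumptions.

```python
# A minimal sketch of optimizing a universal image jailbreak against an
# ensemble of VLMs. The VLM interface `VLMLoss` is a hypothetical stand-in:
# each callable should return the cross-entropy of the target
# (harmful-compliance) tokens given the image and the prompt.

from typing import Callable, Sequence
import torch

# Hypothetical signature: loss = vlm_loss(image, prompt, target)
VLMLoss = Callable[[torch.Tensor, str, str], torch.Tensor]

def optimize_universal_jailbreak(
    ensemble: Sequence[VLMLoss],                      # one loss fn per attacked VLM
    prompt_target_pairs: Sequence[tuple[str, str]],   # harmful prompts + desired responses
    steps: int = 1000,
    lr: float = 1e-2,
) -> torch.Tensor:
    # Start from a random image in [0, 1]; shape assumes a 3x224x224 RGB input.
    image = torch.rand(3, 224, 224, requires_grad=True)
    opt = torch.optim.Adam([image], lr=lr)

    for _ in range(steps):
        opt.zero_grad()
        # Sum the target-token cross-entropy over all attacked VLMs and all
        # prompt/target pairs, so the same image ("universal") must work for
        # every prompt and every model in the ensemble.
        loss = sum(
            vlm_loss(image, prompt, target)
            for vlm_loss in ensemble
            for prompt, target in prompt_target_pairs
        )
        loss.backward()
        opt.step()
        # Keep the image a valid RGB tensor after each gradient step.
        with torch.no_grad():
            image.clamp_(0.0, 1.0)

    return image.detach()
```

In this framing, attacking a single VLM corresponds to a one-element `ensemble`, and the paper's ``highly-similar''-ensemble setting corresponds to populating `ensemble` with loss functions from models that share initialization and most of their training data.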