The integration of new modalities into frontier AI systems offers exciting capabilities, but also increases the possibility that such systems can be adversarially manipulated in undesirable ways. In this work, we focus on a popular class of vision-language models (VLMs) that generate text outputs conditioned on visual and textual inputs. We conducted a large-scale empirical study to assess the transferability of gradient-based universal image "jailbreaks" using a diverse set of over 40 open-parameter VLMs, including 18 new VLMs that we publicly release. Overall, we find that transferable gradient-based image jailbreaks are extremely difficult to obtain. When an image jailbreak is optimized against a single VLM or against an ensemble of VLMs, it successfully jailbreaks the attacked VLM(s) but exhibits little-to-no transfer to any other VLM; transfer is not affected by whether the attacked and target VLMs share vision backbones or language models, by whether the language model underwent instruction-following and/or safety-alignment training, or by many other factors. Only two settings display partially successful transfer: between identically pretrained and identically initialized VLMs trained on slightly different VLM training data, and between different training checkpoints of a single VLM. Leveraging these results, we then demonstrate that transfer against a specific target VLM can be significantly improved by attacking larger ensembles of "highly similar" VLMs. These results stand in stark contrast to existing evidence of universal and transferable text jailbreaks against language models and transferable adversarial attacks against image classifiers, suggesting that VLMs may be more robust to gradient-based transfer attacks.
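To make the attack setup concrete, the sketch below shows what gradient-based optimization of a universal image jailbreak against an ensemble of VLMs typically looks like in PyTorch. The `vlm.loss(image, prompt, target)` interface, the image shape, the optimizer choice, and all hyperparameters are illustrative assumptions for exposition, not the paper's actual implementation.

```python
# Minimal sketch: optimize one shared image so that every attacked VLM
# assigns high likelihood to a harmful target completion. Hypothetical
# model interface; not the paper's code.
import torch


def optimize_universal_jailbreak(vlms, pairs, steps=500, lr=1e-2):
    """Gradient-descend on pixels of a single image against an ensemble.

    vlms  : list of models, each exposing .loss(image, prompt, target)
            -> scalar cross-entropy of `target` given `image` and
            `prompt` (an assumed interface, not a real library API).
    pairs : list of (harmful_prompt, target_completion) string tuples.
    """
    # Start from uniform gray; pixel values are kept in [0, 1].
    image = torch.full((3, 224, 224), 0.5, requires_grad=True)
    opt = torch.optim.Adam([image], lr=lr)

    for _ in range(steps):
        opt.zero_grad()
        # Average the language-modeling loss of the target completion
        # over every attacked VLM and every (prompt, target) pair, so
        # the image is pushed to jailbreak all of them simultaneously.
        loss = torch.stack([
            vlm.loss(image, prompt, target)
            for vlm in vlms
            for prompt, target in pairs
        ]).mean()
        loss.backward()
        opt.step()
        with torch.no_grad():
            image.clamp_(0.0, 1.0)  # project back to a valid image

    return image.detach()
```

In practice one would batch the forward passes and subsample (prompt, target) pairs at each step, but the essential loop is as shown: a single image updated against the averaged target-likelihood loss of every attacked model, which jailbreaks the models inside the ensemble while, per the findings above, transferring poorly to models outside it.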