The integration of new modalities into frontier AI systems offers exciting capabilities, but also increases the possibility that such systems can be adversarially manipulated in undesirable ways. In this work, we focus on a popular class of vision-language models (VLMs) that generate text outputs conditioned on visual and textual inputs. We conducted a large-scale empirical study to assess the transferability of gradient-based universal image "jailbreaks" using a diverse set of over 40 open-parameter VLMs, including 18 new VLMs that we publicly release. Overall, we find that transferable gradient-based image jailbreaks are extremely difficult to obtain. When an image jailbreak is optimized against a single VLM or against an ensemble of VLMs, it successfully jailbreaks the attacked VLM(s) but exhibits little-to-no transfer to any other VLM; transfer is not affected by whether the attacked and target VLMs share vision backbones or language models, by whether the language model underwent instruction-following and/or safety-alignment training, or by many other factors. Only two settings display partially successful transfer: between identically pretrained and identically initialized VLMs trained on slightly different VLM training data, and between different training checkpoints of a single VLM. Leveraging these results, we then demonstrate that transfer against a specific target VLM can be significantly improved by attacking larger ensembles of "highly similar" VLMs. These results stand in stark contrast to existing evidence of universal and transferable text jailbreaks against language models and transferable adversarial attacks against image classifiers, suggesting that VLMs may be more robust to gradient-based transfer attacks.
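To make the attack setup concrete, the sketch below shows what gradient-based optimization of a universal image jailbreak against an ensemble of VLMs typically looks like in PyTorch. The `vlm.loss(image, prompt, target)` interface, the image shape, the optimizer choice, and all hyperparameters are illustrative assumptions for exposition, not the paper's actual implementation.

```python
# Minimal sketch: optimize one shared image so that every attacked VLM
# assigns high likelihood to a harmful target completion. Hypothetical
# model interface; not the paper's code.
import torch


def optimize_universal_jailbreak(vlms, pairs, steps=500, lr=1e-2):
    """Gradient-descend on pixels of a single image against an ensemble.

    vlms  : list of models, each exposing .loss(image, prompt, target)
            -> scalar cross-entropy of `target` given `image` and
            `prompt` (an assumed interface, not a real library API).
    pairs : list of (harmful_prompt, target_completion) string tuples.
    """
    # Start from uniform gray; pixel values are kept in [0, 1].
    image = torch.full((3, 224, 224), 0.5, requires_grad=True)
    opt = torch.optim.Adam([image], lr=lr)

    for _ in range(steps):
        opt.zero_grad()
        # Average the language-modeling loss of the target completion
        # over every attacked VLM and every (prompt, target) pair, so
        # the image is pushed to jailbreak all of them simultaneously.
        loss = torch.stack([
            vlm.loss(image, prompt, target)
            for vlm in vlms
            for prompt, target in pairs
        ]).mean()
        loss.backward()
        opt.step()
        with torch.no_grad():
            image.clamp_(0.0, 1.0)  # project back to a valid image

    return image.detach()
```

In practice one would batch the forward passes and subsample (prompt, target) pairs at each step, but the essential loop is as shown: a single image updated against the averaged target-likelihood loss of every attacked model, which jailbreaks the models inside the ensemble while, per the findings above, transferring poorly to models outside it.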