The integration of new modalities into frontier AI systems offers exciting capabilities, but also increases the possibility that such systems can be adversarially manipulated in undesirable ways. In this work, we focus on a popular class of vision-language models (VLMs) that generate text outputs conditioned on visual and textual inputs. We conducted a large-scale empirical study to assess the transferability of gradient-based universal image ``jailbreaks'' using a diverse set of over 40 open-parameter VLMs, including 18 new VLMs that we publicly release. Overall, we find that transferable gradient-based image jailbreaks are extremely difficult to obtain. When an image jailbreak is optimized against a single VLM or against an ensemble of VLMs, it reliably jailbreaks the attacked VLM(s), but exhibits little-to-no transfer to any other VLMs; transfer is unaffected by whether the attacked and target VLMs share vision backbones or language models, by whether the language model underwent instruction-following and/or safety-alignment training, or by many other factors. Only two settings display partially successful transfer: between identically-pretrained and identically-initialized VLMs with slightly different VLM training data, and between different training checkpoints of a single VLM. Leveraging these results, we then demonstrate that transfer can be significantly improved against a specific target VLM by attacking larger ensembles of ``highly-similar'' VLMs. These results stand in stark contrast to existing evidence of universal and transferable text jailbreaks against language models and transferable adversarial attacks against image classifiers, suggesting that VLMs may be more robust to gradient-based transfer attacks.
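To make the attack setting concrete, the loop below is a minimal PyTorch sketch of the gradient-based universal image-jailbreak objective described above: a single image is optimized so that, for every attacked VLM and every harmful prompt, the model's likelihood of a target compliant response is maximized. This is an illustrative sketch, not the paper's implementation; the `VLMLoss` interface, the image shape, the step count, and the Adam hyperparameters are all assumptions.

```python
# A minimal sketch of optimizing a universal image jailbreak against an
# ensemble of VLMs. The VLM interface `VLMLoss` is a hypothetical stand-in:
# each callable should return the cross-entropy of the target
# (harmful-compliance) tokens given the image and the prompt.

from typing import Callable, Sequence
import torch

# Hypothetical signature: loss = vlm_loss(image, prompt, target)
VLMLoss = Callable[[torch.Tensor, str, str], torch.Tensor]

def optimize_universal_jailbreak(
    ensemble: Sequence[VLMLoss],                      # one loss fn per attacked VLM
    prompt_target_pairs: Sequence[tuple[str, str]],   # harmful prompts + desired responses
    steps: int = 1000,
    lr: float = 1e-2,
) -> torch.Tensor:
    # Start from a random image in [0, 1]; shape assumes a 3x224x224 RGB input.
    image = torch.rand(3, 224, 224, requires_grad=True)
    opt = torch.optim.Adam([image], lr=lr)

    for _ in range(steps):
        opt.zero_grad()
        # Sum the target-token cross-entropy over all attacked VLMs and all
        # prompt/target pairs, so the same image ("universal") must work for
        # every prompt and every model in the ensemble.
        loss = sum(
            vlm_loss(image, prompt, target)
            for vlm_loss in ensemble
            for prompt, target in prompt_target_pairs
        )
        loss.backward()
        opt.step()
        # Keep the image a valid RGB tensor after each gradient step.
        with torch.no_grad():
            image.clamp_(0.0, 1.0)

    return image.detach()
```

In this framing, attacking a single VLM corresponds to a one-element `ensemble`, and the paper's ``highly-similar''-ensemble setting corresponds to populating `ensemble` with loss functions from models that share initialization and most of their training data.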