The field of adversarial robustness has long established that adversarial examples transfer between image classifiers and that text jailbreaks transfer between language models (LMs). However, a pair of recent studies reported being unable to transfer image jailbreaks between vision-language models (VLMs). To explain this striking difference, we propose a fundamental distinction regarding the transferability of attacks against machine learning models: attacks in the shared input data space can transfer, whereas attacks in a model's representation space do not, at least not without geometric alignment of the models' representations. We then provide theoretical and empirical evidence for this hypothesis in four settings. First, we mathematically prove the distinction in a simple setting where two networks compute the same input-output map via different representations. Second, we construct representation-space attacks against image classifiers that are as successful as well-known data-space attacks yet fail to transfer. Third, we construct representation-space attacks against LMs that successfully jailbreak the attacked models but again fail to transfer. Fourth, we construct data-space attacks against VLMs that successfully transfer to new VLMs, and we show that representation-space attacks can transfer when the VLMs' latent geometries are sufficiently aligned in post-projector space. Our work reveals that adversarial transfer is not an inherent property of all attacks but is contingent on their operational domain: the shared data space versus each model's unique representation space. This distinction is a critical insight for building more robust models.
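To make the central distinction concrete, the sketch below contrasts a single optimization step for the two attack families, written against a generic PyTorch classifier. This is not the paper's exact method; `model`, `feature_extractor`, and `target_representation` are hypothetical placeholders, and the step sizes are illustrative defaults.

```python
# Minimal sketch (assumed setup, not the paper's implementation) contrasting
# a data-space attack with a representation-space attack.
import torch
import torch.nn.functional as F

def data_space_attack_step(model, x, y, epsilon=8 / 255, alpha=2 / 255):
    """One FGSM-style step in input (data) space: perturb x to increase
    the classification loss for the true label y."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    # Step along the sign of the input gradient, then project back into
    # the epsilon-ball around the original input and the valid pixel range.
    x_adv = x_adv + alpha * x_adv.grad.sign()
    x_adv = torch.clamp(torch.min(torch.max(x_adv, x - epsilon), x + epsilon), 0, 1)
    return x_adv.detach()

def representation_space_attack_step(feature_extractor, x, target_representation,
                                     epsilon=8 / 255, alpha=2 / 255):
    """One step of a representation-space attack: perturb x so that the
    attacked model's *internal* representation moves toward a chosen target
    vector. The objective lives in that model's own feature space, which is
    why such perturbations need not transfer to a second model whose
    representation geometry is not aligned with the first."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.mse_loss(feature_extractor(x_adv), target_representation)
    loss.backward()
    # Descend toward the target representation, with the same projection.
    x_adv = x_adv - alpha * x_adv.grad.sign()
    x_adv = torch.clamp(torch.min(torch.max(x_adv, x - epsilon), x + epsilon), 0, 1)
    return x_adv.detach()
```

Both steps perturb the same input; the difference is where the objective is defined, which is exactly the operational-domain distinction the abstract describes.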