Vision-language pre-training (VLP) models have been shown to be vulnerable to adversarial examples in multimodal tasks. Moreover, adversarial examples can be deliberately transferred to attack other black-box models. However, existing work has mainly focused on white-box attacks. In this paper, we present the first study of the adversarial transferability of recent VLP models. We observe that existing methods exhibit much lower transferability than their strong attack performance in white-box settings would suggest. This degradation in transferability is partly caused by the under-utilization of cross-modal interactions. In particular, unlike unimodal learning, VLP models rely heavily on cross-modal interactions, and multimodal alignments are many-to-many; e.g., an image can be described in natural language in various ways. To this end, we propose a highly transferable Set-level Guidance Attack (SGA) that thoroughly leverages modality interactions and incorporates alignment-preserving augmentation with cross-modal guidance. Experimental results demonstrate that SGA generates adversarial examples that transfer strongly across different VLP models on multiple downstream vision-language tasks. On image-text retrieval, SGA improves the attack success rate of transfer attacks from ALBEF to TCL by a large margin (at least 9.78% and up to 30.21%) compared to the state of the art.