Despite substantial advancements in Vision-Language Pre-training (VLP) models, their susceptibility to adversarial attacks poses a significant challenge. Existing work rarely studies the transferability of attacks on VLP models, resulting in a substantial performance gap from white-box attacks. We observe that prior work overlooks the interaction mechanisms between modalities, which play a crucial role in understanding the intricacies of VLP models. In response, we propose a novel attack, called Collaborative Multimodal Interaction Attack (CMI-Attack), which leverages modality interaction through embedding guidance and interaction enhancement. Specifically, CMI-Attack perturbs text at the embedding level while preserving semantics, and utilizes interaction image gradients to strengthen the constraints on perturbations of texts and images. Notably, on the image-text retrieval task on the Flickr30K dataset, CMI-Attack raises the transfer success rates from ALBEF to TCL, $\text{CLIP}_{\text{ViT}}$, and $\text{CLIP}_{\text{CNN}}$ by 8.11%-16.75% over state-of-the-art methods. Moreover, CMI-Attack also demonstrates superior performance in cross-task generalization scenarios. Our work addresses the underexplored realm of transfer attacks on VLP models, shedding light on the importance of modality interaction for enhanced adversarial robustness.