Despite substantial advances in Vision-Language Pre-training (VLP) models, their susceptibility to adversarial attacks remains a significant challenge. Existing work rarely studies the transferability of attacks on VLP models, and transfer attacks consequently lag far behind white-box attacks in performance. We observe that prior work overlooks the interaction mechanisms between modalities, which play a crucial role in understanding the intricacies of VLP models. In response, we propose a novel attack, called Collaborative Multimodal Interaction Attack (CMI-Attack), which leverages modality interaction through embedding guidance and interaction enhancement. Specifically, CMI-Attack attacks text at the embedding level while preserving semantics, and it uses interaction image gradients to strengthen the constraints on perturbations of both texts and images. Notably, in the image-text retrieval task on the Flickr30K dataset, CMI-Attack raises the transfer success rates from ALBEF to TCL, $\text{CLIP}_{\text{ViT}}$, and $\text{CLIP}_{\text{CNN}}$ by 8.11%-16.75% over state-of-the-art methods. CMI-Attack also demonstrates superior performance in cross-task generalization scenarios. Our work addresses the underexplored area of transfer attacks on VLP models and sheds light on the importance of modality interaction for enhancing adversarial robustness.