Vision-language pre-training (VLP) models demonstrate impressive abilities in processing both images and text. However, they are vulnerable to multi-modal adversarial examples (AEs). Investigating how to generate highly transferable adversarial examples is crucial for uncovering the vulnerabilities of VLP models in practical scenarios. Recent works have shown that leveraging data augmentation and image-text modal interactions can significantly enhance the transferability of adversarial examples for VLP models. However, they do not consider the optimal alignment problem between data-augmented image-text pairs. This oversight leads to adversarial examples that are overly tailored to the source model, which limits improvements in transferability. In this work, we first explore the interplay between image sets produced through data augmentation and their corresponding text sets. We find that augmented image samples can align optimally with certain texts while exhibiting less relevance to others. Motivated by this, we propose an Optimal Transport-based Adversarial Attack, dubbed OT-Attack. The proposed method formulates the features of the image and text sets as two distinct distributions and employs optimal transport theory to determine the most efficient mapping between them. This optimal mapping guides the generation of adversarial examples and thereby mitigates the overfitting to the source model. Extensive experiments across various network architectures and datasets on image-text matching tasks show that OT-Attack outperforms existing state-of-the-art methods in terms of adversarial transferability.
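To make the idea of an optimal mapping between image-set and text-set features concrete, the following is a minimal sketch, not the authors' implementation. It assumes entropic-regularized optimal transport solved with Sinkhorn iterations, a cosine-distance cost, and uniform weights over both sets; none of these choices is stated in the abstract. The function names `sinkhorn_plan` and `ot_alignment_cost`, the regularization strength `eps`, and the random stand-in features are all hypothetical.

```python
import numpy as np


def sinkhorn_plan(cost, eps=0.1, n_iters=200):
    """Entropic-regularized OT plan between two uniform discrete distributions.

    Assumption: uniform marginals and Sinkhorn iterations are illustrative
    choices, not details taken from the abstract.
    """
    n, m = cost.shape
    a = np.full(n, 1.0 / n)  # uniform mass over augmented images
    b = np.full(m, 1.0 / m)  # uniform mass over texts
    K = np.exp(-cost / eps)  # Gibbs kernel
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)      # rescale rows toward marginal a
        v = b / (K.T @ u)    # rescale columns to match marginal b
    return u[:, None] * K * v[None, :]


def ot_alignment_cost(img_feats, txt_feats, eps=0.1, n_iters=200):
    """Return the OT cost and transport plan between two feature sets."""
    # L2-normalize so the dot product is a cosine similarity.
    img = img_feats / np.linalg.norm(img_feats, axis=1, keepdims=True)
    txt = txt_feats / np.linalg.norm(txt_feats, axis=1, keepdims=True)
    cost = 1.0 - img @ txt.T  # cosine distance, values in [0, 2]
    plan = sinkhorn_plan(cost, eps, n_iters)
    return float((plan * cost).sum()), plan


# Random stand-ins for encoder outputs: 4 augmented images, 5 texts, 16-dim.
rng = np.random.default_rng(0)
img_feats = rng.normal(size=(4, 16))
txt_feats = rng.normal(size=(5, 16))
total_cost, plan = ot_alignment_cost(img_feats, txt_feats)
print(plan.shape, total_cost)
```

Each entry `plan[i, j]` indicates how much of augmented image `i` is matched to text `j`, so images contribute more to the texts they align with best. Under this reading, an attack would perturb the image so as to increase the transport cost on the source model, but the exact objective used by OT-Attack is not given in the abstract.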