Vision-language pre-training (VLP) models are vulnerable to adversarial examples, in particular to multimodal adversarial samples crafted by adding imperceptible perturbations to both the original image and the original text. In the black-box setting, however, no prior work has explored the transferability of multimodal adversarial attacks against VLP models. In this work, we take CLIP as the surrogate model and propose a gradient-based multimodal attack method that generates transferable adversarial samples against VLP models. By using gradients to optimize the adversarial image and the adversarial text simultaneously, our method searches for and attacks vulnerable image-text pairs more effectively. To improve the transferability of the attack, we use contrastive learning, both image-text and intra-modal, to capture a more general view of the underlying data distribution and to mitigate overfitting to the surrogate model, so that the generated multimodal adversarial samples transfer better to other VLP models. Extensive experiments validate the effectiveness of the proposed method.
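As a rough illustration of the image side of such an attack, the following sketch runs sign-gradient ascent on an image-text contrastive loss under an L-infinity budget. It is not the paper's implementation. The linear map `W` is a hypothetical stand-in for the surrogate image encoder (CLIP in the paper), `T` holds hypothetical text embeddings, and the function names, the unnormalized dot-product similarity, and the values of `tau`, `eps`, `alpha`, and `steps` are all assumptions. The text perturbation and the intra-modal contrastive term are omitted.

```python
import numpy as np


def contrastive_loss_and_grad(x, W, T, y, tau=0.1):
    """Image-text contrastive (InfoNCE-style) loss and its gradient w.r.t. x.

    x: flattened image, W: toy linear image encoder (assumption),
    T: text embeddings, one row per caption, y: index of the matching caption.
    """
    logits = (T @ (W @ x)) / tau              # similarity of the image to each caption
    shifted = logits - logits.max()           # shift for numerical stability
    log_z = np.log(np.exp(shifted).sum())
    loss = log_z - shifted[y]                 # equals -log softmax(logits)[y]
    p = np.exp(shifted - log_z)               # softmax probabilities
    onehot = np.zeros_like(p)
    onehot[y] = 1.0
    grad = W.T @ (T.T @ (p - onehot)) / tau   # analytic gradient for the linear encoder
    return loss, grad


def pgd_image_attack(x, W, T, y, eps=8 / 255, alpha=2 / 255, steps=10):
    """Push the image away from its caption while staying within eps of x."""
    x_adv = x.copy()
    for _ in range(steps):
        _, grad = contrastive_loss_and_grad(x_adv, W, T, y)
        x_adv = x_adv + alpha * np.sign(grad)      # ascend the contrastive loss
        x_adv = np.clip(x_adv, x - eps, x + eps)   # project back into the L-inf ball
        x_adv = np.clip(x_adv, 0.0, 1.0)           # keep a valid pixel range
    return x_adv
```

With a real surrogate, the analytic gradient would be replaced by automatic differentiation through the encoder. In the transfer setting the abstract describes, the resulting samples would then be evaluated on a different VLP model.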