Vision-Language (VL) pre-trained models have shown their superiority on many multimodal tasks. However, the adversarial robustness of such models has not been fully explored. Existing approaches mainly study adversarial robustness under the white-box setting, which is unrealistic in practice. In this paper, we investigate a new yet practical task: crafting image and text perturbations with pre-trained VL models to attack black-box fine-tuned models on different downstream tasks. To this end, we propose VLATTACK, which generates adversarial samples by fusing image and text perturbations at both the single-modal and multimodal levels. At the single-modal level, we propose a new block-wise similarity attack (BSA) strategy that learns image perturbations to disrupt universal representations, and we adopt an existing text attack strategy to generate text perturbations independently of the image-modality attack. At the multimodal level, we design a novel iterative cross-search attack (ICSA) method that periodically updates adversarial image-text pairs, starting from the outputs of the single-modal level. We conduct extensive experiments attacking five widely used VL pre-trained models on six tasks. Experimental results show that VLATTACK achieves the highest attack success rates on all tasks compared with state-of-the-art baselines, revealing a blind spot in the deployment of pre-trained VL models. The source code is available at https://github.com/ericyinyzy/VLAttack.
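The abstract does not spell out the BSA objective, so the following is only a rough illustration of the general idea behind a similarity-based feature attack, not the VLATTACK implementation. It assumes that "block-wise similarity" can be approximated by summing the cosine similarity between clean and perturbed features over every encoder block, and that the attack lowers this sum inside an L-infinity budget. The toy tanh encoder, the finite-difference gradients, and all names (`block_features`, `blockwise_similarity`, `similarity_attack`, `eps`, `step`) are hypothetical choices made for this sketch.

```python
import math
import random

def matvec(w, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wi * vi for wi, vi in zip(row, v)) for row in w]

def block_features(x, weights):
    """Toy stand-in for an encoder: returns the output of every block."""
    feats, h = [], x
    for w in weights:
        h = [math.tanh(z) for z in matvec(w, h)]
        feats.append(h)
    return feats

def cosine(a, b):
    """Cosine similarity with a small constant to avoid division by zero."""
    dot = sum(p * q for p, q in zip(a, b))
    na = math.sqrt(sum(p * p for p in a))
    nb = math.sqrt(sum(q * q for q in b))
    return dot / (na * nb + 1e-12)

def blockwise_similarity(clean_feats, adv_feats):
    """Assumed objective: one cosine-similarity term per block, summed."""
    return sum(cosine(c, a) for c, a in zip(clean_feats, adv_feats))

def similarity_attack(x, weights, eps=0.1, step=0.02, iters=20, seed=0):
    """Sign-gradient descent on the summed similarity inside an L-inf ball."""
    rng = random.Random(seed)
    clean = block_features(x, weights)

    def loss(delta):
        x_adv = [xi + di for xi, di in zip(x, delta)]
        return blockwise_similarity(clean, block_features(x_adv, weights))

    # Random start: similarity is maximal at delta = 0, so the gradient vanishes there.
    delta = [rng.uniform(-eps, eps) for _ in x]
    best, best_loss = list(delta), loss(delta)
    h = 1e-4
    for _ in range(iters):
        grad = []
        for i in range(len(delta)):  # central finite differences; toy-sized inputs only
            up, down = list(delta), list(delta)
            up[i] += h
            down[i] -= h
            grad.append((loss(up) - loss(down)) / (2 * h))
        # Step against the gradient sign, then project back into the budget.
        delta = [max(-eps, min(eps, d - step * ((g > 0) - (g < 0))))
                 for d, g in zip(delta, grad)]
        cur = loss(delta)
        if cur < best_loss:  # keep the lowest-similarity perturbation seen so far
            best, best_loss = list(delta), cur
    return [xi + di for xi, di in zip(x, best)], best_loss

# Usage on a random two-block toy encoder.
rng = random.Random(0)
weights = [[[rng.gauss(0, 0.3) for _ in range(8)] for _ in range(6)],
           [[rng.gauss(0, 0.3) for _ in range(6)] for _ in range(4)]]
x = [rng.gauss(0, 1) for _ in range(8)]
clean_feats = block_features(x, weights)
clean_sim = blockwise_similarity(clean_feats, clean_feats)
x_adv, adv_sim = similarity_attack(x, weights)
```

A real attack on a VL model would replace the finite differences with automatic differentiation and the toy encoder with intermediate features of the pre-trained model. The point of the sketch is only that the loss is computed on intermediate representations rather than on a task-specific output, which is what makes such an attack plausible against fine-tuned models whose task heads are unknown.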