Vision-Language (VL) pre-trained models have demonstrated superior performance on many multimodal tasks. However, the adversarial robustness of such models has not been fully explored. Existing approaches mainly study adversarial robustness under the white-box setting, which is unrealistic in practice. In this paper, we investigate a new yet practical task: crafting image and text perturbations with pre-trained VL models to attack black-box fine-tuned models on different downstream tasks. To this end, we propose VLAttack, which generates adversarial samples by fusing image and text perturbations at both the single-modal and multimodal levels. At the single-modal level, we propose a new block-wise similarity attack (BSA) strategy that learns image perturbations to disrupt universal representations. In addition, we adopt an existing text attack strategy to generate text perturbations independently of the image-modal attack. At the multimodal level, we design a novel iterative cross-search attack (ICSA) method that periodically updates adversarial image-text pairs, starting from the outputs of the single-modal level. We conduct extensive experiments attacking three widely used pre-trained VL models on six tasks across eight datasets. Experimental results show that the proposed VLAttack framework achieves the highest attack success rates on all tasks compared with state-of-the-art baselines, which reveals a significant blind spot in the deployment of pre-trained VL models. Code will be released soon.
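To make the idea of a block-wise similarity attack more concrete, here is a minimal sketch under several loudly stated assumptions. The abstract does not specify the BSA loss, the encoder, or the optimizer, so everything below is hypothetical: the toy encoder (`encode_blocks`, a stack of linear maps followed by tanh), the loss (the sum of cosine similarities between clean and perturbed features at every block), and the update rule (signed steps projected into an L-infinity ball, with finite differences standing in for backpropagation so the example needs only the standard library). It is not the authors' implementation.

```python
import math
import random


def encode_blocks(weights, x):
    """Return the feature vector produced after each block of a toy encoder."""
    feats, h = [], x
    for w in weights:
        # Assumption: one "block" is a linear map followed by tanh.
        h = [math.tanh(sum(wi * hi for wi, hi in zip(row, h))) for row in w]
        feats.append(h)
    return feats


def cosine(a, b):
    """Cosine similarity with a small constant to avoid division by zero."""
    dot = sum(p * q for p, q in zip(a, b))
    na = math.sqrt(sum(p * p for p in a))
    nb = math.sqrt(sum(q * q for q in b))
    return dot / (na * nb + 1e-12)


def blockwise_attack(weights, clean, eps=0.1, step=0.02, iters=20, h=1e-4, seed=0):
    """Search for a perturbation within an L-inf ball of radius eps that
    lowers the summed per-block cosine similarity to the clean features."""
    rng = random.Random(seed)
    clean_feats = encode_blocks(weights, clean)

    def loss(x):
        # Hypothetical block-wise loss: similarity summed over all blocks.
        return sum(cosine(fc, fa)
                   for fc, fa in zip(clean_feats, encode_blocks(weights, x)))

    # Random start: at x == clean the similarity is maximal and its gradient vanishes.
    adv = [c + rng.uniform(-eps, eps) for c in clean]
    best, best_loss = adv[:], loss(adv)
    for _ in range(iters):
        base = loss(adv)
        grad = []
        for i in range(len(adv)):
            # Forward finite difference (a stand-in for backpropagation).
            bumped = adv[:]
            bumped[i] += h
            grad.append((loss(bumped) - base) / h)
        # Signed descent step on the similarity, then projection into the eps-ball.
        adv = [a - step * ((g > 0) - (g < 0)) for a, g in zip(adv, grad)]
        adv = [min(max(a, c - eps), c + eps) for a, c in zip(adv, clean)]
        current = loss(adv)
        if current < best_loss:
            best, best_loss = adv[:], current
    return best, best_loss


# Toy usage: three 4x4 blocks and a 4-dimensional "image".
rng = random.Random(1)
weights = [[[rng.uniform(-1, 1) for _ in range(4)] for _ in range(4)]
           for _ in range(3)]
clean = [0.5, -0.3, 0.8, 0.1]
adv, sim = blockwise_attack(weights, clean)
print("summed block similarity after attack:", sim)
```

Because the loss touches every block rather than only the final output, the perturbation is pushed to distort intermediate features as well, which is one plausible reading of "disrupting universal representations". A real attack would use automatic differentiation on an actual pre-trained VL encoder and would also clip pixels to the valid image range, which this sketch omits.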