Vision-Language (VL) pre-trained models have demonstrated superior performance on many multimodal tasks. However, their adversarial robustness has not been fully explored. Existing approaches mainly study adversarial robustness under the white-box setting, which is unrealistic in practice. In this paper, we investigate a new yet practical task: crafting image and text perturbations with pre-trained VL models to attack black-box fine-tuned models on different downstream tasks. To this end, we propose VLAttack, which generates adversarial samples by fusing image and text perturbations at both the single-modal and multimodal levels. At the single-modal level, we propose a new block-wise similarity attack (BSA) strategy that learns image perturbations to disrupt universal representations. In addition, we adopt an existing text attack strategy to generate text perturbations independently of the image-modal attack. At the multimodal level, we design a novel iterative cross-search attack (ICSA) method that periodically updates adversarial image-text pairs, starting from the outputs of the single-modal level. We conduct extensive experiments attacking three widely used pre-trained VL models for six tasks on eight datasets. Experimental results show that the proposed VLAttack framework achieves the highest attack success rates on all tasks compared with state-of-the-art baselines, revealing a significant blind spot in the deployment of pre-trained VL models. Code will be released soon.