Contrastive vision-language representation learning has achieved state-of-the-art zero-shot classification performance by learning from millions of image-caption pairs crawled from the internet. However, the massive data that powers large multimodal models such as CLIP makes them extremely vulnerable to various types of adversarial attacks, including targeted and backdoor data poisoning attacks. Despite this vulnerability, robust contrastive vision-language pre-training against adversarial attacks has remained unaddressed. In this work, we propose RoCLIP, the first effective method for robust pre-training and fine-tuning of multimodal vision-language models. RoCLIP effectively breaks the association between poisoned image-caption pairs by considering a pool of random examples and (1) matching every image with the text in the pool that is most similar to its caption, and (2) matching every caption with the image in the pool that is most similar to its image. Our extensive experiments show that our method renders state-of-the-art targeted data poisoning and backdoor attacks ineffective during pre-training or fine-tuning of CLIP. In particular, RoCLIP decreases the poison and backdoor attack success rates to 0% during pre-training and to 1%-4% during fine-tuning, and effectively improves the model's performance.
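The pool-based matching described in the abstract can be illustrated with a minimal sketch. It assumes that embeddings are already L2-normalized, that "most similar" means highest cosine similarity, and that the matched pairs feed a symmetric InfoNCE-style loss; the abstract specifies none of these details. The names `nearest_in_pool`, `info_nce`, and `pool_matched_loss`, and the temperature value, are hypothetical and chosen for illustration only.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit L2 norm so that dot() equals cosine similarity."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def nearest_in_pool(query, pool):
    """Return the pool embedding with the highest cosine similarity to query."""
    return max(pool, key=lambda p: dot(query, p))


def info_nce(anchors, targets, temperature):
    """Mean cross-entropy where anchor i should match target i (assumed loss)."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [dot(anchor, t) / temperature for t in targets]
        peak = max(logits)  # subtract the max for numerical stability
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_norm - logits[i]
    return total / len(anchors)


def pool_matched_loss(images, captions, image_pool, caption_pool,
                      temperature=0.07):
    """Hypothetical loss: pair each example with a pool neighbor of its partner.

    (1) every image is matched with the pool text closest to its caption;
    (2) every caption is matched with the pool image closest to its image.
    """
    caption_nn = [nearest_in_pool(c, caption_pool) for c in captions]
    image_nn = [nearest_in_pool(x, image_pool) for x in images]
    loss_image_side = info_nce(images, caption_nn, temperature)
    loss_caption_side = info_nce(captions, image_nn, temperature)
    return 0.5 * (loss_image_side + loss_caption_side)
```

In this sketch an image is never trained directly against its own caption, only against a pool neighbor of that caption, which is one way to read how the association within a poisoned pair could be broken. How the pool is populated and refreshed is not stated in the abstract and is left out here.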