Multimodal contrastive pretraining has been used to train multimodal representation models, such as CLIP, on large amounts of paired image-text data. However, previous studies have revealed that such models are vulnerable to backdoor attacks. Specifically, when trained on backdoored examples, CLIP learns spurious correlations between the embedded backdoor trigger and the target label, aligning their representations in the joint embedding space. Injecting even a small number of poisoned examples, such as 75 out of 3 million pretraining examples, can significantly manipulate the model's behavior, and such correlations are difficult to detect or unlearn. To address this issue, we propose CleanCLIP, a finetuning framework that weakens the spurious associations introduced by backdoor attacks by independently re-aligning the representations of the individual modalities. We demonstrate that unsupervised finetuning with a combination of a multimodal contrastive objective and unimodal self-supervised objectives for the individual modalities can significantly reduce the impact of the backdoor attack. Additionally, we show that supervised finetuning on task-specific labeled image data removes the backdoor trigger from the CLIP vision encoder. We show empirically that CleanCLIP maintains model performance on benign examples while erasing a range of backdoor attacks on multimodal contrastive learning. The code and checkpoints are available at https://github.com/nishadsinghi/CleanCLIP.
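As an illustration of the finetuning objective described in the abstract, the following is a minimal sketch of a loss that combines a multimodal contrastive term with unimodal self-supervised terms for each modality. The abstract does not specify the exact form of these objectives, so everything here is an assumption made for illustration: the InfoNCE formulation, the use of augmented views (`img_aug`, `txt_aug`) as unimodal positives, the weights `lambda_mm` and `lambda_ss`, the temperature value, and all function names. It is not taken from the CleanCLIP repository.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(anchors, positives, temperature=0.07):
    """Mean cross-entropy of matching anchors[i] to positives[i]
    against every row of `positives` (assumed InfoNCE form)."""
    a = [normalize(v) for v in anchors]
    p = [normalize(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_sum_exp - logits[i]
    return total / len(a)


def symmetric_info_nce(xs, ys, temperature=0.07):
    """Average the loss over both matching directions (xs -> ys and ys -> xs)."""
    return 0.5 * (info_nce(xs, ys, temperature) + info_nce(ys, xs, temperature))


def combined_loss(img, txt, img_aug, txt_aug,
                  lambda_mm=1.0, lambda_ss=1.0, temperature=0.07):
    """Hypothetical combined objective: image-text alignment plus
    within-modality alignment of each embedding with its augmented view."""
    multimodal = symmetric_info_nce(img, txt, temperature)
    unimodal = (symmetric_info_nce(img, img_aug, temperature)
                + symmetric_info_nce(txt, txt_aug, temperature))
    return lambda_mm * multimodal + lambda_ss * unimodal


# Toy embeddings for two image-text pairs and their augmented views.
img = [[1.0, 0.0], [0.0, 1.0]]
txt = [[0.9, 0.1], [0.1, 0.9]]
img_aug = [[0.95, 0.05], [0.05, 0.95]]
txt_aug = [[0.8, 0.2], [0.2, 0.8]]
loss = combined_loss(img, txt, img_aug, txt_aug)
```

In this sketch the unimodal terms depend only on embeddings from a single modality, which is one way to read "independently re-aligning the representations for individual modalities": the image encoder is pulled toward consistency across image views regardless of the paired caption, and likewise for text. In practice the embeddings would come from the CLIP encoders and the loss would be computed with an automatic-differentiation framework rather than plain Python lists.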