Despite its prevalent use for zero-shot image-text matching, CLIP has been shown to be highly vulnerable to adversarial perturbations added to images. Recent studies propose to finetune the vision encoder of CLIP with adversarial samples generated on the fly, and show improved robustness against adversarial attacks on a spectrum of downstream datasets, a property termed zero-shot robustness. In this paper, we show that malicious perturbations that seek to maximise the classification loss lead to `falsely stable' images, and propose to leverage the pre-trained vision encoder of CLIP to counterattack such adversarial images during inference to achieve robustness. Our paradigm is simple and training-free, providing the first method to defend CLIP against adversarial attacks at test time, and it is orthogonal to existing methods that aim to boost the zero-shot adversarial robustness of CLIP. We conduct experiments across 16 classification datasets and demonstrate stable and consistent gains over test-time defence methods adapted from existing adversarial robustness studies that do not rely on external networks, without noticeably impairing performance on clean images. We also show that our paradigm can be applied to CLIP models that have already been adversarially finetuned to further enhance their robustness at test time. Our code is available \href{https://github.com/Sxing2/CLIP-Test-time-Counterattacks}{here}.
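To make the counterattack idea concrete, the following is a minimal PyTorch sketch of a PGD-style test-time counterattack that uses only the frozen CLIP vision encoder. The objective shown (maximising the deviation of the image embedding from that of the received, possibly adversarial input under an $\ell_\infty$ budget), the function name \texttt{counterattack}, and the values of \texttt{eps}, \texttt{alpha}, and \texttt{steps} are illustrative assumptions rather than the exact formulation of our method.

\begin{verbatim}
import torch

def counterattack(encoder, x, eps=4/255, alpha=1/255, steps=2):
    """Perturb x at inference so that its embedding moves away from the
    embedding of the (possibly adversarial) image that was received.

    encoder: frozen CLIP vision encoder mapping images to embeddings.
    x:       batch of input images in [0, 1], shape (B, 3, H, W).
    """
    with torch.no_grad():
        z_ref = encoder(x)                  # embedding of the received image
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        z = encoder(x + delta)
        # Adversarial inputs are 'falsely stable'; pushing the embedding
        # away from z_ref tends to undo the malicious perturbation.
        loss = (z - z_ref).norm(dim=-1).mean()
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()        # gradient ascent step
            delta.clamp_(-eps, eps)                   # stay in the l_inf ball
            delta.copy_((x + delta).clamp(0, 1) - x)  # keep pixels valid
        delta.grad.zero_()
    return (x + delta).detach()
\end{verbatim}

The counterattacked image is then encoded and matched against the text prompts exactly as in standard zero-shot classification, without any change to the CLIP weights.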