Large pre-trained Vision-Language Models (VLMs) such as CLIP have demonstrated excellent zero-shot generalizability across various downstream tasks. However, recent studies have shown that the inference performance of CLIP can be greatly degraded by small adversarial perturbations, especially those applied to its visual modality, posing significant safety threats. To mitigate this vulnerability, in this paper, we propose a novel defense method called Test-Time Adversarial Prompt Tuning (TAPT) to enhance the inference robustness of CLIP against visual adversarial attacks. TAPT is a test-time defense that learns defensive bimodal (textual and visual) prompts to robustify the inference process of CLIP. Specifically, it is an unsupervised method that optimizes the defensive prompts for each test sample by minimizing a multi-view entropy objective and aligning the distributions of adversarial and clean samples. We evaluate the effectiveness of TAPT on 11 benchmark datasets, including ImageNet and 10 other zero-shot datasets, demonstrating that it enhances the zero-shot adversarial robustness of the original CLIP by at least 48.9% against AutoAttack (AA), while largely maintaining performance on clean examples. Moreover, TAPT outperforms existing adversarial prompt tuning methods across various backbones, achieving an average robustness improvement of at least 36.6%.
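The multi-view entropy objective mentioned above can be illustrated with a minimal sketch: given the class logits that CLIP produces for several augmented views of one test image, we average the per-view prediction distributions and score their entropy; test-time tuning would then update the prompt parameters to lower this score. All names, shapes, and the NumPy formulation here are illustrative assumptions, not TAPT's exact implementation.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the class axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multi_view_entropy(view_logits):
    """Entropy of the prediction distribution averaged over augmented views.

    `view_logits` is assumed to have shape (n_views, n_classes), holding
    CLIP image-text similarity logits for several augmented views of a
    single test image (hypothetical interface, for illustration only).
    """
    probs = softmax(view_logits)   # per-view class probabilities
    avg = probs.mean(axis=0)       # marginal distribution across views
    return float(-(avg * np.log(avg + 1e-12)).sum())

# Confident, view-consistent predictions give low entropy...
confident = np.array([[8.0, 0.0, 0.0],
                      [7.5, 0.2, 0.1]])
# ...while maximally uncertain predictions give entropy near log(n_classes).
uncertain = np.zeros((2, 3))

assert multi_view_entropy(confident) < multi_view_entropy(uncertain)
```

A lower value of this objective indicates that the model's predictions are both confident and consistent across views, which is the signal a test-time tuner can exploit without any labels.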