Large-scale pre-trained vision-language models such as CLIP have demonstrated impressive performance across various tasks and exhibit remarkable zero-shot generalization, yet they are vulnerable to imperceptible adversarial examples. Existing works typically employ adversarial training (fine-tuning) as a defense against adversarial examples. However, applying it directly to CLIP may cause overfitting and compromise the model's capacity for generalization. In this paper, we propose Pre-trained Model Guided Adversarial Fine-Tuning (PMG-AFT), a method that leverages supervision from the original pre-trained model through a carefully designed auxiliary branch to enhance the model's zero-shot adversarial robustness. Specifically, PMG-AFT minimizes the distance between the features of adversarial examples in the target model and those in the pre-trained model, aiming to preserve the generalization features already captured by the pre-trained model. Extensive experiments on 15 zero-shot datasets demonstrate that PMG-AFT significantly outperforms the state-of-the-art method, improving top-1 robust accuracy by an average of 4.99%. Furthermore, our approach consistently improves clean accuracy by an average of 8.72%.
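To make the guidance idea concrete, the following is a minimal sketch of a loss that combines a robust classification term with a term pulling the target model's output on an adversarial example toward the frozen pre-trained model's output on the same example. The abstract does not specify the distance measure or any weighting, so the KL divergence over softmax outputs, the function name `guided_loss`, and the parameter `guide_weight` are all assumptions made for illustration, not the paper's exact formulation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def cross_entropy(probs, label, eps=1e-12):
    """Negative log-likelihood of the true class."""
    return -math.log(probs[label] + eps)


def guided_loss(target_adv_logits, pretrained_adv_logits, label, guide_weight=1.0):
    """Hypothetical guided objective (assumed form, see lead-in).

    target_adv_logits:     outputs of the model being fine-tuned on an
                           adversarial example
    pretrained_adv_logits: outputs of the frozen pre-trained model on the
                           same adversarial example
    guide_weight:          assumed trade-off coefficient
    """
    p_target = softmax(target_adv_logits)
    p_pretrained = softmax(pretrained_adv_logits)
    robust_term = cross_entropy(p_target, label)         # adversarial training term
    guide_term = kl_divergence(p_target, p_pretrained)   # stay close to the pre-trained model
    return robust_term + guide_weight * guide_term


if __name__ == "__main__":
    target = [2.0, 0.5, -1.0]
    pretrained = [1.0, 1.5, -0.5]
    print(guided_loss(target, pretrained, label=0))
```

When the two models agree on the adversarial example, the guidance term vanishes and the loss reduces to ordinary cross-entropy; the further the fine-tuned model drifts from the pre-trained one, the larger the penalty becomes.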