The vulnerability of deep neural networks to imperceptible adversarial perturbations has attracted widespread attention. Inspired by the success of vision-language foundation models, previous efforts achieved zero-shot adversarial robustness by aligning adversarial visual features with text supervision. In practice, however, these methods remain unsatisfactory because of several issues, including heavy adaptation cost, suboptimal text supervision, and uncontrolled natural generalization capacity. To address these issues, we propose a few-shot adversarial prompt framework in which adapting input sequences with limited data significantly improves adversarial robustness. Specifically, we achieve this by providing adversarially correlated text supervision that is learned end-to-end from adversarial examples. We also propose a novel training objective that enhances the consistency of multi-modal features while encouraging differentiated uni-modal features between natural and adversarial examples. The proposed framework makes it possible to learn adversarial text supervision, which provides superior cross-modal adversarial alignment and matches state-of-the-art zero-shot adversarial robustness with only 1% of the training data.
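To make the shape of such a training objective concrete, the following is a minimal sketch and not the formulation used in this work. It assumes that image and text features have already been extracted, that adversarial image features come from some attack that is not shown here, that "consistency of multi-modal features" can be approximated by a KL term between natural and adversarial image-text predictions, and that "differentiated uni-modal features" can be approximated by penalizing cosine similarity between natural and adversarial image features. The function name `sketch_objective` and the weights `lam_consist` and `lam_diff` are hypothetical.

```python
import numpy as np

def _normalize(x):
    # L2-normalize each row so that dot products are cosine similarities.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def _log_softmax(z):
    # Numerically stable log-softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def sketch_objective(img_nat, img_adv, txt, labels,
                     temperature=0.07, lam_consist=1.0, lam_diff=0.1):
    """Illustrative loss only; every term and weight here is an assumption.

    img_nat, img_adv: (batch, dim) natural / adversarial image features
    txt:              (num_classes, dim) text features, one per class
    labels:           (batch,) integer class indices
    """
    img_nat, img_adv, txt = _normalize(img_nat), _normalize(img_adv), _normalize(txt)

    # Image-to-text similarity logits, turned into log-probabilities.
    logp_nat = _log_softmax(img_nat @ txt.T / temperature)
    logp_adv = _log_softmax(img_adv @ txt.T / temperature)

    # Cross-entropy: align adversarial image features with the text supervision.
    ce_adv = -logp_adv[np.arange(len(labels)), labels].mean()

    # Assumed multi-modal consistency term: KL(natural || adversarial).
    kl = (np.exp(logp_nat) * (logp_nat - logp_adv)).sum(axis=-1).mean()

    # Assumed uni-modal differentiation term: minimizing the cosine similarity
    # pushes natural and adversarial image features apart.
    cos = (img_nat * img_adv).sum(axis=-1).mean()

    return ce_adv + lam_consist * kl + lam_diff * cos

# Usage with random stand-in features (no real model or attack involved).
rng = np.random.default_rng(0)
img_nat = rng.normal(size=(4, 8))
img_adv = img_nat + 0.1 * rng.normal(size=(4, 8))
txt = rng.normal(size=(3, 8))
labels = np.array([0, 2, 1, 0])
loss = sketch_objective(img_nat, img_adv, txt, labels)
```

In this sketch the two auxiliary terms pull in different directions by design: the KL term keeps the cross-modal predictions for natural and adversarial inputs close, while the cosine term discourages the image features themselves from collapsing onto each other. How the actual objective balances these two goals is not specified in the text above.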