Deep neural networks are vulnerable to adversarial samples. Adversarial fine-tuning methods aim to improve adversarial robustness by fine-tuning a naturally pre-trained model with adversarial training. However, we find that adversarial perturbations confuse some latent features of adversarial samples, which leads to an unexpectedly growing gap between the last-hidden-layer features of natural and adversarial samples. To address this issue, we propose a disentanglement-based approach that explicitly models and then removes the latent features responsible for the feature gap. Specifically, we introduce a feature disentangler that separates these latent features from the features of adversarial samples, and we boost robustness by eliminating them. In addition, we align the pre-trained model's features with the features of adversarial samples in the fine-tuned model, to further benefit from natural-sample features that are free of confusion. Empirical evaluations on three benchmark datasets show that our approach surpasses existing adversarial fine-tuning methods and adversarial training baselines.
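The idea of separating out and removing gap-causing features can be illustrated with a toy numeric sketch. The abstract does not specify the disentangler's architecture or the training losses, so everything below is an assumption made for illustration: the names `feature_gap`, `disentangle`, and `alignment_loss` are hypothetical, the disentangler is reduced to a hand-set per-dimension mask rather than a learned module, and the alignment term is taken to be a mean squared distance.

```python
from typing import List, Tuple

Vector = List[float]


def feature_gap(nat: Vector, adv: Vector) -> float:
    """Squared L2 distance between natural and adversarial feature vectors."""
    return sum((a - n) ** 2 for n, a in zip(nat, adv))


def disentangle(adv: Vector, mask: Vector) -> Tuple[Vector, Vector]:
    """Split adversarial features into (kept, confused) parts.

    ASSUMPTION: `mask` holds per-dimension weights in [0, 1] that mark how much
    of each dimension is treated as confused. In a real system this would come
    from a learned disentangler; here it is supplied by hand.
    """
    confused = [m * a for m, a in zip(mask, adv)]
    kept = [a - c for a, c in zip(adv, confused)]
    return kept, confused


def alignment_loss(nat_pretrained: Vector, kept: Vector) -> float:
    """ASSUMPTION: mean squared distance between the pre-trained model's
    natural features and the adversarial features left after removal."""
    return feature_gap(nat_pretrained, kept) / len(kept)


# Toy example: the perturbation only corrupts the third feature dimension.
nat = [1.0, 2.0, 0.0]
adv = [1.0, 2.0, 3.0]
mask = [0.0, 0.0, 1.0]

kept, confused = disentangle(adv, mask)
gap_before = feature_gap(nat, adv)
gap_after = feature_gap(nat, kept)
```

In this toy setting, removing the masked component closes the gap between natural and adversarial features completely. A learned disentangler would only approximate such a mask, and the alignment term would be one part of a larger fine-tuning objective.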