Deep neural networks are known to be vulnerable to well-designed adversarial attacks. Adversarial training (AT), the most successful defense technique, can achieve optimal robustness against particular attacks but generalizes poorly to unseen attacks. Adversarial purification (AP), another effective defense technique, enhances generalization but cannot achieve optimal robustness. Moreover, both methods share a common limitation: degraded standard accuracy. To mitigate these issues, we propose a novel pipeline for acquiring a robust purifier model, named Adversarial Training on Purification (AToP), which comprises two components: perturbation destruction by random transforms (RT) and fine-tuning (FT) of the purifier model with an adversarial loss. RT is essential to avoid overfitting to known attacks, yielding robustness that generalizes to unseen attacks, while FT is essential for improving robustness. To evaluate our method in an efficient and scalable way, we conduct extensive experiments on CIFAR-10, CIFAR-100, and ImageNette, demonstrating that our method achieves optimal robustness and exhibits generalization ability against unseen attacks.
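The two-component pipeline can be illustrated with a minimal toy sketch. This is not the paper's implementation: here the "purifier" is a linear map on toy vectors, random pixel dropping stands in for the random transforms (RT), and a plain reconstruction loss stands in for the adversarial loss used in fine-tuning (FT); all names and data in the sketch are our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_transform(x, drop_prob=0.3):
    # RT stand-in: randomly zero out entries to destroy the
    # structure of an adversarial perturbation.
    mask = rng.random(x.shape) > drop_prob
    return x * mask

def purifier(x, w):
    # Toy linear "purifier"; the paper uses a generator network.
    return x @ w

def atop_step(x_adv, x_clean, w, lr=0.1):
    # One FT step: purify the randomly transformed adversarial input
    # and update the purifier by gradient descent on a reconstruction
    # loss (a stand-in for the paper's adversarial loss).
    x_t = random_transform(x_adv)
    grad = x_t.T @ (purifier(x_t, w) - x_clean) / len(x_adv)
    return w - lr * grad

def eval_loss(x_adv, x_clean, w, n_draws=50):
    # Average over several random-transform draws, since RT is stochastic.
    losses = [np.mean((purifier(random_transform(x_adv), w) - x_clean) ** 2)
              for _ in range(n_draws)]
    return float(np.mean(losses))

# Assumed toy data: 8 "images" of 4 features each; Gaussian noise
# stands in for an adversarial perturbation.
x_clean = rng.random((8, 4))
x_adv = x_clean + 0.05 * rng.standard_normal((8, 4))

w = np.zeros((4, 4))                          # untrained purifier
loss_before = eval_loss(x_adv, x_clean, w)
for _ in range(200):
    w = atop_step(x_adv, x_clean, w)
loss_after = eval_loss(x_adv, x_clean, w)      # reconstruction improves
```

The key structural point the sketch preserves is that the purifier is always trained on *randomly transformed* inputs, so it cannot memorize the exact perturbation pattern of any one attack.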