Deep neural networks are known to be vulnerable to well-designed adversarial attacks. The most successful defense technique, adversarial training (AT), can achieve optimal robustness against particular attacks but does not generalize well to unseen attacks. Another effective defense technique, adversarial purification (AP), improves generalization but cannot achieve optimal robustness. Moreover, both methods share a common limitation: degraded standard accuracy. To mitigate these issues, we propose a novel framework called Adversarial Training on Purification (AToP), which comprises two components: perturbation destruction by random transforms (RT) and fine-tuning (FT) of the purifier model with an adversarial loss. RT is essential for avoiding overlearning to known attacks, which yields robustness that generalizes to unseen attacks, and FT is essential for improving robustness. To evaluate our method in an efficient and scalable way, we conduct extensive experiments on CIFAR-10, CIFAR-100, and ImageNette, demonstrating that our method achieves state-of-the-art results and generalizes to unseen attacks.
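The two components named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify the transform, the purifier architecture, or the exact form of the adversarial loss. The sketch assumes random feature masking as the RT, a linear map as a stand-in purifier, a frozen linear softmax classifier, a one-step sign-gradient attack, and a loss of the form reconstruction + lam * classification. All function names and parameters (`random_mask`, `fgsm_linear`, `atop_style_loss`, `drop_prob`, `lam`) are hypothetical.

```python
import numpy as np


def random_mask(x, drop_prob, rng):
    """RT stand-in (assumption): zero each feature independently with probability drop_prob."""
    keep = rng.random(x.shape) >= drop_prob
    return x * keep


def softmax(logits):
    shifted = logits - logits.max(axis=1, keepdims=True)  # subtract row max for stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def cross_entropy(probs, labels):
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.log(picked + 1e-12).mean())


def fgsm_linear(x, labels, purifier_w, classifier_w, eps):
    """One-step sign-gradient attack on the linear purifier + classifier stack.

    The random transform is not applied here; the attacker sees a fixed pipeline.
    """
    combined = purifier_w @ classifier_w            # shape (features, classes)
    probs = softmax(x @ combined)
    onehot = np.eye(classifier_w.shape[1])[labels]
    # Gradient of the cross-entropy w.r.t. x, up to a positive 1/N factor
    # that does not change the sign.
    grad_x = (probs - onehot) @ combined.T
    return x + eps * np.sign(grad_x)


def atop_style_loss(x_clean, x_adv, labels, purifier_w, classifier_w,
                    drop_prob, lam, rng):
    """Assumed fine-tuning objective: reconstruction + lam * classification loss."""
    x_rt = random_mask(x_adv, drop_prob, rng)       # RT: destroy part of the perturbation
    x_pur = x_rt @ purifier_w                       # purifier restores the input
    recon = float(np.mean((x_pur - x_clean) ** 2))  # stay close to the clean example
    adv = cross_entropy(softmax(x_pur @ classifier_w), labels)  # frozen classifier
    return recon + lam * adv, recon, adv


rng = np.random.default_rng(0)
n, d, k = 4, 8, 3                                   # samples, features, classes
x = rng.random((n, d))
y = rng.integers(0, k, size=n)
P = np.eye(d)                                       # purifier initialised as identity
W = rng.normal(size=(d, k))                         # frozen classifier weights

x_adv = fgsm_linear(x, y, P, W, eps=0.03)
total, recon, adv = atop_style_loss(x, x_adv, y, P, W,
                                    drop_prob=0.25, lam=0.1, rng=rng)
print(f"total={total:.4f} recon={recon:.4f} adv={adv:.4f}")
```

In a real setting only the purifier's parameters would be updated against this objective while the classifier stays fixed, and the transform is resampled on every forward pass so that the purifier cannot tailor itself to one fixed perturbation pattern.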