Despite remarkable achievements in deep learning across various domains, its inherent vulnerability to adversarial examples remains a critical concern for practical deployment. Adversarial training has emerged as one of the most effective defensive techniques for improving model robustness against such malicious inputs. However, existing adversarial training schemes often generalize poorly to diverse underlying adversaries because they rely on a point-by-point augmentation strategy that maps each clean example to a single adversarial counterpart during training. In addition, adversarial examples can significantly disrupt the statistical information w.r.t. the target model, introducing substantial uncertainty into any attempt to model the distribution of adversarial examples. To circumvent these issues, in this paper, we propose a novel uncertainty-aware distributional adversarial training method, which enforces adversary modeling by leveraging both the statistical information of adversarial examples and their corresponding uncertainty estimates, with the goal of augmenting the diversity of adversaries. Considering the potentially negative impact of aligning adversaries to misclassified clean examples, we also refine the alignment reference based on statistical proximity to clean examples during adversarial training, thereby reframing adversarial training as a distribution-to-distribution matching problem between the clean and adversarial domains. Furthermore, we design an introspective gradient alignment approach that matches input gradients between these domains without introducing external models. Extensive experiments across four benchmark datasets and various network architectures demonstrate that our approach achieves state-of-the-art adversarial robustness while maintaining performance on natural examples.
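The introspective gradient alignment idea can be illustrated with a minimal, self-contained sketch. The snippet below is an illustrative assumption, not the paper's implementation: it uses a toy linear classifier (so the input gradient has a closed form) and a cosine-similarity penalty between the clean and adversarial input gradients; the function names and the random perturbation standing in for an attack are all hypothetical.

```python
import numpy as np

def softmax(z):
    # numerically stable softmax
    e = np.exp(z - z.max())
    return e / e.sum()

def input_grad(W, x, y):
    # Gradient of the cross-entropy loss w.r.t. the input x
    # for a linear classifier with logits = W @ x (closed form).
    p = softmax(W @ x)
    onehot = np.zeros_like(p)
    onehot[y] = 1.0
    return W.T @ (p - onehot)

def grad_alignment_loss(g_clean, g_adv, eps=1e-12):
    # 1 - cosine similarity between the two input gradients:
    # 0 when perfectly aligned, up to 2 when exactly opposed.
    num = float(g_clean @ g_adv)
    den = np.linalg.norm(g_clean) * np.linalg.norm(g_adv) + eps
    return 1.0 - num / den

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 5))          # toy 3-class linear model
x = rng.normal(size=5)               # clean example
x_adv = x + 0.1 * rng.normal(size=5) # stand-in for an adversarial perturbation
y = 1

g_c = input_grad(W, x, y)
g_a = input_grad(W, x_adv, y)
loss = grad_alignment_loss(g_c, g_a)
```

In a real training loop this penalty would be added to the robust classification objective, encouraging the model's local loss geometry to agree across the clean and adversarial domains without any auxiliary network.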