Adversarial training (AT) is widely considered the state-of-the-art technique for improving the robustness of deep neural networks (DNNs) against adversarial examples (AEs). Nevertheless, recent studies have revealed that adversarially trained models are prone to unfairness problems, which restricts their applicability. In this paper, we empirically observe that this limitation may be attributed to serious adversarial confidence overfitting, i.e., the model being overconfident on certain adversarial examples. To alleviate this problem, we propose HAM, a straightforward yet effective framework based on adaptive Hard Adversarial example Mining. HAM concentrates on mining hard adversarial examples while discarding the easy ones in an adaptive fashion. Specifically, HAM identifies hard AEs by the step size each needs to cross the decision boundary when calculating the loss value. In addition, an early-dropping mechanism discards the easy examples at the initial stages of AE generation, which makes AT more efficient. Extensive experimental results on CIFAR-10, SVHN, and Imagenette demonstrate that HAM achieves significant improvement in robust fairness while reducing computational cost compared to several state-of-the-art adversarial training methods. The code will be made publicly available.
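To make the two ideas in the abstract concrete (measuring how much attack effort an example needs to cross the decision boundary, and dropping examples early during AE generation), here is a minimal toy sketch. It is not the authors' implementation. It assumes a binary linear classifier, an L-infinity sign-gradient attack, and the number of fixed-size steps as a stand-in for the step size. The function names and the `warmup` and `drop_margin` parameters are hypothetical, and the drop rule (discard an example whose margin is still large after the warm-up steps) is only one possible reading of "easy"; the abstract does not define the criterion.

```python
def margin(w, b, x, y):
    """Signed margin y * (w.x + b); a non-positive value means misclassified."""
    return y * (sum(wi * xi for wi, xi in zip(w, x)) + b)


def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def attack_step(w, x, x0, y, alpha, eps):
    """One sign-gradient step that lowers the margin, clipped to the eps-ball around x0."""
    stepped = [xi - alpha * y * sign(wi) for xi, wi in zip(x, w)]
    return [min(max(si, x0i - eps), x0i + eps) for si, x0i in zip(stepped, x0)]


def mine_hard_examples(w, b, data, alpha=0.25, eps=0.5,
                       max_steps=10, warmup=2, drop_margin=1.0):
    """Return {index: steps_to_cross} for examples that cross the boundary.

    Assumed (hypothetical) early-drop rule: an example whose margin still
    exceeds drop_margin after `warmup` steps is discarded without finishing
    the attack. Examples that never cross within max_steps are also omitted.
    """
    kept = {}
    for i, (x0, y) in enumerate(data):
        x = list(x0)
        for step in range(1, max_steps + 1):
            x = attack_step(w, x, x0, y, alpha, eps)
            m = margin(w, b, x, y)
            if m <= 0:
                kept[i] = step  # crossed the decision boundary at this step
                break
            if step == warmup and m > drop_margin:
                break  # early drop: stop spending attack steps on this example
    return kept


if __name__ == "__main__":
    toy_data = [([0.25, 0.25], 1), ([0.5, 0.5], 1), ([3.0, 3.0], 1), ([0.75, 0.75], 1)]
    print(mine_hard_examples([1.0, 1.0], 0.0, toy_data))
```

The returned step counts could then serve as a per-example hardness signal when computing the training loss; how HAM actually maps this signal to the loss is not stated in the abstract, so that part is left out of the sketch.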