Deep neural networks (DNNs) have been shown to be vulnerable to adversarial examples. Adversarial training (AT) is a popular and effective defense against adversarial attacks. Recent works (Benz et al., 2020; Xu et al., 2021; Tian et al., 2021) have shown that a robust model well trained by AT exhibits a remarkable disparity in robustness among classes, and have proposed various methods to obtain consistent robust accuracy across classes. Unfortunately, these methods sacrifice a good deal of average robust accuracy. Accordingly, this paper proposes a novel framework of worst-class adversarial training and leverages no-regret dynamics to solve the resulting problem. Our goal is a classifier that performs well on the worst class while sacrificing only a little average robust accuracy. We then rigorously analyze the theoretical properties of the proposed algorithm, along with its generalization error bound in terms of the worst-class robust risk. Furthermore, we propose a measure for evaluating the proposed method in terms of both average and worst-class accuracy. Experiments on various datasets and networks show that the proposed method outperforms state-of-the-art approaches.
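To make the phrase "no-regret dynamics" concrete, the following is a minimal sketch of one well-known no-regret rule, the Hedge (multiplicative weights) update, applied to a weight vector over classes. It is an illustration under stated assumptions and not necessarily the algorithm proposed in the paper: the function name `hedge_update`, the step size `eta`, and the per-class loss values are all hypothetical.

```python
import math


def hedge_update(weights, class_losses, eta=0.5):
    """One Hedge (multiplicative-weights) step over class weights.

    Classes with a higher robust loss receive exponentially more weight,
    so a weighted training objective shifts toward the worst class.
    `eta` is an assumed step size, not a value taken from the paper.
    """
    scaled = [w * math.exp(eta * loss) for w, loss in zip(weights, class_losses)]
    total = sum(scaled)
    return [s / total for s in scaled]  # renormalize onto the probability simplex


# Hypothetical per-class robust losses for a 3-class problem (illustrative only).
losses = [0.2, 0.9, 0.4]
weights = [1.0 / 3.0] * 3
for _ in range(10):
    weights = hedge_update(weights, losses)
```

In a full training loop the per-class losses would be re-estimated from adversarial examples after each model update, so the weights would track whichever class is currently worst rather than a fixed loss vector as in this toy example.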