The disparity in accuracy between classes that arises in standard training is amplified during adversarial training, a phenomenon termed the robust fairness problem. Existing methods aim to improve robust fairness by sacrificing the model's performance on easier classes in order to improve its performance on harder ones. However, we observe that under adversarial attack, most of the model's predictions for samples from the worst class are biased towards classes similar to the worst class, rather than towards the easy classes. Through theoretical and empirical analysis, we demonstrate that robust fairness deteriorates as the distance between classes decreases. Motivated by these insights, we introduce Distance-Aware Fair Adversarial training (DAFA), a method that addresses robust fairness by taking the similarities between classes into account. Specifically, our method assigns a distinct loss weight and adversarial margin to each class and adjusts them to encourage a trade-off in robustness among similar classes. Experimental results across various datasets demonstrate that our method not only maintains average robust accuracy but also significantly improves worst-class robust accuracy, indicating a marked improvement in robust fairness over existing methods.
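To make the idea of class-wise weights concrete, the following is a minimal sketch, not the exact update rule of the method. It assumes that class similarity can be approximated by the mean predicted probability that samples of one class are assigned to another, and that the same per-class weight scales both the loss and the adversarial margin. The function names, the `lam` step size, and the `base_eps` value are hypothetical and chosen only for illustration.

```python
import numpy as np


def similarity_aware_weights(probs, labels, num_classes, lam=1.0):
    """Illustrative rule (an assumption, not the paper's formula).

    probs:  (n_samples, num_classes) softmax outputs on adversarial inputs
    labels: (n_samples,) integer class labels
    Returns one weight per class; the weights sum to num_classes.
    """
    # P[i, j]: mean probability that samples of true class i are assigned to class j
    P = np.zeros((num_classes, num_classes))
    for c in range(num_classes):
        P[c] = probs[labels == c].mean(axis=0)

    diag = np.diag(P)  # per-class confidence in the correct class
    w = np.ones(num_classes)
    for i in range(num_classes):
        for j in range(num_classes):
            if i == j:
                continue
            if diag[i] < diag[j]:
                # i is harder than j: i gains weight in proportion to
                # how strongly i is confused with j
                w[i] += lam * P[i, j]
            elif diag[i] > diag[j]:
                # i is easier than j: i gives up the same amount j gained
                w[i] -= lam * P[j, i]
    return w


def apply_class_weights(per_sample_loss, labels, w, base_eps=8 / 255):
    """Scale each sample's loss and attack radius by its class weight."""
    weighted_loss = float(np.mean(w[labels] * per_sample_loss))
    per_sample_eps = base_eps * w[labels]
    return weighted_loss, per_sample_eps


# Toy example: classes 0 and 1 are similar to each other, class 2 is easy.
labels = np.array([0, 0, 1, 1, 2, 2])
probs = np.array([
    [0.40, 0.50, 0.10],
    [0.40, 0.50, 0.10],
    [0.20, 0.70, 0.10],
    [0.20, 0.70, 0.10],
    [0.05, 0.05, 0.90],
    [0.05, 0.05, 0.90],
])
weights = similarity_aware_weights(probs, labels, num_classes=3)
print(weights)  # approximately [1.6, 0.6, 0.8]
```

In this toy case the hardest class (0) takes most of its extra weight from the class it is confused with (1), while the easy but dissimilar class (2) gives up comparatively little, which is the kind of trade-off among similar classes that the abstract describes.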