RobustFair: Adversarial Evaluation through Fairness Confusion Directed Gradient Search

The trustworthiness of deep neural networks (DNNs) is often challenged by their vulnerability to minor adversarial perturbations, which may not only undermine prediction accuracy (robustness) but also cause biased predictions for similar inputs (individual fairness). Accurate fairness was recently proposed to enforce a harmonic balance between accuracy and individual fairness. It induces the notion of a fairness confusion matrix, which categorizes predictions as true fair, true biased, false fair, or false biased. This paper proposes RobustFair, a harmonic evaluation approach for the accurate fairness of DNNs that uses adversarial perturbations crafted through fairness confusion directed gradient search. By using Taylor expansions to approximate the ground truths of adversarial instances, RobustFair can in particular identify robustness defects entangled with spurious fairness, which are often elusive in robustness evaluation and missed in individual fairness evaluation. By identifying robustness and fairness defects simultaneously, RobustFair strengthens both robustness evaluation and individual fairness evaluation. Empirical case studies on fairness benchmark datasets show that, compared with state-of-the-art white-box robustness and individual fairness testing approaches, RobustFair detects 1.77-11.87 times as many adversarial perturbations, yielding 1.83-13.12 times as many biased instances and 1.53-8.22 times as many false instances. The adversarial instances can then be exploited through retraining to improve the accurate fairness (and hence the accuracy and individual fairness) of the original deep neural network. The case studies further show that the adversarial instances identified by RobustFair outperform those identified by the other testing approaches, improving accurate fairness by 21% and individual fairness by 19% on multiple sensitive attributes, without any loss of accuracy and in some cases with an accuracy gain of up to 4%.
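
To make the four categories of the fairness confusion matrix concrete, the following is a minimal sketch, not the paper's implementation. It assumes that "true"/"false" records whether the prediction for an instance matches its ground truth, and that "fair"/"biased" records whether every similar counterpart (an instance differing only in sensitive attributes) receives the same prediction. The abstract names the categories but does not spell out these definitions, so both assumptions, as well as the function names `fairness_confusion_category` and `fairness_confusion_matrix`, are hypothetical.

```python
from collections import Counter
from typing import Hashable, Iterable, Sequence, Tuple


def fairness_confusion_category(
    pred: Hashable,
    counterpart_preds: Sequence[Hashable],
    truth: Hashable,
) -> str:
    """Classify one prediction into a fairness confusion category.

    Assumed definitions (not stated in the abstract):
      - accurate: the prediction equals the ground truth
      - fair: all similar counterparts receive the same prediction
    """
    accurate = pred == truth
    fair = all(p == pred for p in counterpart_preds)
    if accurate and fair:
        return "true fair"
    if accurate:
        return "true biased"
    if fair:
        return "false fair"
    return "false biased"


def fairness_confusion_matrix(
    records: Iterable[Tuple[Hashable, Sequence[Hashable], Hashable]],
) -> Counter:
    """Tally categories over (pred, counterpart_preds, truth) records."""
    return Counter(
        fairness_confusion_category(pred, others, truth)
        for pred, others, truth in records
    )


# Toy binary-classification records: (prediction, counterpart predictions, truth).
toy_records = [
    (1, [1, 1], 1),  # accurate, consistent across counterparts
    (1, [0, 1], 1),  # accurate, but one counterpart is treated differently
    (0, [0, 0], 1),  # consistent, but wrong
    (0, [1, 0], 1),  # wrong and inconsistent
]
matrix = fairness_confusion_matrix(toy_records)
```

Under these assumptions, a "false fair" prediction is one that looks fair yet is inaccurate, which is one plausible reading of the spurious fairness that the abstract says robustness-only or fairness-only evaluation tends to miss.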
