RobustFair: Adversarial Evaluation through Fairness Confusion Directed Gradient Search

Deep neural networks (DNNs) are vulnerable to various adversarial perturbations, including false perturbations that undermine prediction accuracy and biased perturbations that cause biased predictions for similar inputs. This paper introduces RobustFair, a novel approach to evaluating the accurate fairness of DNNs subjected to these false or biased perturbations. RobustFair uses the fairness confusion matrix induced by accurate fairness to identify the input features that are crucial for perturbation. This matrix categorizes predictions as true fair, true biased, false fair, or false biased, and the perturbations guided by it have a dual impact on instances and their similar counterparts: they either undermine prediction accuracy (robustness) or cause biased predictions (individual fairness). RobustFair then infers the ground truth of the generated adversarial instances from their loss function values, which are approximated by the total derivative. To leverage the generated instances for improving trustworthiness, RobustFair further proposes a strategy that prioritizes adversarial instances resembling the original training set for data augmentation and model retraining. Notably, RobustFair excels at detecting intertwined robustness and individual fairness issues, which standard robustness and individual fairness evaluations frequently overlook. This capability lets RobustFair strengthen both kinds of evaluation by identifying defects in either domain concurrently. Empirical case studies and quantile regression analyses on benchmark datasets demonstrate the effectiveness of fairness confusion matrix guided perturbation for generating false or biased adversarial instances.
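The four categories of the fairness confusion matrix can be illustrated with a minimal sketch. It assumes, as the abstract does not spell out, that "true"/"false" records whether the prediction for an instance matches its ground truth, and that "fair"/"biased" records whether all similar counterparts receive the same prediction. The function name `fairness_confusion_category` and its parameters are hypothetical and are not taken from RobustFair.

```python
from typing import Hashable, Sequence


def fairness_confusion_category(
    y_true: Hashable,
    y_pred: Hashable,
    similar_preds: Sequence[Hashable],
) -> str:
    """Place one prediction in a cell of the fairness confusion matrix.

    Assumed reading of the categories (not confirmed by the abstract):
    - "true"/"false": the prediction matches / differs from the ground truth.
    - "fair"/"biased": all similar counterparts get the same prediction / not.
    """
    # Accuracy axis: compare the prediction with the instance's ground truth.
    accurate = y_pred == y_true
    # Fairness axis: every similar counterpart must share the same prediction.
    fair = all(p == y_pred for p in similar_preds)
    return f"{'true' if accurate else 'false'} {'fair' if fair else 'biased'}"


# Ground truth 1, predicted 1, but one similar counterpart is predicted 0.
print(fairness_confusion_category(1, 1, [1, 0]))
```

Under this reading, a false perturbation moves a prediction from a "true" cell to a "false" cell, and a biased perturbation moves it from a "fair" cell to a "biased" cell.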
