Deep neural networks (DNNs) are vulnerable to adversarial perturbations that degrade both predictive accuracy and individual fairness, posing critical risks in high-stakes online decision-making. The relationship between these two dimensions of robustness remains poorly understood. To bridge this gap, we introduce robust individual fairness (RIF), which requires that similar individuals receive predictions consistent with the same ground truth even under adversarial manipulation. To evaluate and expose violations of RIF, we propose RIFair, an attack framework that applies identical perturbations to similar individuals to induce accuracy or fairness failures. We further introduce the perturbation impact index (PII) and the perturbation impact direction (PID) to quantify and explain why identical perturbations produce unequal effects on individuals who should behave similarly. Experiments across diverse model architectures and real-world web datasets reveal that existing robustness metrics capture distinct and often incompatible failure modes in accuracy and fairness. We find that many online applicants are simultaneously vulnerable to multiple types of adversarial failures, and that inaccurate or unfair outcomes arise because similar individuals share the same PID but have sharply different PIIs, leading to divergent prediction-change trajectories in which some individuals cross decision boundaries earlier than others. Finally, we demonstrate that adversarial examples generated by RIFair can strategically manipulate test-set accuracy or fairness by replacing only a small subset of items, creating misleading impressions of model performance. These findings expose fundamental limitations in current robustness evaluations and highlight the need to jointly assess accuracy and fairness under adversarial perturbations in high-stakes online decision-making.