Machine learning algorithms in socially sensitive domains (e.g., credit decisions) are often made fair by equalizing predictive outcomes across groups. However, satisfying outcome-based fairness metrics does not guarantee that a model uses the same reasoning for different groups. We show that existing outcome-fair models can still apply fundamentally different reasoning to individuals, a ``hidden procedural bias'' that standard fairness metrics and algorithms miss. We propose Counterfactual Explanation Consistency (CEC), a framework that detects and mitigates this bias by aligning feature attributions between individuals and their counterfactual counterparts. Key contributions include a nearest-neighbor counterfactual generation method, a modified baseline for integrated-gradient comparisons, an individual-level procedural fairness metric, and a corresponding training loss. We also introduce a taxonomy that identifies ``Regime B'' (same outcome, different reasoning) as a critical blind spot. Experiments on synthetic data, German Credit, Adult Income, and HMDA mortgage data show that outcome-fair baselines exhibit substantial hidden bias, while CEC substantially reduces it at a modest cost in utility.
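The idea of comparing attributions between an individual and a counterfactual counterpart can be illustrated with a small sketch. This is not the authors' implementation: the toy logistic model with group-specific weights `W`, the all-zeros integrated-gradients baseline (standing in for the paper's modified baseline, which the abstract does not specify), the Euclidean nearest-neighbor match, and the L1 `attribution_gap` score are all assumptions made for illustration. The example constructs a ``Regime B'' case: two individuals with identical features receive the same prediction, yet the model attributes that prediction to different features.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def predict(x, g, W, b):
    """Toy logistic model whose weights depend on the group g (hypothetical)."""
    return sigmoid(x @ W[g] + b)


def integrated_gradients(x, g, W, b, baseline, steps=64):
    """Midpoint Riemann approximation of integrated gradients for the toy model."""
    alphas = (np.arange(steps) + 0.5) / steps
    path = baseline + alphas[:, None] * (x - baseline)   # points on the straight path
    p = sigmoid(path @ W[g] + b)
    grads = (p * (1.0 - p))[:, None] * W[g]              # d sigmoid(w.x + b) / dx
    return (x - baseline) * grads.mean(axis=0)


def nearest_counterfactual(i, X, group):
    """Index of the closest individual (Euclidean) belonging to a different group."""
    others = np.flatnonzero(group != group[i])
    dists = np.linalg.norm(X[others] - X[i], axis=1)
    return int(others[np.argmin(dists)])


def attribution_gap(i, X, group, W, b, baseline):
    """L1 distance between the attributions of individual i and its counterpart."""
    j = nearest_counterfactual(i, X, group)
    a_i = integrated_gradients(X[i], group[i], W, b, baseline)
    a_j = integrated_gradients(X[j], group[j], W, b, baseline)
    return float(np.abs(a_i - a_j).sum()), j


# Group 0 is scored on feature 0 only, group 1 on feature 1 only.
W = np.array([[2.0, 0.0],
              [0.0, 2.0]])
b = 0.0
X = np.array([[1.0, 1.0],
              [1.0, 1.0],
              [3.0, 0.0]])
group = np.array([0, 1, 1])
baseline = np.zeros(2)  # assumed baseline, not the paper's modified one

gap, j = attribution_gap(0, X, group, W, b, baseline)
same_outcome = np.isclose(predict(X[0], group[0], W, b),
                          predict(X[j], group[j], W, b))
print("counterpart index:", j)
print("same outcome:", bool(same_outcome))
print("attribution gap:", gap)
```

An outcome-based metric sees no difference between the matched pair here, while the attribution gap is clearly nonzero. A training loss in the spirit of CEC could penalize such a gap, although the exact form used in the paper is not given in the abstract.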