Fairness of Explanations in Artificial Intelligence (AI): A Unifying Framework, Axioms, and Future Direction toward Responsible AI

Machine learning algorithms are being used in high-stakes decisions, including those in criminal justice, healthcare, credit, and employment. The research community has responded with two largely independent fields: \emph{algorithmic fairness}, which targets equitable outcomes, and \emph{explainable AI} (XAI), which targets interpretable reasoning. This survey identifies and maps a novel blind spot at their intersection: a model can satisfy every standard fairness criterion in its outputs while being profoundly unfair in its \emph{reasoning process}. We refer to this as procedural bias, and mitigating it requires treating the fairness of explanations as a distinct object of scientific study. To our knowledge, we provide the first unified theoretical treatment and literature review of this emerging field, and we elucidate the limitations of post-hoc explainers in certifying explanation fairness. Our central contribution is a \emph{conditional invariance framework} that formalizes explanation fairness as the requirement that, conditional on the task-relevant features, the distribution of explanations be invariant to the protected attribute: $P(E(X) \in \cdot \mid X_\text{rel} = x_\text{rel},\, A = a) = P(E(X) \in \cdot \mid X_\text{rel} = x_\text{rel},\, A = b)$ for all task-relevant values $x_\text{rel}$ and all protected-attribute values $a, b$. All existing explanation fairness metrics emerge from this single principle as partial operationalizations. We introduce a seven-dimensional taxonomy, identify three generative mechanisms of explanation inequity (representation-driven, explanation-model mismatch, actionability-driven), and propose a canonical six-step evaluation workflow for operationalizing explanation fairness audits in practice.
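
As an illustration only, the conditional invariance condition can be probed empirically with a simple stratified check. The sketch below is not a metric from this survey: the function name `conditional_invariance_gap`, the assumption that $X_\text{rel}$ has already been discretized into strata, and the choice to compare only group means of attribution vectors are all hypothetical. Equal conditional means are necessary but not sufficient for the full distributional equality above, so a gap near zero does not certify explanation fairness.

```python
import numpy as np

def conditional_invariance_gap(expl, x_rel_stratum, a):
    """Size-weighted average, over strata of X_rel, of the largest
    between-group difference in mean explanation (L-infinity norm).

    expl          : (n, d) array of explanation vectors E(X), e.g. attributions
    x_rel_stratum : (n,) stratum label for each row (assumed discretized X_rel)
    a             : (n,) protected-attribute value for each row
    """
    expl = np.asarray(expl, dtype=float)
    strata = np.asarray(x_rel_stratum)
    a = np.asarray(a)

    total, weight = 0.0, 0
    for s in np.unique(strata):
        in_s = strata == s
        groups = np.unique(a[in_s])
        if len(groups) < 2:
            continue  # only one group present: nothing to compare here
        # Mean explanation per protected group within this stratum.
        means = np.stack([expl[in_s & (a == g)].mean(axis=0) for g in groups])
        # Largest pairwise gap across groups in any explanation coordinate.
        gap = (means.max(axis=0) - means.min(axis=0)).max()
        n = int(in_s.sum())
        total += n * gap
        weight += n
    # NaN signals that no stratum contained two or more groups.
    return total / weight if weight else float("nan")

# Toy data (made up): explanations match across groups in stratum 0
# and differ in stratum 1.
expl = [[0.5, 0.1], [0.5, 0.1], [0.2, 0.4], [0.6, 0.0]]
strata = [0, 0, 1, 1]
groups = ["a", "b", "a", "b"]
gap = conditional_invariance_gap(expl, strata, groups)
print(gap)
```

On this toy input the first stratum contributes no gap and the second contributes a gap of about 0.4, so the size-weighted result is about 0.2. Distributional tests within each stratum would be a closer match to the stated condition, at the cost of needing more samples per stratum.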
