In recent years, explaining the decisions of complex machine learning models has become essential in high-stakes domains such as energy systems, healthcare, finance, and autonomous systems. However, the reliability of these explanations, that is, whether they remain stable and consistent under realistic, non-adversarial changes, has gone largely unmeasured. Widely used methods such as SHAP and Integrated Gradients (IG) are well motivated by axiomatic notions of attribution, yet their explanations can vary substantially even under benign, system-level conditions, including small input perturbations, correlated feature representations, and minor model updates. Such variability undermines trust, since a reliable explanation should remain consistent across equivalent input representations and small, performance-preserving model changes. We introduce the Explanation Reliability Index (ERI), a family of metrics that quantifies explanation stability under four reliability axioms: robustness to small input perturbations, consistency under feature redundancy, smoothness across model evolution, and resilience to mild distributional shifts. For each axiom we derive formal guarantees, including Lipschitz-type bounds and temporal stability results. We further propose ERI-T, a dedicated measure of temporal reliability for sequential models, and introduce ERI-Bench, a benchmark designed to systematically stress-test explanation reliability across synthetic and real-world datasets. Our experiments reveal widespread reliability failures in popular explanation methods, showing that explanations can be unstable under realistic deployment conditions. By exposing and quantifying these instabilities, ERI enables principled assessment of explanation reliability and supports more trustworthy explainable AI (XAI) systems.
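To make the input-perturbation axiom concrete, the sketch below shows one plausible way such a robustness score could be computed: attribute a prediction, re-attribute under small Gaussian input noise, and average the similarity of the resulting attribution vectors. This is a minimal illustration, not the paper's definition of ERI; the function name eri_robustness, the use of gradient-times-input attributions for a linear model, and the choice of cosine similarity are all assumptions made for the example.

```python
# Minimal sketch of an ERI-style robustness score (input-perturbation axiom).
# Assumptions: gradient-times-input attributions on a linear model and cosine
# similarity as the stability measure; the actual ERI definitions may differ.
import numpy as np

def grad_times_input(weights, x):
    """Attribution for a linear model f(x) = w . x: the gradient (= w) times the input."""
    return weights * x

def cosine(a, b, eps=1e-12):
    """Cosine similarity between two attribution vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def eri_robustness(weights, x, sigma=0.01, n_samples=100, seed=0):
    """Average attribution similarity under small Gaussian input perturbations.

    Scores near 1 indicate explanations that are stable around x; lower
    scores flag instability under realistic, non-adversarial noise.
    """
    rng = np.random.default_rng(seed)
    base = grad_times_input(weights, x)
    sims = []
    for _ in range(n_samples):
        x_pert = x + sigma * rng.standard_normal(x.shape)
        sims.append(cosine(base, grad_times_input(weights, x_pert)))
    return float(np.mean(sims))

if __name__ == "__main__":
    w = np.array([0.5, -1.2, 2.0])
    x = np.array([1.0, 0.3, -0.7])
    print(f"ERI robustness (sketch): {eri_robustness(w, x):.4f}")
```

Under this reading, a Lipschitz-type guarantee would bound how far the score can fall for a given perturbation scale sigma; the other three axioms (redundancy, model evolution, distributional shift) would swap in different perturbation operators while keeping the same compare-attributions structure.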