No Single Metric Tells the Whole Story: A Multi-Dimensional Evaluation Framework for Uncertainty Attributions

Research on explainable AI (XAI) has frequently focused on explaining model predictions. More recently, methods have been proposed to explain prediction uncertainty by attributing it to input features (uncertainty attributions). However, the evaluation of these methods remains inconsistent: studies rely on heterogeneous proxy tasks and metrics, which hinders comparability. We address this by aligning uncertainty attributions with the well-established Co-12 framework for XAI evaluation. We propose concrete implementations for the correctness, consistency, continuity, and compactness properties. Additionally, we introduce conveyance, a property tailored to uncertainty attributions that evaluates whether controlled increases in epistemic uncertainty reliably propagate to feature-level attributions. We demonstrate our evaluation framework with eight metrics across combinations of uncertainty quantification and feature attribution methods on tabular and image data. Our experiments show that gradient-based methods consistently outperform perturbation-based approaches in consistency and conveyance, while Monte Carlo DropConnect outperforms Monte Carlo dropout in most metrics. Although most metrics rank the methods consistently across samples, inter-method agreement remains low. This suggests that no single metric sufficiently evaluates uncertainty attribution quality. The proposed evaluation framework establishes a foundation for the systematic comparison and development of uncertainty attribution methods.
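
To make the idea of checking whether different metrics agree on a ranking concrete, the following is a minimal sketch and not the procedure used in this work. It assumes that each metric yields one score per method (higher is better) and uses Spearman rank correlation as the agreement measure; the helper names (`ranks`, `spearman`, `pairwise_agreement`) and the toy scores are hypothetical and chosen purely for illustration.

```python
from itertools import combinations


def ranks(scores):
    """Return 1-based ranks (1 = highest score); tied scores share the average rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    result = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next score ties with the current one.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = avg
        pos = end + 1
    return result


def spearman(a, b):
    """Spearman correlation, computed as the Pearson correlation of the rank vectors.

    Assumes neither input is constant (otherwise the denominator is zero).
    """
    ra, rb = ranks(a), ranks(b)
    n = len(ra)
    mean_a, mean_b = sum(ra) / n, sum(rb) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ra, rb))
    var_a = sum((x - mean_a) ** 2 for x in ra)
    var_b = sum((y - mean_b) ** 2 for y in rb)
    return cov / (var_a * var_b) ** 0.5


def pairwise_agreement(metric_scores):
    """Map each pair of metric names to the Spearman correlation of their method scores."""
    return {
        (m1, m2): spearman(metric_scores[m1], metric_scores[m2])
        for m1, m2 in combinations(metric_scores, 2)
    }


# Toy, made-up scores for four hypothetical methods under three hypothetical metrics.
toy_scores = {
    "metric_1": [0.9, 0.7, 0.4, 0.2],
    "metric_2": [0.8, 0.6, 0.5, 0.1],
    "metric_3": [0.1, 0.3, 0.6, 0.9],
}

if __name__ == "__main__":
    for pair, rho in pairwise_agreement(toy_scores).items():
        print(pair, round(rho, 3))
```

In this toy input, `metric_1` and `metric_2` order the methods identically, while `metric_3` reverses the order; low or negative pairwise correlations of this kind are what would indicate that a single metric does not capture the full picture.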
