Explainability is widely regarded as essential for trustworthy artificial intelligence systems. However, counterfactual explanations are commonly evaluated with algorithmic metrics that are rarely validated against human judgments of explanation quality. This raises the question of whether such metrics meaningfully reflect user perceptions. We address this question through an empirical study that directly compares algorithmic evaluation metrics with human judgments across three datasets. Participants rated counterfactual explanations along multiple dimensions of perceived quality, and we relate these ratings to a comprehensive set of standard counterfactual metrics. We analyze both the relationships between individual metrics and human ratings and the extent to which combinations of metrics can predict human assessments. Our results show that correlations between algorithmic metrics and human ratings are generally weak and strongly dataset-dependent. Moreover, increasing the number of metrics used in predictive models does not lead to reliable improvements, indicating structural limitations in how current metrics capture criteria relevant to humans. Overall, our findings suggest that widely used counterfactual evaluation metrics fail to reflect key aspects of explanation quality as perceived by users, underscoring the need for more human-centered approaches to evaluating explainable artificial intelligence.