Explainable AI (XAI) methods are commonly evaluated with functional metrics such as correctness, which computationally estimate how accurately an explanation reflects the model's reasoning. Higher correctness is assumed to produce better human understanding, but this link has not been tested experimentally at controlled levels of correctness. We conducted a user study (N=200) that manipulated explanation correctness at four levels (100%, 85%, 70%, 55%) in a time series classification task. Participants could not rely on domain knowledge or visual intuition and instead predicted the AI's decisions based on the explanations (forward simulation). Correctness affected understanding, but not at every level: performance dropped at 70% and 55% correctness relative to fully correct explanations, while further degradation below 70% produced no additional loss. Rather than shifting performance uniformly, lower correctness decreased the proportion of participants who learned the decision pattern. At the same time, even fully correct explanations did not guarantee understanding, as only a subset of participants achieved high accuracy. Exploratory analyses showed that self-reported ratings correlated with demonstrated performance only when explanations were fully correct and participants had learned the pattern. These findings show that not all differences in functional correctness translate to differences in human understanding, underscoring the need to validate functional metrics against human outcomes.
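The two outcome measures described above, forward simulation accuracy and the proportion of participants who learned the decision pattern, can be illustrated with a minimal sketch. This is not the study's analysis code. The function names, the 0.8 learner threshold, and the toy data are all assumptions introduced here for illustration; the source does not state how "learning the pattern" was operationalized.

```python
def forward_simulation_accuracy(participant_predictions, model_decisions):
    """Fraction of trials on which a participant correctly predicted the AI's decision."""
    if len(participant_predictions) != len(model_decisions):
        raise ValueError("predictions and decisions must have the same length")
    if not model_decisions:
        raise ValueError("at least one trial is required")
    matches = sum(p == m for p, m in zip(participant_predictions, model_decisions))
    return matches / len(model_decisions)


def proportion_of_learners(accuracies_by_condition, threshold=0.8):
    """Share of participants per condition whose accuracy reaches the threshold.

    The threshold is a hypothetical cutoff, not a value taken from the study.
    """
    return {
        condition: sum(acc >= threshold for acc in accuracies) / len(accuracies)
        for condition, accuracies in accuracies_by_condition.items()
    }


# Toy data, invented purely to show the calculation (not study results).
one_participant = forward_simulation_accuracy(
    ["A", "B", "A", "A"],  # what the participant predicted the AI would decide
    ["A", "B", "B", "A"],  # what the AI actually decided
)
toy_accuracies = {
    "100%": [0.9, 0.5, 0.85, 0.95],
    "55%": [0.5, 0.55, 0.9, 0.45],
}
print(one_participant)
print(proportion_of_learners(toy_accuracies))
```

Reporting the share of participants above a cutoff, rather than only the mean accuracy per condition, is one way to separate a uniform shift in performance from a change in how many participants learned the pattern.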