While much research has focused on producing explanations, it is still unclear how the quality of the produced explanations can be evaluated in a meaningful way. Today's predominant approach is to quantify explanation quality using proxy scores that compare explanations to (human-annotated) gold explanations. This approach assumes that explanations which reach higher proxy scores will also provide a greater benefit to human users. In this paper, we discuss problems with this approach. Concretely, we (i) formulate desired characteristics of explanation quality, (ii) describe how current evaluation practices violate them, and (iii) support our argumentation with initial evidence from a crowdsourcing case study in which we investigate the explanation quality of state-of-the-art explainable question answering systems. We find that proxy scores correlate poorly with human quality ratings and, additionally, become less expressive the more often they are used (i.e., they follow Goodhart's law). Finally, we propose guidelines to enable a meaningful evaluation of explanations and thereby drive the development of systems that provide tangible benefits to human users.