Despite the rising popularity of saliency-based explanations, the research community remains at an impasse, facing doubts concerning their purpose, their efficacy, and their tendency to contradict one another. Seeking to unite the community's efforts around common goals, several recent works have proposed evaluation metrics. In this paper, we critically examine two sets of metrics: the ERASER metrics (comprehensiveness and sufficiency) and the EVAL-X metrics, focusing our inquiry on natural language processing. First, we show that we can inflate a model's comprehensiveness and sufficiency scores dramatically without altering its predictions or explanations on in-distribution test inputs. Our strategy exploits the tendency for extracted explanations and their complements to be "out-of-support" relative to each other and to in-distribution inputs. Next, we demonstrate that the EVAL-X metrics can be inflated arbitrarily by a simple method that encodes the label, even though EVAL-X was motivated precisely by the need to address such exploits. Our results raise doubts about the ability of current metrics to guide explainability research, underscoring the need for a broader reassessment of what precisely these metrics are intended to capture.
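For readers unfamiliar with the two ERASER metrics named above, the following is a minimal sketch of how they are commonly defined: comprehensiveness is the drop in predicted-class probability when the explanation tokens are removed, and sufficiency is the drop when only the explanation tokens are kept. The names `ProbFn`, `toy_prob`, and the token-index representation of an explanation are illustrative assumptions, not taken from the paper.

```python
from typing import Callable, Iterable, List, Sequence

# Assumed interface: maps a token sequence to the model's probability
# for the originally predicted class. Hypothetical, for illustration only.
ProbFn = Callable[[Sequence[str]], float]


def comprehensiveness(prob: ProbFn, tokens: Sequence[str],
                      rationale_idx: Iterable[int]) -> float:
    """p(y | x) - p(y | x with the explanation tokens removed)."""
    keep = set(rationale_idx)
    complement: List[str] = [t for i, t in enumerate(tokens) if i not in keep]
    return prob(tokens) - prob(complement)


def sufficiency(prob: ProbFn, tokens: Sequence[str],
                rationale_idx: Iterable[int]) -> float:
    """p(y | x) - p(y | explanation tokens only)."""
    keep = set(rationale_idx)
    rationale: List[str] = [t for i, t in enumerate(tokens) if i in keep]
    return prob(tokens) - prob(rationale)


def toy_prob(tokens: Sequence[str]) -> float:
    # Toy stand-in for a classifier: confident only when "great" is present.
    return 0.9 if "great" in tokens else 0.5


tokens = ["the", "movie", "was", "great"]
comp = comprehensiveness(toy_prob, tokens, [3])
suff = sufficiency(toy_prob, tokens, [3])
print(f"comprehensiveness={comp:.2f} sufficiency={suff:.2f}")
```

Note that both metrics query the model on the explanation alone or on its complement. These are the inputs the abstract describes as "out-of-support" relative to in-distribution inputs, so a model's behavior on them can change without affecting its predictions on ordinary test inputs.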