Feature attribution methods are popular for explaining neural network predictions, and they are often evaluated on metrics such as comprehensiveness and sufficiency. In this paper, we highlight an intriguing property of these metrics: their solvability. Concretely, we can define the problem of optimizing an explanation for a metric, and this problem can be solved by beam search. This observation leads to an obvious yet unaddressed question: if the metric value represents explanation quality, why do we use explainers (e.g., LIME) that are not based on solving the target metric? We present a series of investigations showing strong performance of this beam search explainer and discuss its broader implication: a definition-evaluation duality of interpretability concepts. We implement the explainer and release the Python solvex package for models in the text, image, and tabular domains.
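To make the idea of "solving" a metric concrete, the following is a minimal sketch of beam search over feature-removal orderings. It is not the solvex API and not necessarily the authors' exact formulation: the `comprehensiveness` definition used here (mean drop in model output as features are replaced by a baseline one at a time), the zero baseline, the function names, and the toy linear model are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence, Tuple

Model = Callable[[Sequence[float]], float]


def comprehensiveness(model: Model, x: Sequence[float],
                      order: Sequence[int], baseline: float = 0.0) -> float:
    """Assumed metric: mean drop in model output as the features in
    `order` are replaced by `baseline`, one at a time, cumulatively."""
    full = model(x)
    masked = list(x)
    drops = []
    for i in order:
        masked[i] = baseline
        drops.append(full - model(masked))
    return sum(drops) / len(drops) if drops else 0.0


def beam_search_order(model: Model, x: Sequence[float],
                      beam_size: int = 3, baseline: float = 0.0) -> List[int]:
    """Search for a removal ordering with high comprehensiveness.

    At each step every partial ordering in the beam is extended by one
    unused feature, and only the `beam_size` best extensions are kept.
    """
    n = len(x)
    beams: List[Tuple[int, ...]] = [()]
    for _ in range(n):
        candidates = []
        for prefix in beams:
            for i in range(n):
                if i not in prefix:
                    cand = prefix + (i,)
                    score = comprehensiveness(model, x, cand, baseline)
                    candidates.append((score, cand))
        # All candidates at a given step have the same length, so their
        # scores are directly comparable.
        candidates.sort(key=lambda t: t[0], reverse=True)
        beams = [cand for _, cand in candidates[:beam_size]]
    return list(beams[0])


# Hypothetical toy model: a linear scorer with hand-picked weights.
WEIGHTS = [1.0, -2.0, 4.0, 0.5]


def toy_model(features: Sequence[float]) -> float:
    return sum(w * v for w, v in zip(WEIGHTS, features))


x = [1.0, 1.0, 1.0, 1.0]
order = beam_search_order(toy_model, x)
print(order)                                  # [2, 0, 3, 1]
print(comprehensiveness(toy_model, x, order))  # 4.5
```

For this linear toy model the search recovers the ordering by descending weight, which is what the assumed metric rewards. For a real network, each call to `model` is a forward pass, so the beam size trades off search quality against compute.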