Explanations of neural models aim to reveal a model's decision-making process for its predictions. However, recent work shows that current explanation methods, such as saliency maps or counterfactuals, can be misleading, as they are prone to presenting reasons that are unfaithful to the model's inner workings. This work explores the challenging question of evaluating the faithfulness of natural language explanations (NLEs). To this end, we present two tests. First, we propose a counterfactual input editor that inserts reasons which lead to counterfactual predictions but are not reflected by the NLEs. Second, we reconstruct inputs from the reasons stated in the generated NLEs and check how often they lead to the same predictions. Our tests can evaluate emerging NLE models, providing a fundamental tool in the development of faithful NLEs.
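To make the two tests concrete, the following is a minimal sketch of how their scores could be computed. It is not the implementation used in this work: the `predict`, `explain`, and `reconstruct` callables, the toy keyword model, and the substring check for whether an inserted reason is "reflected" in an NLE are all illustrative assumptions.

```python
from typing import Callable, Iterable, Sequence, Tuple

# Hypothetical interfaces (assumptions, not from the paper):
#   predict(text) -> label, explain(text) -> NLE string,
#   reconstruct(nle) -> new input built from the reasons stated in the NLE.
Predict = Callable[[str], str]
Explain = Callable[[str], str]
Edit = Tuple[str, str, Sequence[str]]  # (original, edited, inserted words)


def counterfactual_unfaithful_rate(
    edits: Iterable[Edit], predict: Predict, explain: Explain
) -> float:
    """Among edits that flip the prediction, the share whose NLE mentions
    none of the inserted words (assumed proxy for 'not reflected')."""
    changed = 0
    unfaithful = 0
    for original, edited, inserted in edits:
        if predict(original) == predict(edited):
            continue  # the edit did not produce a counterfactual prediction
        changed += 1
        nle = explain(edited).lower()
        if not any(word.lower() in nle for word in inserted):
            unfaithful += 1
    return unfaithful / changed if changed else 0.0


def reconstruction_consistency_rate(
    inputs: Iterable[str],
    predict: Predict,
    explain: Explain,
    reconstruct: Callable[[str], str],
) -> float:
    """Share of inputs whose NLE-based reconstruction keeps the prediction."""
    items = list(inputs)
    if not items:
        return 0.0
    same = sum(predict(reconstruct(explain(x))) == predict(x) for x in items)
    return same / len(items)


# Toy stand-ins: the classifier keys on "great", the explainer never says so.
def toy_predict(text: str) -> str:
    return "positive" if "great" in text.lower() else "negative"


def toy_explain(text: str) -> str:
    return "The review mentions the food."


toy_edits = [
    ("The food was served.", "The great food was served.", ["great"]),
    ("The food was cold.", "The food was cold today.", ["today"]),
]
toy_inputs = ["The great food was served.", "The food was cold."]

cf_rate = counterfactual_unfaithful_rate(toy_edits, toy_predict, toy_explain)
rc_rate = reconstruction_consistency_rate(
    toy_inputs, toy_predict, toy_explain, reconstruct=lambda nle: nle
)
print(cf_rate)  # 1.0: the only prediction-flipping edit is absent from the NLE
print(rc_rate)  # 0.5: one of two reconstructions keeps the original prediction
```

In this toy setting, a high score on the first measure and a low score on the second both indicate unfaithful explanations. A real evaluation would replace the toy functions with a trained NLE model, a learned input editor, and a task-specific reconstruction procedure.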