End-to-end neural Natural Language Processing (NLP) models are notoriously difficult to understand. This has given rise to numerous efforts towards model explainability in recent years. One desideratum of model explanation is faithfulness, i.e. an explanation should accurately represent the reasoning process behind the model's prediction. In this survey, we review over 110 model explanation methods in NLP through the lens of faithfulness. We first discuss the definition and evaluation of faithfulness, as well as its significance for explainability. We then introduce recent advances in faithful explanation, grouping existing approaches into five categories: similarity-based methods, analysis of model-internal structures, backpropagation-based methods, counterfactual intervention, and self-explanatory models. For each category, we synthesize its representative studies, strengths, and weaknesses. Finally, we summarize their common virtues and remaining challenges, and reflect on future work directions towards faithful explainability in NLP.