Transformer architectures are complex, and although their use in NLP has produced many successes, it makes interpreting or explaining their behavior challenging. Recent debates have shown that attention maps and attribution methods are unreliable (Pruthi et al., 2019; Brunner et al., 2019). In this paper, we present some of the limitations of these methods and introduce COCKATIEL, which successfully addresses some of them. COCKATIEL is a novel, post-hoc, concept-based, model-agnostic XAI technique that generates meaningful explanations from the last layer of a neural network trained on an NLP classification task. It uses Non-Negative Matrix Factorization (NMF) to discover the concepts the model leverages to make predictions, and a Sensitivity Analysis to accurately estimate the importance of each of these concepts for the model. It does so without compromising the accuracy of the underlying model or requiring a new one to be trained. We conduct experiments on single- and multi-aspect sentiment analysis tasks and show COCKATIEL's superior ability to discover, without any supervision, concepts in Transformer models that align with those of humans. We objectively verify the faithfulness of its explanations through fidelity metrics, and we showcase its ability to provide meaningful explanations on two different datasets.
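To make the two ingredients named in the abstract concrete, the following is a minimal, dependency-free sketch and not the authors' implementation. It assumes the last-layer activations are available as a non-negative matrix `A` (one row per input), factorizes it as `A ≈ U W` with the standard multiplicative-update NMF rules, and then scores each concept by zeroing its coefficients and measuring the change in a classifier head's output. That masking score is a deliberately simple stand-in for the Sensitivity Analysis mentioned above, whose exact estimator is not specified here. The toy data, `toy_head`, and all function names are hypothetical.

```python
import random

def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    Yt = list(zip(*Y))
    return [[sum(a * b for a, b in zip(row, col)) for col in Yt] for row in X]

def transpose(X):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*X)]

def frob_error(A, U, W):
    """Frobenius norm of the reconstruction residual A - U W."""
    R = matmul(U, W)
    return sum((a - r) ** 2 for ra, rr in zip(A, R) for a, r in zip(ra, rr)) ** 0.5

def nmf(A, k, n_iter=200, seed=0, eps=1e-9):
    """Factorize non-negative A (n x p) as U (n x k) times W (k x p).

    Uses multiplicative updates, which keep every entry non-negative.
    Rows of W play the role of concepts; rows of U say how strongly
    each input expresses each concept.
    """
    rng = random.Random(seed)
    n, p = len(A), len(A[0])
    U = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    W = [[rng.random() + 0.1 for _ in range(p)] for _ in range(k)]
    for _ in range(n_iter):
        # U <- U * (A W^T) / (U W W^T)
        Wt = transpose(W)
        num, den = matmul(A, Wt), matmul(U, matmul(W, Wt))
        U = [[u * a / (d + eps) for u, a, d in zip(ru, ra, rd)]
             for ru, ra, rd in zip(U, num, den)]
        # W <- W * (U^T A) / (U^T U W)
        Ut = transpose(U)
        num, den = matmul(Ut, A), matmul(matmul(Ut, U), W)
        W = [[w * a / (d + eps) for w, a, d in zip(rw, ra, rd)]
             for rw, ra, rd in zip(W, num, den)]
    return U, W

def concept_importance(U, W, head):
    """Mean absolute change in the head's output when one concept is zeroed.

    Assumption: a simplified masking score, used here only to illustrate
    the idea of perturbing concept coefficients and observing the output.
    """
    base = [head(row) for row in matmul(U, W)]
    scores = []
    for j in range(len(W)):
        U_masked = [[0.0 if c == j else u for c, u in enumerate(row)] for row in U]
        out = [head(row) for row in matmul(U_masked, W)]
        scores.append(sum(abs(b - o) for b, o in zip(base, out)) / len(base))
    return scores

def toy_head(activation):
    """Hypothetical linear classifier head acting on a 4-dimensional activation."""
    weights = [1.0, -1.0, 1.0, -1.0]
    return sum(w * x for w, x in zip(weights, activation))

# Toy "activations": 20 inputs mixing two ground-truth concepts.
rng = random.Random(1)
W_true = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
U_true = [[rng.random(), rng.random()] for _ in range(20)]
A = matmul(U_true, W_true)

U0, W0 = nmf(A, k=2, n_iter=0)   # random initialization only
U, W = nmf(A, k=2)               # after the multiplicative updates
scores = concept_importance(U, W, toy_head)
print("reconstruction error:", frob_error(A, U0, W0), "->", frob_error(A, U, W))
print("concept importance:", scores)
```

In practice one would likely extract activations from a real Transformer and use an established NMF implementation; the sketch only illustrates why the approach is post-hoc: the model itself is never retrained, only its activations are decomposed and perturbed.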