The emergence of large-scale pretrained language models has posed unprecedented challenges in explaining why a model makes a given prediction. Stemming from the compositional nature of language, spurious correlations have further undermined the trustworthiness of NLP systems, leading to unreliable model explanations that are merely correlated with the output predictions. To encourage fairness and transparency, there is an urgent need for reliable explanations that allow users to consistently understand the model's behavior. In this work, we propose a complete framework for extending concept-based interpretability methods to NLP. Specifically, we propose a post-hoc interpretability method for extracting predictive high-level features (concepts) from the pretrained model's hidden layer activations. We optimize for features whose existence causes the output predictions to change substantially, i.e., features that have a high impact. Moreover, we devise several evaluation metrics that can be applied universally. Extensive experiments on real and synthetic tasks demonstrate that our method achieves superior results on predictive impact, usability, and faithfulness compared to the baselines.
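To make the notion of "impact" concrete, the following is a minimal sketch and not the method proposed in this work. It assumes, purely for illustration, that a concept is a single direction in the hidden space, that the prediction head is a linear softmax layer, and that impact is measured as the mean change in the predicted-class probability when that direction is projected out of the activations. The names `concept_impact`, `H`, `W`, `b`, and the toy data are all hypothetical.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def concept_impact(H, W, b, concept):
    """Mean absolute change in predicted-class probability when the
    (assumed linear) concept direction is removed from activations H."""
    c = concept / np.linalg.norm(concept)
    # Project each activation onto the subspace orthogonal to the concept.
    H_ablated = H - np.outer(H @ c, c)
    p = softmax(H @ W + b)
    p_ablated = softmax(H_ablated @ W + b)
    pred = p.argmax(axis=1)          # originally predicted class per example
    rows = np.arange(len(H))
    return float(np.mean(np.abs(p[rows, pred] - p_ablated[rows, pred])))

# Toy setup (assumption): the head reads only hidden dimension 0.
rng = np.random.default_rng(0)
H = rng.normal(size=(200, 8))        # stand-in for hidden-layer activations
W = np.zeros((8, 2))
W[0, 1] = 3.0
b = np.zeros(2)

impact_used = concept_impact(H, W, b, np.eye(8)[0])    # direction the head uses
impact_unused = concept_impact(H, W, b, np.eye(8)[1])  # direction the head ignores
```

In this toy setting, removing the direction the head relies on shifts the predictions, while removing an ignored direction leaves them unchanged. A direction that is merely correlated with the output but unused by the model would likewise score zero, which is the distinction between causal impact and correlation that the abstract draws.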