Explaining Language Models' Predictions with High-Impact Concepts

The emergence of large-scale pretrained language models has posed unprecedented challenges in explaining why a model makes a given prediction. Stemming from the compositional nature of language, spurious correlations have further undermined the trustworthiness of NLP systems, leading to unreliable model explanations that are merely correlated with the output predictions. To encourage fairness and transparency, there is an urgent demand for reliable explanations that allow users to consistently understand the model's behavior. In this work, we propose a complete framework for extending concept-based interpretability methods to NLP. Specifically, we propose a post-hoc interpretability method for extracting predictive high-level features (concepts) from the pretrained model's hidden layer activations. We optimize for features whose existence causes the output predictions to change substantially, i.e., features that generate a high impact. Moreover, we devise several evaluation metrics that can be universally applied. Extensive experiments on real and synthetic tasks demonstrate that our method achieves superior results on predictive impact, usability, and faithfulness compared to the baselines.
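To make the notion of a concept's "impact" concrete, the following is a minimal sketch and not the paper's actual objective or implementation. It assumes a hypothetical linear classification head over a single hidden vector, represents a concept as a direction in activation space, and scores the concept by how much the predicted class probability drops when that direction is projected out. All names (`predict`, `remove_concept`, `concept_impact`) and the toy weights are illustrative assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def predict(hidden, weights):
    """Hypothetical linear head: one weight vector per class."""
    return softmax([dot(w, hidden) for w in weights])


def remove_concept(hidden, concept):
    """Project the concept direction out of the hidden activation."""
    coef = dot(hidden, concept) / dot(concept, concept)
    return [h - coef * c for h, c in zip(hidden, concept)]


def concept_impact(hidden, concept, weights):
    """Drop in the originally predicted class probability after ablation."""
    before = predict(hidden, weights)
    after = predict(remove_concept(hidden, concept), weights)
    label = max(range(len(before)), key=before.__getitem__)
    return before[label] - after[label]


# Toy setup (assumed): the head only reads the first hidden dimension.
weights = [[2.0, 0.0], [-2.0, 0.0]]
hidden = [1.0, 3.0]

impact_a = concept_impact(hidden, [1.0, 0.0], weights)  # direction the head uses
impact_b = concept_impact(hidden, [0.0, 1.0], weights)  # direction the head ignores
print(impact_a, impact_b)
```

In this toy setup, removing the first direction pushes the prediction toward chance, while removing the second leaves it unchanged, even though the second direction has the larger activation. A large activation alone therefore does not imply a large effect on the output, which is the distinction between merely correlated and high-impact features that the abstract draws.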
