Pretrained language models (PLMs) have made significant strides in various natural language processing tasks. However, their ``black-box'' nature limits interpretability, which poses challenges for responsible deployment. Although previous studies have attempted to improve interpretability by using, e.g., attention weights in self-attention layers, these weights often lack clarity, readability, and intuitiveness. In this research, we propose a novel approach to interpreting PLMs through high-level, meaningful concepts that humans can easily understand. For example, we learn the concept of ``Food'' and investigate how it influences a model's sentiment prediction for a restaurant review. We introduce C$^3$M, which combines human-annotated and machine-generated concepts to extract hidden neurons designed to encapsulate semantically meaningful and task-specific concepts. Through empirical evaluations on real-world datasets, we demonstrate that our approach offers valuable insights into PLM behavior, helps diagnose model failures, and enhances model robustness under noisy concept labels.