Large language models (LLMs) have demonstrated remarkable prediction performance on a growing array of tasks. However, their rapid proliferation and increasing opacity have created a growing need for interpretability. Here, we ask whether we can automatically obtain natural language explanations for black box text modules. A "text module" is any function that maps text to a continuous scalar value, such as a submodule within an LLM or a fitted model of a brain region. "Black box" indicates that we only have access to the module's inputs and outputs. We introduce Summarize and Score (SASC), a method that takes in a text module and returns a natural language explanation of the module's selectivity, along with a score for how reliable the explanation is. We study SASC in three contexts. First, we evaluate SASC on synthetic modules and find that it often recovers ground truth explanations. Second, we use SASC to explain modules found within a pre-trained BERT model, enabling inspection of the model's internals. Finally, we show that SASC can generate explanations for the response of individual fMRI voxels to language stimuli, with potential applications to fine-grained brain mapping. All code for using SASC and reproducing results is available on GitHub.
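To make the notion of a black box text module concrete, the following is a minimal sketch of the interface described above: a function from text to a single real number, plus a reliability score for a candidate explanation. Everything named here is an assumption for illustration (`food_module`, `FOOD_WORDS`, `illustrative_score`, and the example sentences). The abstract does not state how SASC computes its score, so the rule below (the gap in mean response between explanation-related and unrelated text, in units of the pooled response spread) is a plausible stand-in, not the method itself.

```python
from statistics import mean, pstdev
from typing import Callable, Sequence

# A "text module" in the abstract's sense: text in, one real number out.
TextModule = Callable[[str], float]

# Hypothetical synthetic module: it responds to food-related words.
FOOD_WORDS = {"pizza", "bread", "apple", "soup", "cheese"}


def food_module(text: str) -> float:
    """Return the fraction of tokens that are food words (illustrative only)."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    hits = sum(token.strip(".,!?") in FOOD_WORDS for token in tokens)
    return hits / len(tokens)


def illustrative_score(
    module: TextModule,
    related: Sequence[str],
    unrelated: Sequence[str],
) -> float:
    """Assumed scoring rule, not SASC's published one.

    Queries the module only through its inputs and outputs, then returns the
    difference in mean response (related minus unrelated) divided by the
    population standard deviation of all responses.
    """
    related_responses = [module(text) for text in related]
    unrelated_responses = [module(text) for text in unrelated]
    spread = pstdev(related_responses + unrelated_responses)
    if spread == 0:
        return 0.0  # the module does not distinguish the two text sets
    return (mean(related_responses) - mean(unrelated_responses)) / spread


# Candidate explanation "food": texts related to it versus unrelated texts.
related_texts = ["I ate pizza and cheese.", "Fresh bread with apple soup."]
unrelated_texts = ["The train left at noon.", "She fixed the bicycle tire."]

score = illustrative_score(food_module, related_texts, unrelated_texts)
print(f"illustrative score for 'food': {score:.2f}")
```

Under this assumed rule, a larger positive score means the module responds more strongly to text matching the candidate explanation than to unrelated text, which is the sense in which an explanation would be considered more reliable.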