Large language models (LLMs) have demonstrated remarkable predictive performance on a growing array of tasks. However, their rapid proliferation and increasing opacity have created a growing need for interpretability. Here, we ask whether we can automatically obtain natural language explanations for black box text modules. A "text module" is any function that maps text to a continuous scalar value, such as a submodule within an LLM or a fitted model of a brain region. "Black box" indicates that we only have access to the module's inputs and outputs. We introduce Summarize and Score (SASC), a method that takes in a text module and returns a natural language explanation of the module's selectivity along with a score for how reliable the explanation is. We study SASC in three contexts. First, we evaluate SASC on synthetic modules and find that it often recovers ground truth explanations. Second, we use SASC to explain modules found within a pre-trained BERT model, enabling inspection of the model's internals. Finally, we show that SASC can generate explanations for the response of individual fMRI voxels to language stimuli, with potential applications to fine-grained brain mapping. All code for using SASC and reproducing results is made available on GitHub.
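To make the "black box text module" definition concrete, the following is a minimal Python sketch of the interface only. It is not the SASC implementation: the names `TextModule`, `FOOD_WORDS`, `toy_food_module`, and `rank_by_response` are hypothetical, and the keyword-based module is an assumed stand-in for a synthetic module with a known ground-truth explanation.

```python
from typing import Callable, List, Tuple

# A "text module" in the abstract's sense: any function from text to a scalar.
TextModule = Callable[[str], float]

# Hypothetical keyword set standing in for a ground-truth explanation ("food").
FOOD_WORDS = {"bread", "cheese", "apple", "soup"}


def toy_food_module(text: str) -> float:
    """Toy synthetic module (assumption): fraction of words that are food-related."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(w in FOOD_WORDS for w in words) / len(words)


def rank_by_response(module: TextModule, texts: List[str]) -> List[Tuple[str, float]]:
    """Probe the module through inputs and outputs only, sorting texts by response."""
    scored = [(t, module(t)) for t in texts]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


candidates = ["fresh bread", "the stock market", "apple soup", "a long walk"]
ranking = rank_by_response(toy_food_module, candidates)
```

Because `rank_by_response` only ever calls the module, the same probing code would apply unchanged to any other callable with this signature, which is the sense in which the module is treated as a black box.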