Natural language is an appealing medium for explaining how large language models process and store information, but evaluating the faithfulness of such explanations is challenging. To help address this, we develop two modes of evaluation for natural language explanations that claim individual neurons represent a concept in a text input. In the observational mode, we evaluate claims that a neuron $a$ activates on all and only input strings that refer to a concept picked out by the proposed explanation $E$. In the intervention mode, we construe $E$ as a claim that the neuron $a$ is a causal mediator of the concept denoted by $E$. We apply our framework to the GPT-4-generated explanations of GPT-2 XL neurons of Bills et al. (2023) and show that even the most confident explanations have high error rates and little to no causal efficacy. We close the paper by critically assessing whether natural language is a good choice for explanations and whether neurons are the best level of analysis.
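As a rough illustration of the observational mode, the "all and only" claim can be split into two measurable parts: "only" corresponds to precision (when the neuron fires, does the input refer to the concept?) and "all" corresponds to recall (when the input refers to the concept, does the neuron fire?). The sketch below is a minimal example under stated assumptions, not the paper's evaluation code: the function name `observational_scores`, the binarization of activations with a `threshold`, and the toy data are all hypothetical.

```python
from typing import Sequence, Tuple


def observational_scores(
    activations: Sequence[float],
    refers_to_concept: Sequence[bool],
    threshold: float = 0.0,
) -> Tuple[float, float]:
    """Return (precision, recall) of 'neuron fires' as a predictor of
    'input refers to the concept named by the explanation E'.

    Assumption: a neuron "fires" when its activation exceeds `threshold`.
    """
    if len(activations) != len(refers_to_concept):
        raise ValueError("activations and labels must have the same length")

    fires = [a > threshold for a in activations]

    # Count agreement and both kinds of disagreement.
    tp = sum(f and c for f, c in zip(fires, refers_to_concept))
    fp = sum(f and not c for f, c in zip(fires, refers_to_concept))
    fn = sum(c and not f for f, c in zip(fires, refers_to_concept))

    # "Only": of the inputs where the neuron fires, how many match E?
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # "All": of the inputs that match E, on how many does the neuron fire?
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy data (hypothetical): four inputs, two of which refer to the concept.
toy_activations = [0.9, 0.0, 0.7, 0.0]
toy_labels = [True, True, False, False]
precision, recall = observational_scores(toy_activations, toy_labels, threshold=0.5)
```

An explanation that held perfectly in the observational sense would score 1.0 on both measures. Low precision means the neuron also fires on inputs outside the concept, and low recall means it misses inputs inside it.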