Labeling neural network submodules with human-legible descriptions is useful for many downstream tasks: such descriptions can surface failures, guide interventions, and perhaps even explain important model behaviors. To date, most mechanistic descriptions of trained networks have involved small models, narrowly delimited phenomena, and large amounts of human labor. Labeling all human-interpretable sub-computations in models of increasing size and complexity will almost certainly require tools that can generate and validate descriptions automatically. Recently, techniques that use learned models in the loop for labeling have begun to gain traction, but methods for evaluating their efficacy are limited and ad hoc. How should we validate and compare open-ended labeling tools? This paper introduces FIND (Function INterpretation and Description), a benchmark suite for evaluating the building blocks of automated interpretability methods. FIND contains functions that resemble components of trained neural networks, and accompanying descriptions of the kind we seek to generate. The functions span textual and numeric domains, and involve a range of real-world complexities. We evaluate methods that use pretrained language models (LMs) to produce descriptions of function behavior in natural language and code. Additionally, we introduce a new interactive method in which an Automated Interpretability Agent (AIA) generates function descriptions. We find that an AIA, built from an LM with black-box access to functions, can infer function structure, acting as a scientist by forming hypotheses, proposing experiments, and updating descriptions in light of new data. However, AIA descriptions tend to capture global function behavior and miss local details. These results suggest that FIND will be useful for evaluating more sophisticated interpretability methods before they are applied to real-world models.