A fundamental question in interpretability research is to what extent neural networks, particularly language models, implement reusable functions via subnetworks that can be composed to perform more complex tasks. Recent work in mechanistic interpretability has made progress in identifying such subnetworks, often referred to as circuits: the minimal computational subgraphs responsible for a model's behavior on specific tasks. However, most studies identify circuits for individual tasks without investigating how functionally similar circuits relate to one another. To address this gap, we examine the modularity of neural networks by analyzing circuits for highly compositional subtasks within a transformer-based language model. Specifically, given a probabilistic context-free grammar, we identify and compare the circuits responsible for ten modular string-edit operations. Our results indicate that functionally similar circuits exhibit both notable node overlap and cross-task faithfulness. Moreover, we demonstrate that the identified circuits can be reused and combined through subnetwork set operations to represent more complex functional capabilities of the model.
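As a minimal sketch of the set-operation view described above (the component names and circuits below are hypothetical, not taken from the paper): if each circuit is represented as a set of model components, node overlap can be scored as intersection over union, and a composite circuit formed as a set union.

```python
# Minimal sketch (hypothetical identifiers): circuits as sets of component
# names (e.g. attention heads and MLP blocks), combined via set operations.

def iou(a: set, b: set) -> float:
    """Node overlap between two circuits: intersection over union."""
    return len(a & b) / len(a | b)

# Hypothetical circuits for two string-edit subtasks.
circuit_reverse = {"head.0.3", "head.1.5", "mlp.2"}
circuit_swap = {"head.0.3", "head.1.7", "mlp.2"}

overlap = iou(circuit_reverse, circuit_swap)   # shared-node score
composed = circuit_reverse | circuit_swap      # union as a composite circuit

print(round(overlap, 2))   # 0.5
print(sorted(composed))
```

Here the union is one simple choice for composition; the paper's actual subnetwork set operations and circuit-discovery method are not reproduced in this sketch.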