Mechanistic interpretability aims to understand the inner workings of large neural networks by identifying circuits: minimal subgraphs within the model that implement the algorithms responsible for specific tasks. These circuits are typically discovered and analyzed using a narrowly defined prompt format. However, given the ability of large language models (LLMs) to generalize across prompt formats for the same task, it remains unclear how well these circuits generalize. For instance, it is unclear whether the model's generalization results from reusing the same circuit components, from those components behaving differently, or from the use of entirely different components. In this paper, we investigate the generality of the indirect object identification (IOI) circuit in GPT-2 small, which is well studied and believed to implement a simple, interpretable algorithm. We evaluate its performance on prompt variants that challenge the assumptions of this algorithm. Our findings reveal that the circuit generalizes surprisingly well, reusing all of its components and mechanisms while adding only extra input edges. Notably, the circuit generalizes even to prompt variants where the original algorithm should fail; we discover a mechanism that explains this behavior, which we term S2 Hacking. Our findings indicate that circuits within LLMs may be more flexible and general than previously recognized, underscoring the importance of studying circuit generalization to better understand the broader capabilities of these models.