Mechanistic interpretability aims to reverse engineer neural networks by uncovering the high-level algorithms they implement. Causal abstraction provides a precise notion of when a network implements an algorithm: a causal model of the network contains low-level features that realize the high-level variables in a causal model of the algorithm. A common problem in practice is that the algorithm is not an entirely faithful abstraction of the network, meaning it only partially captures the network's true reasoning process. We propose a solution in which we combine several simple high-level models to produce a more faithful representation of the network. By learning this combination, we can model a neural network as entering different computational states depending on the input it receives, which we show gives a more accurate account of GPT-2 small fine-tuned on two toy tasks. We observe a trade-off between the strength of an interpretability hypothesis, which we define in terms of the number of inputs explained by the high-level models, and its faithfulness, which we define as the interchange intervention accuracy. Our method lets us modulate between the two, providing the most accurate combination of models that describes the network's behavior at a given faithfulness level.
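To make the faithfulness measure concrete, the sketch below illustrates an interchange intervention and the resulting interchange intervention accuracy (IIA) on a toy arithmetic network. The models, the `hidden` site, and the input pairs are illustrative assumptions for exposition, not the paper's actual setup or tasks.

```python
# Minimal sketch of interchange intervention accuracy (IIA), assuming a
# toy low-level "network" computing (a + b) * c through an intermediate
# sum, and a high-level causal model with one variable SUM = a + b.

def low_level_model(x, patched_hidden=None):
    """Toy network: computes (a + b) * c via an intermediate activation."""
    a, b, c = x
    hidden = a + b                  # candidate low-level realization of SUM
    if patched_hidden is not None:
        hidden = patched_hidden     # interchange intervention: patch the site
    return hidden * c, hidden

def high_level_model(x, patched_sum=None):
    """High-level algorithm with a single intermediate variable SUM."""
    a, b, c = x
    s = a + b
    if patched_sum is not None:
        s = patched_sum             # the analogous high-level intervention
    return s * c

def iia(pairs):
    """Fraction of (base, source) pairs where patching the low-level site
    with the source activation matches the intervened high-level model."""
    hits = 0
    for base, source in pairs:
        _, src_hidden = low_level_model(source)   # cache source activation
        patched_out, _ = low_level_model(base, patched_hidden=src_hidden)
        a, b, _ = source
        expected = high_level_model(base, patched_sum=a + b)
        hits += int(patched_out == expected)
    return hits / len(pairs)

pairs = [((1, 2, 3), (4, 5, 6)), ((0, 1, 2), (7, 8, 9))]
print(iia(pairs))  # 1.0 here: the toy network realizes SUM exactly
```

In this idealized case the IIA is 1.0 because the chosen site perfectly realizes the high-level variable; for a real network a single high-level model typically scores below 1.0, which is the partial faithfulness the combination method is meant to address.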