Neural network models have achieved high performance on a wide variety of complex tasks, but the algorithms that they implement are notoriously difficult to interpret. To understand these algorithms, it is often necessary to hypothesize intermediate variables involved in the network's computation. For example, does a language model depend on particular syntactic properties when generating a sentence? However, existing analysis tools make it difficult to test hypotheses of this type. We propose a new analysis technique, circuit probing, that automatically uncovers low-level circuits that compute hypothesized intermediate variables. This enables causal analysis through targeted ablation at the level of model parameters. We apply this method to models trained on simple arithmetic tasks, demonstrating its effectiveness at (1) deciphering the algorithms that models have learned, (2) revealing modular structure within a model, and (3) tracking the development of circuits over training. We compare circuit probing to other methods across these three experiments and find it on par with or more effective than existing analysis methods. Finally, we demonstrate circuit probing on a real-world use case, uncovering circuits that are responsible for subject-verb agreement and reflexive anaphora in GPT2-Small and GPT2-Medium.
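To make "targeted ablation at the level of model parameters" concrete, the following is a minimal toy sketch, not the procedure used in this work. It assumes a single linear layer and a hand-specified binary mask marking the weights of a hypothesized circuit; in circuit probing the circuit is uncovered automatically, which this sketch does not attempt. The names `ablate`, `forward`, `W`, and `circuit` are hypothetical and chosen only for illustration.

```python
def ablate(weights, circuit_mask):
    """Return a copy of `weights` with every masked entry set to zero."""
    return [[0.0 if masked else w for w, masked in zip(w_row, m_row)]
            for w_row, m_row in zip(weights, circuit_mask)]


def forward(weights, x):
    """Apply a bias-free linear layer: one dot product per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


# Toy layer (assumption): output 0 computes x0 + x1, output 1 copies x2.
W = [[1.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]

# Hand-specified mask (assumption): the weights hypothesized to compute x0 + x1.
circuit = [[True, True, False],
           [False, False, False]]

x = [2.0, 3.0, 7.0]
full_out = forward(W, x)                      # unablated layer
ablated_out = forward(ablate(W, circuit), x)  # hypothesized circuit removed

print("full:   ", full_out)
print("ablated:", ablated_out)
```

In this toy setting, zeroing the masked weights changes only the output that computes the sum and leaves the other output untouched. A selective effect of this kind is what a causal test of a hypothesized intermediate variable looks for.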