Neural network models have achieved high performance on a wide variety of complex tasks, but the algorithms that they implement are notoriously difficult to interpret. To understand these algorithms, it is often necessary to hypothesize intermediate variables involved in the network's computation. For example, does a language model depend on particular syntactic properties when generating a sentence? Existing analysis tools make it difficult to test hypotheses of this type. We propose a new analysis technique, circuit probing, that automatically uncovers low-level circuits that compute hypothesized intermediate variables. This enables causal analysis through targeted ablation at the level of model parameters. We apply this method to models trained on simple arithmetic tasks, demonstrating its effectiveness at (1) deciphering the algorithms that models have learned, (2) revealing modular structure within a model, and (3) tracking the development of circuits over training. We compare circuit probing to other methods across these three experiments and find it on par with or more effective than existing analysis methods. Finally, we demonstrate circuit probing on a real-world use case, uncovering circuits that are responsible for subject-verb agreement and reflexive anaphora in GPT2-Small and GPT2-Medium.
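To make "targeted ablation at the level of model parameters" concrete, the toy sketch below zeroes a masked subset of weights in a single linear layer and compares the outputs before and after. It is an illustration only, not the paper's implementation: the helper names `linear` and `ablate`, the hand-written mask, and the tiny weight matrix are all assumptions made for this example, and how circuit probing actually finds the mask is not shown.

```python
def linear(x, weights):
    """Compute y = x @ weights, with weights given as a list of rows (inputs x outputs)."""
    n_out = len(weights[0])
    return [sum(x[i] * weights[i][j] for i in range(len(x))) for j in range(n_out)]


def ablate(weights, circuit_mask):
    """Return a copy of weights with every entry whose mask value is True set to zero."""
    return [
        [0.0 if masked else w for w, masked in zip(w_row, m_row)]
        for w_row, m_row in zip(weights, circuit_mask)
    ]


# Hypothetical layer: 3 inputs, 2 output units.
W = [[1.0, 2.0],
     [3.0, 4.0],
     [5.0, 6.0]]

# Hypothetical "circuit": all weights feeding output unit 0 (hand-picked here,
# whereas a real analysis would have to discover which weights belong to it).
mask = [[True, False],
        [True, False],
        [True, False]]

x = [1.0, 1.0, 1.0]
full = linear(x, W)                   # output of the intact layer
ablated = linear(x, ablate(W, mask))  # output after removing the circuit

print(full, ablated)
```

In this toy setting, ablating the masked weights silences output unit 0 and leaves unit 1 unchanged, which is the kind of selective effect a causal ablation test looks for.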