The path to interpreting a language model often proceeds via analysis of circuits -- sparse computational subgraphs of the model that capture specific aspects of its behavior. Recent work has automated the task of discovering circuits. Yet, these methods have practical limitations, as they rely either on inefficient search algorithms or inaccurate approximations. In this paper, we frame automated circuit discovery as an optimization problem and propose *Edge Pruning* as an effective and scalable solution. Edge Pruning leverages gradient-based pruning techniques, but instead of removing neurons or components, it prunes the \emph{edges} between components. Our method finds circuits in GPT-2 that use fewer than half as many edges as circuits found by previous methods, while being equally faithful to the full model's predictions on standard circuit-finding tasks. Edge Pruning remains efficient even with as many as 100K examples, outperforming previous methods in speed and producing substantially better circuits. It also perfectly recovers the ground-truth circuits in two models compiled with Tracr. Thanks to its efficiency, we scale Edge Pruning to CodeLlama-13B, a model over 100x the scale of those prior methods operate on. We use this setting for a case study comparing the mechanisms behind instruction prompting and in-context learning. We find two circuits with more than 99.96% sparsity that match the performance of the full model, and reveal that the mechanisms in the two settings overlap substantially. Our case study shows that Edge Pruning is a practical and scalable tool for interpretability and sheds light on behaviors that only emerge in large models.
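To make the core idea concrete, the following is a minimal toy sketch (not the paper's implementation) of what pruning *edges* rather than components means: each pair of components is connected by a mask value that gates how much of the upstream component's output the downstream component reads. All names (`forward`, `z_full`, `z_circuit`) and the toy linear components are illustrative assumptions; in practice these masks would be learned with gradient-based optimization against a faithfulness objective plus a sparsity penalty.

```python
import numpy as np

# Toy illustration only: a "model" whose components write to a shared
# residual stream. z[i, j] in [0, 1] is a mask on the edge from
# component i's output to component j's input (index 0 = embedding).
rng = np.random.default_rng(0)
d = 4                                           # residual-stream width
W = [rng.standard_normal((d, d)) * 0.1 for _ in range(3)]  # toy components

def forward(x, z):
    """Run the toy model with edge masks z.

    Component j reads a masked sum of all earlier outputs, so zeroing
    z[i, j] ablates a single edge without removing either component --
    the distinction between edge pruning and node/component pruning.
    """
    outs = [x]                                  # outs[0]: embedding output
    for j, Wj in enumerate(W, start=1):
        inp = sum(z[i, j] * outs[i] for i in range(j))
        outs.append(inp @ Wj)
    return sum(outs)                            # residual-stream readout

n = len(W) + 1
z_full = np.ones((n, n))                        # full model: every edge on
z_circuit = np.zeros((n, n))                    # a sparse "circuit":
z_circuit[0, 1] = z_circuit[1, 3] = 1.0         # keep only 2 of the edges

x = rng.standard_normal(d)
print("edges kept:", int(z_circuit.sum()), "of", n * (n - 1) // 2)
```

In the actual method, the discrete masks are relaxed to continuous, learnable parameters so that gradients can flow through them; the sketch above only shows the forward computation that such masks gate.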