The path to interpreting a language model often proceeds via analysis of circuits -- sparse computational subgraphs of the model that capture specific aspects of its behavior. Recent work has automated the task of discovering circuits. Yet, these methods have practical limitations, as they rely either on inefficient search algorithms or on inaccurate approximations. In this paper, we frame automated circuit discovery as an optimization problem and propose *Edge Pruning* as an effective and scalable solution. Edge Pruning leverages gradient-based pruning techniques, but instead of removing neurons or components, it prunes the \emph{edges} between components. Our method finds circuits in GPT-2 that use fewer than half as many edges as circuits found by previous methods, while remaining equally faithful to the full model's predictions on standard circuit-finding tasks. Edge Pruning is efficient even with as many as 100K examples, outperforming previous methods in speed and producing substantially better circuits. It also perfectly recovers the ground-truth circuits in two models compiled with Tracr. Thanks to its efficiency, we scale Edge Pruning to CodeLlama-13B, a model over 100x the scale of those that prior methods operate on. We use this setting for a case study comparing the mechanisms behind instruction prompting and in-context learning. We find two circuits with more than 99.96% sparsity that match the performance of the full model, and reveal that the mechanisms in the two settings overlap substantially. Our case study shows that Edge Pruning is a practical and scalable tool for interpretability and sheds light on behaviors that only emerge in large models.
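The core idea behind edge-level pruning — attaching a learnable continuous mask to each edge between components and training the masks with a faithfulness objective plus a sparsity penalty — can be illustrated with a minimal toy sketch. All names and the toy objective below are hypothetical for illustration; this is not the paper's implementation, which operates on the full edge set of a transformer.

```python
import torch

# Toy sketch (hypothetical, not the paper's code): two "components" feed a
# third through separate edges. Each edge carries a learnable scalar mask in
# [0, 1]. Training trades off faithfulness to the unmasked ("full model")
# output against an L1 penalty on the masks, so unimportant edges are driven
# toward 0 and can be pruned away.

torch.manual_seed(0)

class EdgeMaskedBlock(torch.nn.Module):
    def __init__(self, dim: int, n_inputs: int):
        super().__init__()
        # One logit per incoming edge; a sigmoid gives a soft mask in [0, 1].
        self.edge_logits = torch.nn.Parameter(torch.zeros(n_inputs))
        self.proj = torch.nn.Linear(dim * n_inputs, dim)

    def forward(self, inputs):  # inputs: list of (batch, dim) tensors
        masks = torch.sigmoid(self.edge_logits)
        gated = [m * x for m, x in zip(masks, inputs)]  # gate each edge
        return self.proj(torch.cat(gated, dim=-1))

dim, batch = 8, 32
block = EdgeMaskedBlock(dim, n_inputs=2)
# Model weights stay fixed; only the edge masks are optimized.
opt = torch.optim.Adam([block.edge_logits], lr=0.1)

a = torch.randn(batch, dim)          # an edge that matters
b = torch.randn(batch, dim) * 1e-3   # a nearly irrelevant edge
with torch.no_grad():
    # "Full model" output: all edges active, no masking.
    target = block.proj(torch.cat([a, b], dim=-1))

for _ in range(200):
    opt.zero_grad()
    out = block([a, b])
    faithfulness = torch.nn.functional.mse_loss(out, target)
    sparsity = torch.sigmoid(block.edge_logits).sum()  # L1 on masks
    (faithfulness + 0.05 * sparsity).backward()
    opt.step()

final_masks = torch.sigmoid(block.edge_logits).detach()
# The mask on the near-zero edge from `b` is driven toward 0, while the
# mask on the informative edge from `a` stays high.
print(final_masks)
```

Discretizing the learned masks (e.g., thresholding them) then yields the sparse circuit: the subgraph of edges whose masks survive.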