Automated interpretability research has recently attracted attention as a way to scale explanations of neural network behavior to large models. Existing automated circuit discovery work applies activation patching to identify the subnetworks (circuits) responsible for solving specific tasks. In this work, we show that a simple method based on attribution patching outperforms all existing methods while requiring just two forward passes and one backward pass. We apply a linear approximation to activation patching to estimate the importance of each edge in the computational subgraph, and then use these estimates to prune the least important edges of the network. We survey the performance and limitations of this method and find that, averaged over all tasks, it achieves a greater AUC for circuit recovery than other methods.
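To make the linear approximation concrete, the following is a minimal sketch on a hypothetical toy model, not the implementation evaluated in this work. It assumes three upstream nodes feeding a single output node, a scalar metric, and a hand-derived gradient standing in for the backward pass; the weights, inputs, and function names (`upstream`, `metric`, `metric_grad`, `attribution_scores`, `patching_effects`, `prune`) are all illustrative assumptions. Each edge is scored as (corrupted activation minus clean activation) times the gradient of the metric with respect to that edge on the clean run, and the lowest-scoring edges are pruned. Exact activation patching is included only for comparison.

```python
import math

# Hypothetical toy model (an assumption for illustration): three upstream
# nodes each emit one scalar activation along an edge into one output node.
W_UP = {"a": [0.5, -0.3], "b": [0.1, 0.8], "c": [-0.02, 0.01]}
W_OUT = {"a": 1.2, "b": -0.7, "c": 0.05}


def upstream(x):
    """Forward pass through the upstream nodes: one activation per edge."""
    return {n: math.tanh(sum(w * xi for w, xi in zip(ws, x)))
            for n, ws in W_UP.items()}


def metric(acts):
    """Scalar task metric computed from the activations carried on each edge."""
    return math.tanh(sum(W_OUT[n] * acts[n] for n in acts))


def metric_grad(acts):
    """Backward pass: d(metric)/d(edge activation), derived by hand here.

    In a real network this would come from automatic differentiation.
    """
    out = metric(acts)
    return {n: (1.0 - out ** 2) * W_OUT[n] for n in acts}


def attribution_scores(x_clean, x_corrupt):
    """Linear estimate of the effect of patching each edge."""
    clean = upstream(x_clean)      # forward pass on the clean input
    corrupt = upstream(x_corrupt)  # forward pass on the corrupted input
    grads = metric_grad(clean)     # one backward pass, taken at the clean run
    return {n: (corrupt[n] - clean[n]) * grads[n] for n in clean}


def patching_effects(x_clean, x_corrupt):
    """Exact activation patching, one extra metric evaluation per edge."""
    clean, corrupt = upstream(x_clean), upstream(x_corrupt)
    base = metric(clean)
    effects = {}
    for n in clean:
        patched = dict(clean)
        patched[n] = corrupt[n]  # swap in the corrupted activation on one edge
        effects[n] = metric(patched) - base
    return effects


def prune(scores, keep):
    """Keep the `keep` edges with the largest absolute score."""
    ranked = sorted(scores, key=lambda n: abs(scores[n]), reverse=True)
    return set(ranked[:keep])


x_clean, x_corrupt = [1.0, 0.5], [0.2, -0.4]
scores = attribution_scores(x_clean, x_corrupt)
effects = patching_effects(x_clean, x_corrupt)
kept = prune(scores, keep=2)

for name in sorted(scores):
    print(name, "estimate:", scores[name], "exact:", effects[name])
print("edges kept:", sorted(kept))
```

In this sketch the cost of scoring does not grow with the number of edges, whereas the exact comparison re-evaluates the metric once per edge. The estimate is only first-order, so it can drift from the exact effect when the activation difference on an edge is large.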