Localizing a neural network's behavior to a subset of its components, or to a subset of the interactions between components, is a natural first step towards analyzing network mechanisms and possible failure modes. Existing work is often qualitative and ad hoc, and there is no consensus on the appropriate way to evaluate localization claims. We introduce path patching, a technique for expressing and quantitatively testing a natural class of hypotheses which state that a behavior is localized to a set of paths. We refine an explanation of induction heads, characterize a behavior of GPT-2, and open-source a framework for efficiently running similar experiments.