Localizing behaviors of neural networks to a subset of the network's components, or to a subset of interactions between components, is a natural first step towards analyzing network mechanisms and possible failure modes. Existing work is often qualitative and ad hoc, and there is no consensus on the appropriate way to evaluate localization claims. We introduce path patching, a technique for expressing and quantitatively testing a natural class of hypotheses stating that behaviors are localized to a set of paths. We refine an explanation of induction heads, characterize a behavior of GPT-2, and open-source a framework for efficiently running similar experiments.