Activation Patching is a method for directly computing causal attributions of behavior to model components. However, applying it exhaustively requires a sweep whose cost scales linearly with the number of model components, which can be prohibitively expensive for state-of-the-art Large Language Models (LLMs). We investigate Attribution Patching (AtP), a fast gradient-based approximation to Activation Patching, and identify two classes of AtP failure modes that lead to significant false negatives. We propose a variant of AtP, called AtP*, with two changes that address these failure modes while retaining scalability. We present the first systematic study of AtP and alternative methods for faster activation patching, and show that AtP significantly outperforms all other investigated methods, with AtP* providing a further significant improvement. Finally, we provide a method to bound the probability of remaining false negatives in AtP* estimates.
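To make the cost contrast concrete, the following is a minimal sketch on a toy scalar "model", not the setup studied here. It assumes that patching means replacing one component's clean activation with its value from a corrupted run, and that the gradient-based approximation is the first-order estimate (activation difference times the gradient at the clean run). The names `model`, `activation_patching`, `attribution_patching`, and all numeric values are hypothetical and chosen only for illustration.

```python
import math

def model(acts, weights):
    """Toy stand-in for a network: scalar output from component activations."""
    return math.tanh(sum(w * a for w, a in zip(weights, acts)))

def activation_patching(clean, corrupt, weights):
    """Exact effect of each component; needs one forward pass per component."""
    base = model(clean, weights)
    effects = []
    for i in range(len(clean)):
        patched = list(clean)
        patched[i] = corrupt[i]  # swap in the corrupted activation
        effects.append(model(patched, weights) - base)
    return effects

def attribution_patching(clean, corrupt, weights):
    """First-order estimate of the same effects from a single gradient computation."""
    z = sum(w * a for w, a in zip(weights, clean))
    dout_dz = 1.0 - math.tanh(z) ** 2       # derivative of tanh at the clean run
    grads = [dout_dz * w for w in weights]  # d(output)/d(activation_i)
    return [(c - a) * g for a, c, g in zip(clean, corrupt, grads)]

# Hypothetical values: component 0 changes slightly, 1 changes a lot, 2 not at all.
weights = [0.5, -0.3, 0.8]
clean = [1.0, 2.0, -1.0]
corrupt = [1.1, 0.0, -1.0]

exact = activation_patching(clean, corrupt, weights)
approx = attribution_patching(clean, corrupt, weights)
for i, (e, a) in enumerate(zip(exact, approx)):
    print(f"component {i}: exact={e:.4f} approx={a:.4f}")
```

In this sketch the estimate tracks the exact effect closely when the activation change is small (component 0) and drifts when the change is large enough for the nonlinearity to matter (component 1). That illustrates only the general limitation of a linear approximation; it is not a reproduction of the two specific failure modes addressed by AtP*.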