Prior interpretability research studying narrow distributions has preliminarily identified self-repair, a phenomenon in which, if components of a large language model are ablated, later components change their behavior to compensate. Our work builds on this past literature, demonstrating that self-repair exists across a variety of model families and sizes when ablating individual attention heads on the full training distribution. We further show that on the full training distribution self-repair is imperfect, as the original direct effect of the head is not fully restored, and noisy, since the degree of self-repair varies significantly across different prompts (sometimes overcorrecting beyond the original effect). We highlight two distinct mechanisms that contribute to self-repair: changes in the final LayerNorm scaling factor (which can repair up to 30% of the direct effect) and sparse sets of neurons implementing Anti-Erasure. We additionally discuss the implications of these results for interpretability practitioners, and we close with a more speculative discussion of the mystery of why self-repair occurs in these models at all, highlighting evidence for the Iterative Inference hypothesis in language models, a framework that predicts self-repair.
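To make the quantities in the abstract concrete, below is a minimal sketch of how self-repair on a single prompt could be measured with the TransformerLens library: ablate one attention head, compare the resulting drop in the answer logit to the head's original direct effect, and treat the gap as the amount repaired. The prompt, the layer/head choice, and the simplified LayerNorm handling are illustrative assumptions, not the paper's exact methodology.

```python
# A minimal sketch of measuring self-repair, assuming TransformerLens
# conventions. Prompt and head choice are hypothetical examples.
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
tokens = model.to_tokens("The Eiffel Tower is located in the city of")
answer = model.to_single_token(" Paris")
layer, head = 9, 6  # hypothetical head to ablate

# Clean run, caching activations so we can read off the head's output.
clean_logits, cache = model.run_with_cache(tokens)

# Direct effect: the head's contribution to the answer logit, i.e. its
# output at the final position, scaled by the cached final-LayerNorm
# denominator and projected through the unembedding (the centering term
# is ignored here; it cancels against a centered unembedding).
z = cache["z", layer][0, -1, head]              # [d_head]
head_out = z @ model.W_O[layer, head]           # [d_model]
scale = cache["ln_final.hook_scale"][0, -1, 0]  # LN scaling factor
direct_effect = ((head_out / scale) @ model.W_U[:, answer]).item()

# Ablated run: zero the head's output and measure the logit drop.
def zero_head(z, hook):
    z[:, :, head, :] = 0.0
    return z

ablated_logits = model.run_with_hooks(
    tokens, fwd_hooks=[(f"blocks.{layer}.attn.hook_z", zero_head)]
)
drop = (clean_logits[0, -1, answer] - ablated_logits[0, -1, answer]).item()

# With no self-repair the drop would equal the direct effect; perfect
# self-repair would leave the logit unchanged. The gap is the repair.
print(f"direct effect: {direct_effect:.3f}")
print(f"logit drop:    {drop:.3f}")
print(f"self-repair:   {direct_effect - drop:.3f}")
```

In this framing, the abstract's findings correspond to the repaired gap being positive but smaller than the direct effect on average (imperfect), varying widely across prompts (noisy), and occasionally exceeding the direct effect so that the logit drop goes negative (overcorrection). Note that because the ablated run recomputes the final LayerNorm, part of the measured repair flows through the changed scaling factor, one of the two mechanisms highlighted above.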