Activation patching is a popular mechanistic interpretability technique, but it involves many subtleties in how it is applied and in how its results are interpreted. We summarize advice and best practices drawn from our experience using the technique. We give an overview of the different ways to apply activation patching and discuss how to interpret the results, focusing on what evidence patching experiments provide about circuits and on the choice of metric and its associated pitfalls.