Mechanistic interpretability seeks to understand the internal mechanisms of machine learning models, and localization -- identifying the important model components -- is a key step. Activation patching, also known as causal tracing or interchange intervention, is a standard technique for this task (Vig et al., 2020), but the literature contains many variants with little consensus on the choice of hyperparameters or methodology. In this work, we systematically examine the impact of methodological details in activation patching, including evaluation metrics and corruption methods. In several settings of localization and circuit discovery in language models, we find that varying these hyperparameters can lead to disparate interpretability results. Backed by empirical observations, we give conceptual arguments for why certain metrics or methods may be preferred. Finally, we provide recommendations for best practices in activation patching going forward.