Mechanistic interpretability seeks to understand the internal mechanisms of machine learning models, and localization -- identifying the important model components -- is a key step. Activation patching, also known as causal tracing or interchange intervention, is a standard technique for this task (Vig et al., 2020), but the literature contains many variants with little consensus on the choice of hyperparameters or methodology. In this work, we systematically examine the impact of methodological details in activation patching, including evaluation metrics and corruption methods. In several settings of localization and circuit discovery in language models, we find that varying these hyperparameters can lead to disparate interpretability results. Backed by empirical observations, we give conceptual arguments for why certain metrics or methods may be preferred. Finally, we provide recommendations for best practices in activation patching going forward.
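To make the terminology concrete, the following is a minimal sketch of activation patching on a toy network, not the experimental setup of this work. The two-layer model, its random weights, the Gaussian-noise corruption, and the logit-difference metric are all assumptions chosen for illustration; they stand in for the "corruption method" and "evaluation metric" choices discussed above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy model: 4 inputs -> 3 hidden units (ReLU) -> 2 output logits.
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(3, 2))

def forward(x, patch=None):
    """Run the toy model. `patch` optionally maps a hidden-unit index to a
    value that overwrites that unit's activation before the output layer."""
    hidden = np.maximum(x @ W1, 0.0)
    if patch is not None:
        hidden = hidden.copy()
        for idx, value in patch.items():
            hidden[idx] = value
    logits = hidden @ W2
    return hidden, logits

def logit_diff(logits, correct=0, wrong=1):
    """One possible evaluation metric: correct-class logit minus wrong-class logit."""
    return float(logits[correct] - logits[wrong])

x_clean = rng.normal(size=4)
# One possible corruption method (assumed here): additive Gaussian noise on the input.
x_corrupt = x_clean + rng.normal(scale=3.0, size=4)

# Cache activations from the clean run, and record both baselines.
clean_hidden, clean_logits = forward(x_clean)
_, corrupt_logits = forward(x_corrupt)
clean_ld = logit_diff(clean_logits)
corrupt_ld = logit_diff(corrupt_logits)

def patching_effects():
    """For each hidden unit, run the corrupted input with that single unit
    restored to its clean value, and report the change in the metric."""
    effects = []
    for i in range(W1.shape[1]):
        _, patched_logits = forward(x_corrupt, patch={i: clean_hidden[i]})
        effects.append(logit_diff(patched_logits) - corrupt_ld)
    return effects

if __name__ == "__main__":
    for i, effect in enumerate(patching_effects()):
        print(f"hidden unit {i}: change in logit difference = {effect:+.3f}")
```

Units whose restoration moves the metric most are the candidates for "important" components. Swapping the metric (for example, probability instead of logit difference) or the corruption (for example, a counterfactual input instead of noise) changes only `logit_diff` and `x_corrupt` in this sketch, which is where such methodological choices enter.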