Mechanistic interpretability aims to understand the behavior of neural networks by reverse-engineering their internal computations. However, current methods struggle to find clear interpretations of neural network activations because we lack a decomposition of activations into computational features: individual neurons or model components do not cleanly correspond to distinct features or functions. We present a novel interpretability method that aims to overcome this limitation by transforming the activations of the network into a new basis, the Local Interaction Basis (LIB). LIB aims to identify computational features by removing irrelevant activations and interactions. Our method drops irrelevant activation directions and aligns the basis with the singular vectors of the Jacobian matrix between adjacent layers. It also scales features based on their importance for downstream computation, producing an interaction graph that shows all computationally relevant features and interactions in a model. We evaluate the effectiveness of LIB on modular addition and CIFAR-10 models, finding that, compared to principal component analysis, it identifies more computationally relevant features that interact more sparsely. However, LIB does not yield substantial improvements in interpretability or interaction sparsity when applied to language models. We conclude that LIB is a promising theory-driven approach for analyzing neural networks, but in its current form it is not applicable to large language models.
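
To make the two basis-construction steps named in the abstract more concrete, the following is a minimal NumPy sketch of the general idea only, not the LIB algorithm itself. It assumes a toy next layer of the form tanh(W x), treats "irrelevant" directions as those with negligible activation variance, and uses the dataset-averaged Gram matrix of the Jacobian to obtain its right singular vectors. The function name `sketch_basis`, the tolerance `var_tol`, and the toy data are all hypothetical choices made for illustration.

```python
import numpy as np


def sketch_basis(acts, W, var_tol=1e-8):
    """Toy illustration (not the paper's algorithm).

    acts: (n, d) activations at some layer.
    W:    (d_out, d) weights of an assumed next layer f(x) = tanh(W x).
    Returns an orthonormal basis (d, k) and per-direction scales (k,).
    """
    # Step 1: drop activation directions with negligible variance (PCA).
    centered = acts - acts.mean(axis=0)
    cov = centered.T @ centered / len(acts)
    eigvals, eigvecs = np.linalg.eigh(cov)
    keep = eigvals > var_tol * eigvals.max()
    P = eigvecs[:, keep]                      # (d, k) retained subspace

    # Step 2: Jacobian of the next layer, restricted to that subspace.
    # For f(x) = tanh(W x): J(x) = diag(1 - tanh(W x)^2) @ W @ P.
    gains = 1.0 - np.tanh(acts @ W.T) ** 2    # (n, d_out)
    WP = W @ P                                # (d_out, k)

    # Dataset-averaged Gram matrix: mean over x of J(x)^T J(x).
    row_weights = (gains ** 2).mean(axis=0)   # (d_out,)
    gram = WP.T @ (row_weights[:, None] * WP)  # (k, k)

    # Step 3: eigenvectors of the Gram matrix are the right singular
    # vectors of the Jacobians stacked over the dataset.
    s2, V = np.linalg.eigh(gram)
    order = np.argsort(s2)[::-1]              # largest singular value first
    scales = np.sqrt(np.clip(s2[order], 0.0, None))
    basis = P @ V[:, order]                   # (d, k), orthonormal columns
    return basis, scales


# Toy data: 6-dimensional activations that only span 4 directions.
rng = np.random.default_rng(0)
acts = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 6))
W = rng.normal(size=(5, 6))

basis, scales = sketch_basis(acts, W)
features = (acts - acts.mean(axis=0)) @ basis * scales  # scaled coordinates
print(basis.shape, scales.shape, features.shape)
```

In this sketch the singular values serve as a crude stand-in for "importance for downstream computation"; the abstract does not specify how the actual method computes that scaling, so this choice should be read as an assumption.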