Mechanistic Interpretability (MI) promises a path toward fully understanding how neural networks make their predictions. Prior work demonstrates that even when trained to perform simple arithmetic, models can implement a variety of algorithms (sometimes concurrently) depending on initialization and hyperparameters. Does this mean neuron-level interpretability techniques have limited applicability? We argue that high-dimensional neural networks can learn low-dimensional representations of their training data that are useful beyond simply making good predictions. Such representations can be understood through the mechanistic interpretability lens and provide insights that are surprisingly faithful to human-derived domain knowledge. This indicates that such approaches to interpretability can be useful for deriving a new understanding of a problem from models trained to solve it. As a case study, we extract nuclear physics concepts by studying models trained to reproduce nuclear data.