A mechanistic understanding of how MLPs do computation in deep neural networks remains elusive. Current interpretability work can extract features from hidden activations over an input dataset but generally cannot explain how MLP weights construct features. One challenge is that element-wise nonlinearities introduce higher-order interactions and make it difficult to trace computations through the MLP layer. In this paper, we analyze bilinear MLPs, a type of Gated Linear Unit (GLU) without any element-wise nonlinearity that nevertheless achieves competitive performance. Bilinear MLPs can be fully expressed in terms of linear operations using a third-order tensor, allowing flexible analysis of the weights. Analyzing the spectra of bilinear MLP weights using eigendecomposition reveals interpretable low-rank structure across toy tasks, image classification, and language modeling. We use this understanding to craft adversarial examples, uncover overfitting, and identify small language model circuits directly from the weights alone. Our results demonstrate that bilinear layers serve as an interpretable drop-in replacement for current activation functions and that weight-based interpretability is viable for understanding deep-learning models.
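To make the central claim concrete, here is a minimal NumPy sketch (shapes and variable names are illustrative, not taken from the paper) showing that a bilinear layer — the elementwise product of two linear maps, with no nonlinearity — is exactly equivalent to a third-order tensor of bilinear forms, whose symmetrized slices can be eigendecomposed to inspect the weights directly:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden = 8, 4  # illustrative sizes
W = rng.normal(size=(d_hidden, d_in))
V = rng.normal(size=(d_hidden, d_in))
x = rng.normal(size=d_in)

# Gated form: elementwise product of two linear maps, no nonlinearity.
gated = (W @ x) * (V @ x)

# Equivalent third-order-tensor form: B[d] = outer(W[d], V[d]),
# so hidden unit d computes the bilinear form x^T B[d] x.
B = np.einsum('di,dj->dij', W, V)
tensor_form = np.einsum('dij,i,j->d', B, x, x)
assert np.allclose(gated, tensor_form)

# x^T B x depends only on the symmetric part of B, so symmetrize
# each slice and eigendecompose to obtain a per-unit spectrum:
# x^T B_sym[d] x = sum_k eigvals[d,k] * (eigvecs[d,:,k] . x)**2.
B_sym = 0.5 * (B + B.transpose(0, 2, 1))
eigvals, eigvecs = np.linalg.eigh(B_sym)
```

The low-rank structure the abstract refers to shows up here as a few eigenvalues per unit dominating the spectrum; the corresponding eigenvectors are input directions whose squared projections drive that unit.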