Understanding how Transformer-based Language Models (LMs) learn and recall information is a key goal of the deep learning community. Recent interpretability methods project weights and hidden states obtained from the forward pass to the models' vocabularies, helping to uncover how information flows within LMs. In this work, we extend this methodology to LMs' backward pass and gradients. We first prove that a gradient matrix can be cast as a low-rank linear combination of its forward and backward passes' inputs. We then develop methods to project these gradients into vocabulary items and explore the mechanics of how new information is stored in the LMs' neurons.
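As a minimal illustration of the low-rank claim (a sketch with hypothetical toy dimensions, not the paper's own code): for a single linear layer, the weight gradient computed by autograd equals a sum of outer products of the backward-pass inputs (the gradients w.r.t. the layer's outputs) with the forward-pass inputs, so its rank is bounded by the number of tokens.

```python
import torch

# Toy dimensions (assumptions for illustration only).
torch.manual_seed(0)
d_in, d_out, n_tokens = 16, 32, 4

W = torch.randn(d_out, d_in, requires_grad=True)
X = torch.randn(n_tokens, d_in)      # forward-pass inputs, one row per token
Y = X @ W.T                          # forward pass of a linear layer: y_i = W x_i
loss = Y.pow(2).sum()                # arbitrary scalar loss
loss.backward()

# Backward-pass inputs dL/dy_i (here 2*Y, since loss = sum of Y^2).
delta = (2 * Y).detach()

# Autograd's gradient is a sum of n_tokens rank-1 terms:
#   dL/dW = sum_i delta_i x_i^T
manual_grad = delta.T @ X            # (d_out, n_tokens) @ (n_tokens, d_in)
assert torch.allclose(W.grad, manual_grad)

# Hence rank(dL/dW) <= n_tokens, far below min(d_in, d_out).
print(torch.linalg.matrix_rank(W.grad).item())  # at most 4
```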