Trained Transformers have been shown to compute abstract features that appear redundant for predicting the immediate next token. We identify which components of the gradient signal from the next-token prediction objective give rise to this phenomenon, and we propose a method to estimate the influence of those components on the emergence of specific features. After validating our approach on toy tasks, we use it to interpret the origins of the world model in OthelloGPT and syntactic features in a small language model. Finally, we apply our framework to a pretrained LLM, showing that features with extremely high or low influence on future tokens tend to be related to formal reasoning domains such as code. Overall, our work takes a step toward understanding hidden features of Transformers through the lens of their development during training.