The LayerNorm (LN) layer in GPT-style transformer models has long been an obstacle to mechanistic interpretability. LN is a crucial component for stabilizing the training of large language models, and LN or the similar RMSNorm is used in practically all large language models based on the transformer architecture. Because LN is non-linear, it complicates interpretation of the residual stream and makes it difficult to decompose the model into circuits. Some researchers have gone so far as to enumerate "reasons interpretability researchers hate layer norm." In this paper we show that it is possible to remove the LN layers from a pre-trained GPT2-small model by fine-tuning on a fraction (500M tokens) of the training data. We demonstrate that this LN-free model achieves performance similar to the original model on the OpenWebText and ThePile datasets (-0.05 cross-entropy loss) and on the Hellaswag benchmark (-0.5% accuracy). We provide our implementation at https://github.com/ApolloResearch/gpt2_noLN, and fine-tuned GPT2-small models at https://huggingface.co/apollo-research/gpt2_noLN. Our work not only provides a simplified model for mechanistic interpretability research, but also provides evidence that the LN layers, at inference time, do not play a crucial role in transformer models.
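The non-linearity that makes LN awkward for circuit analysis can be illustrated with a minimal sketch. This is an illustrative simplification, not the paper's implementation: the learned scale and bias parameters are omitted, and a single vector stands in for the per-position residual-stream activations.

```python
import math

def layernorm(x, eps=1e-5):
    # Normalize to zero mean and unit variance (learned scale/shift omitted).
    # The division by an input-dependent standard deviation is what makes
    # the operation non-linear.
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [(v - mu) / math.sqrt(var + eps) for v in x]

x = [1.0, 2.0, 4.0]
y = [0.5, -1.0, 3.0]

# Non-linearity: LN(x + y) differs from LN(x) + LN(y), so LN cannot be
# folded into a linear decomposition of residual-stream contributions.
lhs = layernorm([a + b for a, b in zip(x, y)])
rhs = [a + b for a, b in zip(layernorm(x), layernorm(y))]
print(max(abs(a - b) for a, b in zip(lhs, rhs)) > 0.1)  # True

# Scale invariance: LN discards the overall norm of the residual stream,
# so LN(2x) is (up to eps) identical to LN(x).
scaled = layernorm([2 * v for v in x])
print(max(abs(a - b) for a, b in zip(scaled, layernorm(x))) < 1e-4)  # True
```

Because the output of LN depends on the input's own mean and standard deviation, the contribution of any one upstream component to the final logits cannot be read off linearly, which is precisely the difficulty that removing LN sidesteps.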