As large language models (LLMs) become central to AI applications, gaining a deeper understanding of their inner workings is increasingly important. In this work, we analyze the weight matrices of pretrained transformer models -- specifically BERT and Llama -- using random matrix theory (RMT) as a zero-information hypothesis. While randomly initialized weights agree perfectly with RMT predictions, deviations emerge after training, allowing us to locate learned structures within the models. We identify layer-type specific behaviors that are consistent across all blocks and architectures considered. By pinpointing regions that deviate from RMT predictions, we highlight areas of feature learning and confirm this through comparisons with the activation covariance matrices of the corresponding layers. Our method provides a diagnostic tool for identifying relevant regions in transformer weights using only the trained matrices. Additionally, we address the ongoing debate regarding the significance of small singular values in the context of fine-tuning and alignment in LLMs. Our findings reveal that, after fine-tuning, small singular values play a crucial role in the models' capabilities, suggesting that removing them from an already aligned transformer can be detrimental, as it may compromise model alignment.
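To make the zero-information baseline concrete, the sketch below illustrates the kind of comparison the abstract describes: the singular-value spectrum of a randomly initialized weight matrix is checked against the Marchenko-Pastur (MP) law, the RMT prediction for i.i.d. Gaussian weights. This is a minimal illustration, not the authors' code; the matrix shape, the variance convention, and the truncation cutoff `k` are hypothetical stand-ins for what one would extract from an actual BERT or Llama checkpoint.

```python
# Minimal sketch of the RMT zero-information baseline (illustrative, not the
# paper's implementation). A random m x n matrix with i.i.d. N(0, sigma^2/n)
# entries has W W^T eigenvalues following the Marchenko-Pastur law; a trained
# weight matrix deviates from it wherever structure has been learned.

import numpy as np

def marchenko_pastur_pdf(x, q, sigma=1.0):
    """MP density for eigenvalues of W W^T, aspect ratio q = m/n <= 1."""
    lam_min = sigma**2 * (1.0 - np.sqrt(q))**2
    lam_max = sigma**2 * (1.0 + np.sqrt(q))**2
    pdf = np.zeros_like(x)
    inside = (x > lam_min) & (x < lam_max)
    pdf[inside] = np.sqrt((lam_max - x[inside]) * (x[inside] - lam_min)) / (
        2.0 * np.pi * sigma**2 * q * x[inside]
    )
    return pdf

# Hypothetical layer shape, e.g. an MLP projection; a real analysis would
# load `weight` from a pretrained checkpoint instead.
m, n, sigma = 768, 3072, 1.0
weight = np.random.randn(m, n) * sigma / np.sqrt(n)

# Eigenvalues of W W^T are the squared singular values of W.
sing_vals = np.linalg.svd(weight, compute_uv=False)
eigvals = sing_vals**2

# Empirical spectral density vs. the MP prediction: for a random matrix the
# gap is small; for a trained matrix, bulk deformations and outliers mark
# regions of feature learning.
hist, edges = np.histogram(eigvals, bins=50, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
mp = marchenko_pastur_pdf(centers, q=m / n, sigma=sigma)
print("mean |empirical - MP| density gap:", np.abs(hist - mp).mean())

# The abstract's caution about small singular values corresponds to low-rank
# truncation: rebuilding W from only its top-k singular directions discards
# exactly the components found to matter after fine-tuning.
U, S, Vt = np.linalg.svd(weight, full_matrices=False)
k = 512  # hypothetical cutoff
W_truncated = (U[:, :k] * S[:k]) @ Vt[:k, :]
```

Run on a randomly initialized matrix, the density gap is small by construction; run on a fine-tuned layer, the same comparison would expose the deviating regions, and the truncation step shows the operation whose alignment cost the abstract warns about.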