Grokking in LLM Pretraining? Monitor Memorization-to-Generalization without Test

This paper presents the first study of grokking in practical LLM pretraining. Specifically, we investigate when an LLM memorizes the training data, when its generalization on downstream tasks starts to improve, and what happens if there is a lag between the two. Existing works study when a small model generalizes to a limited set of specified tasks over thousands of epochs of training on algorithmic data. In contrast, we focus on a practical setting for LLMs: one-epoch pretraining with next-token prediction on a cross-domain, large-scale corpus, with generalization measured on diverse benchmark tasks covering math and commonsense reasoning, code generation, and domain-specific retrieval. Our study verifies, for the first time, that grokking still emerges when pretraining mixture-of-experts (MoE) LLMs, though different local data groups may enter their grokking stages asynchronously because of the heterogeneity of their distributions and of their attributions to other groups. To find a mechanistic interpretation of this local grokking, we investigate the dynamics of the training data's pathways (i.e., the expert choices across layers in an MoE model). Our primary discovery is that the pathways evolve from random, instance-specific, and non-smooth across layers to more structured and transferable across samples, even after the pretraining loss has converged. This depicts a transition from memorization to generalization. We develop two novel metrics to quantify these patterns: one computes the pathway similarity between samples, while the other measures the consistency of aggregated experts between subsequent layers for each sample. These metrics are computed on training data alone and incur zero cost, yet they faithfully track the generalization of LLMs on downstream tasks, which in conventional settings requires costly instruction tuning and benchmark evaluation.
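
The abstract names two pathway metrics without defining them, so the following is only a minimal sketch of what such metrics could look like, not the paper's formulation. It assumes top-1 routing, so that a sample's pathway is one expert index per layer; it uses a normalized Hamming match as a stand-in for pathway similarity and cosine similarity between adjacent layers' aggregated routing weights as a stand-in for cross-layer consistency. All function names are hypothetical, and comparing expert indices across layers is itself a simplification, since experts in different layers are distinct modules.

```python
from itertools import combinations
from math import sqrt


def pathway_distance(path_a, path_b):
    """Fraction of layers where two samples picked different experts.

    Assumption: each pathway is a list with one expert index per layer
    (top-1 routing). This is a normalized Hamming distance, used here
    only as an illustrative stand-in.
    """
    if len(path_a) != len(path_b):
        raise ValueError("pathways must cover the same number of layers")
    if not path_a:
        return 0.0
    mismatches = sum(a != b for a, b in zip(path_a, path_b))
    return mismatches / len(path_a)


def mean_pairwise_similarity(pathways):
    """Average (1 - distance) over all unordered pairs of samples."""
    pairs = list(combinations(pathways, 2))
    if not pairs:
        return 1.0
    return sum(1.0 - pathway_distance(a, b) for a, b in pairs) / len(pairs)


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = sqrt(sum(x * x for x in u))
    norm_v = sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def cross_layer_consistency(routing_weights):
    """Mean cosine similarity between adjacent layers for one sample.

    Assumption: routing_weights[l] is the per-expert routing weight in
    layer l, aggregated over the sample's tokens, and every layer has
    the same number of experts.
    """
    sims = [
        cosine(routing_weights[layer], routing_weights[layer + 1])
        for layer in range(len(routing_weights) - 1)
    ]
    return sum(sims) / len(sims) if sims else 1.0
```

Under these assumptions, a rising `mean_pairwise_similarity` across checkpoints would correspond to pathways becoming more shared across samples, and a rising `cross_layer_consistency` to smoother routing across layers. Both need only routing decisions recorded on training samples, with no held-out benchmark.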
