We explore the critical data size in language models, a threshold that marks a fundamental shift from quick memorization to slow generalization. We formalize the phase transition under the grokking configuration into the Data Efficiency Hypothesis and identify data insufficiency, sufficiency, and surplus regimes in the training dynamics of language models. We develop a grokking configuration that stably reproduces grokking on simplified language models by rescaling initialization and weight decay. We show that generalization occurs only when language models reach a critical size. We analyze grokking both sample-wise and model-wise, verifying the proposed Data Efficiency Hypothesis. Our experiments reveal that phase transitions at the critical dataset size are smoother for language datasets. As the model size increases, this critical point also becomes larger, indicating that larger models require more data. Our results deepen the understanding of language model training, offering a novel perspective on the role of data in the learning mechanism of language models.