We explore the critical data size in language models, a threshold that marks a fundamental shift from quick memorization to slow generalization. We formalize the phase transition under the grokking configuration as the Data Efficiency Hypothesis and identify data insufficiency, sufficiency, and surplus regimes in the training dynamics of language models. We develop a grokking configuration that stably reproduces grokking on simplistic language models by rescaling initialization and weight decay. We show that generalization occurs only when language models reach a critical size. We analyze grokking both sample-wise and model-wise, verifying the proposed Data Efficiency Hypothesis. Our experiments reveal smoother phase transitions occurring at the critical dataset size for language datasets. As the model size increases, this critical point also becomes larger, indicating that larger models require more data. Our results deepen the understanding of language model training, offering a novel perspective on the role of data in the learning mechanism of language models.
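To make the phrase "rescaling initialization and weight decay" concrete, the following is a minimal sketch of what such a configuration could look like, not the configuration used in this work. The abstract does not state the scaling scheme or any values, so the rescale factor `alpha`, the base standard deviation `1/sqrt(n_in)`, the `weight_decay` coefficient, and all function names are assumptions made purely for illustration.

```python
import random


def init_weights(n_in, n_out, alpha=1.0, seed=0):
    """Gaussian init with std 1/sqrt(n_in), multiplied by alpha.

    alpha is a hypothetical rescale factor; alpha=1.0 gives the base init.
    """
    rng = random.Random(seed)
    std = 1.0 / n_in ** 0.5
    return [[alpha * rng.gauss(0.0, std) for _ in range(n_in)]
            for _ in range(n_out)]


def sgd_step(weights, grads, lr=0.01, weight_decay=0.0):
    """One SGD step with weight decay: w <- w - lr * (g + weight_decay * w)."""
    return [[w - lr * (g + weight_decay * w) for w, g in zip(w_row, g_row)]
            for w_row, g_row in zip(weights, grads)]


def l2_norm(weights):
    """Euclidean norm of all entries of a weight matrix."""
    return sum(w * w for row in weights for w in row) ** 0.5


if __name__ == "__main__":
    base = init_weights(4, 3, alpha=1.0)
    scaled = init_weights(4, 3, alpha=8.0)  # assumed rescale factor
    print("base norm:  ", l2_norm(base))
    print("scaled norm:", l2_norm(scaled))

    # With zero gradients, weight decay alone shrinks the weight norm.
    zeros = [[0.0] * 4 for _ in range(3)]
    decayed = sgd_step(scaled, zeros, lr=0.1, weight_decay=0.5)
    print("after decay:", l2_norm(decayed))
```

In this sketch the two knobs act in opposite directions: `alpha` sets how large the weights start, and `weight_decay` sets how quickly they shrink during training. Both would have to be tuned for a given model and dataset.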