The current trend of scaling language models involves increasing both parameter count and training dataset size. Extrapolating this trend suggests that training dataset size may soon be limited by the amount of text data available on the internet. Motivated by this limit, we investigate scaling language models in data-constrained regimes. Specifically, we run a large set of experiments varying the extent of data repetition and the compute budget, ranging up to 900 billion training tokens and models of up to 9 billion parameters. We find that when data is constrained and the compute budget is fixed, training with up to 4 epochs of repeated data yields negligible changes to loss compared to training on unique data. However, with more repetition, the value of adding compute eventually decays to zero. We propose and empirically validate a scaling law for compute optimality that accounts for the decreasing value of repeated tokens and excess parameters. Finally, we experiment with approaches to mitigating data scarcity, including augmenting the training dataset with code data or removing commonly used filters. Models and datasets from our 400 training runs are publicly available at https://github.com/huggingface/datablations.
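
To make the idea of "decreasing value of repeated tokens" concrete, the following minimal sketch models repeated data as saturating exponentially. The abstract does not state the functional form, so the exponential shape, the function name `effective_tokens`, and the default `decay` constant are all assumptions chosen for illustration, not the fitted law.

```python
import math


def effective_tokens(unique_tokens: float, epochs: float, decay: float = 15.0) -> float:
    """Return the unique-token equivalent of training for `epochs` passes.

    ASSUMPTION: each extra pass over the data is worth exponentially less
    than the one before it. `decay` is a hypothetical constant that sets how
    quickly repetition stops helping; it is not a value taken from the text.
    """
    if epochs < 1:
        raise ValueError("epochs must be at least 1 (one pass over the unique data)")
    if decay <= 0:
        raise ValueError("decay must be positive")
    repeats = epochs - 1.0  # passes beyond the first
    # The first pass counts in full; the repeats saturate towards `decay` passes' worth.
    return unique_tokens * (1.0 + decay * (1.0 - math.exp(-repeats / decay)))


if __name__ == "__main__":
    unique = 100e9  # hypothetical budget of 100 billion unique tokens
    for epochs in (1, 4, 16, 64):
        seen = unique * epochs
        worth = effective_tokens(unique, epochs)
        print(f"{epochs:>3} epochs: {seen:.3g} tokens seen, "
              f"worth {worth:.3g} unique tokens ({worth / seen:.0%})")
```

Under these assumptions, a handful of epochs is worth nearly as much as the same number of unique tokens, while the effective data is capped at `unique_tokens * (1 + decay)` no matter how many further epochs are added, which mirrors the qualitative finding that extra compute eventually stops helping.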