The current trend of scaling language models involves increasing both parameter count and training dataset size. Extrapolating this trend suggests that training dataset size may soon be limited by the amount of text data available on the internet. Motivated by this limit, we investigate scaling language models in data-constrained regimes. Specifically, we run a large set of experiments varying the extent of data repetition and the compute budget, ranging up to 900 billion training tokens and models with 9 billion parameters. We find that, with constrained data and a fixed compute budget, training for up to 4 epochs on repeated data yields negligible changes in loss compared to training on unique data. With more repetition, however, the value of adding compute eventually decays to zero. We propose and empirically validate a scaling law for compute optimality that accounts for the decreasing value of repeated tokens and excess parameters. Finally, we experiment with approaches to mitigating data scarcity, including augmenting the training dataset with code data and removing commonly used filters. Models and datasets from our 400 training runs are freely available at https://github.com/huggingface/datablations.
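To make the idea of a decreasing value of repeated tokens concrete, the following is a minimal sketch, not the fitted scaling law itself. It assumes that each additional pass over the data contributes exponentially less than the previous one, so the effective amount of data saturates. The function name `effective_tokens` and the decay constant `r_star` (set to 15.0 here) are hypothetical choices made for illustration and are not stated in the text above.

```python
import math


def effective_tokens(unique_tokens: float, epochs: float, r_star: float = 15.0) -> float:
    """Effective data under an ASSUMED exponential-decay model of repetition.

    unique_tokens: number of unique training tokens available.
    epochs: total passes over the unique data (1 means no repetition).
    r_star: hypothetical decay constant controlling how quickly repeated
            tokens lose value (an illustrative value, not a reported one).
    """
    if epochs <= 1:
        # Less than one full pass: every token seen is still unique.
        return unique_tokens * epochs
    repeats = epochs - 1  # passes beyond the first
    # The first pass counts fully; repeated passes add a saturating bonus
    # that can never exceed r_star * unique_tokens.
    return unique_tokens * (1.0 + r_star * (1.0 - math.exp(-repeats / r_star)))


if __name__ == "__main__":
    unique = 100e9  # hypothetical budget of 100B unique tokens
    for epochs in (1, 4, 16, 64):
        eff = effective_tokens(unique, epochs)
        ratio = eff / (unique * epochs)
        print(f"{epochs:>3} epochs: effective/nominal tokens = {ratio:.2f}")
```

Under these assumptions, a few epochs of repetition are worth nearly as much as the same number of unique tokens, while at many epochs the effective data approaches a ceiling, which is the qualitative behavior the abstract describes when it says the value of added compute decays to zero.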