Domain reweighting is an emerging research area aimed at adjusting the relative weights of different data sources to improve the effectiveness and efficiency of language model pre-training. This paper demonstrates that the optimal composition of training data from different domains is scale-dependent, challenging the existing practice of determining optimal mixtures through small-scale experiments and directly applying them at larger scales. We derive an analytical model of how optimal weights depend on data scale and introduce *AutoScale*, a novel, practical approach for optimizing data compositions at large training data scales. *AutoScale* first uses a principled optimization framework to find optimal compositions at smaller, feasible scales, then predicts optimal compositions at larger scales using our derived model. Our evaluation on GPT-2 Large and BERT pre-training demonstrates *AutoScale*'s effectiveness in improving training convergence and downstream performance. Notably, for GPT-2 Large on RedPajama, *AutoScale* decreases validation perplexity 28% faster than baselines, with up to a 38% speed-up over unweighted training, and achieves the best performance across downstream tasks. This work provides insights into the varying benefits of data sources across training scales for language models, contributing to the burgeoning research on scale-dependent data curation. Code is open-sourced.