Language model performance depends on identifying the optimal mixture of data groups to train on (e.g., law, code, math). Prior work has proposed a diverse set of methods to efficiently learn mixture proportions, ranging from fitting regression models over training runs to dynamically updating proportions throughout training. Surprisingly, we find that no existing method consistently outperforms a simple stratified sampling baseline in terms of average test perplexity per group. In this paper, we study the cause of this inconsistency by unifying existing methods into a standard optimization framework. We show that all methods set proportions to minimize total loss, subject to a method-specific mixing law -- an assumption on how loss is a function of mixture proportions. We find that existing parameterizations of mixing laws can express the true loss-proportion relationship empirically, but the methods themselves often set the mixing law parameters inaccurately, resulting in poor and inconsistent performance. Finally, we leverage the insights from our framework to derive a new online method named Aioli, which directly estimates the mixing law parameters throughout training and uses them to dynamically adjust proportions. Empirically, Aioli outperforms stratified sampling on 6 out of 6 datasets by an average of 0.28 test perplexity points, whereas existing methods fail to consistently beat stratified sampling, doing up to 6.9 points worse. Moreover, in a practical setting where proportions are learned on shorter runs due to computational constraints, Aioli can dynamically adjust these proportions over the full training run, consistently improving performance over existing methods by up to 12.01 test perplexity points.
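To make the framework concrete, the sketch below shows what "set proportions to minimize total loss, subject to a mixing law" can look like for one update step. It is an illustration under stated assumptions, not the paper's implementation of Aioli or of any prior method: the linear form of the mixing law, the interaction matrix `A`, the offsets `c`, the step size, and all function names are hypothetical choices made for this example. The abstract itself does not specify them.

```python
import numpy as np


def stratified_proportions(num_groups: int) -> np.ndarray:
    """Stratified baseline: every data group gets the same proportion."""
    return np.full(num_groups, 1.0 / num_groups)


def predicted_losses(p: np.ndarray, A: np.ndarray, c: np.ndarray) -> np.ndarray:
    """Assumed linear mixing law: loss_i = c_i - sum_j A[i, j] * p[j].

    A[i, j] is a hypothetical estimate of how much training on group j
    reduces the loss of group i.
    """
    return c - A @ p


def exponentiated_gradient_step(
    p: np.ndarray, A: np.ndarray, step_size: float = 0.5
) -> np.ndarray:
    """One multiplicative update that lowers the predicted total loss
    while keeping the proportions on the probability simplex."""
    # Gradient of sum_i loss_i with respect to p_j is -sum_i A[i, j].
    grad = -A.sum(axis=0)
    unnormalized = p * np.exp(-step_size * grad)
    return unnormalized / unnormalized.sum()


# Made-up mixing-law parameters for three groups (illustration only).
A = np.array([
    [0.9, 0.1, 0.0],
    [0.2, 0.5, 0.1],
    [0.3, 0.0, 0.4],
])
c = np.array([3.0, 3.5, 2.8])

p0 = stratified_proportions(3)
p1 = exponentiated_gradient_step(p0, A)

print("stratified proportions:", p0)
print("updated proportions:   ", p1)
print("predicted total loss before:", predicted_losses(p0, A, c).sum())
print("predicted total loss after: ", predicted_losses(p1, A, c).sum())
```

In this toy setup the update moves weight toward the group whose column of `A` has the largest sum, since under the assumed law that group reduces the most loss overall. How well such a step works in practice depends on how accurately the mixing-law parameters are estimated, which is the failure mode the abstract attributes to existing methods.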