The data used to pretrain large language models has a decisive impact on a model's downstream performance, which has led to a large body of work on data selection methods that aim to automatically determine the most suitable data to use for pretraining. Existing data selection methods are slow and computationally expensive, a problem amplified by the increasing size of models and of pretraining datasets. Data mixing, on the other hand, reduces the complexity of data selection by grouping data points together and determining sampling probabilities across entire groups. However, data mixing proportions are typically fixed before training and therefore cannot adapt to changing training dynamics. To address these limitations, we develop an efficient algorithm for Online Data Mixing (ODM) that combines elements from both data selection and data mixing. Based on multi-armed bandit algorithms, our online approach optimizes the data mixing proportions during training. Remarkably, our method trains a model that reaches the final perplexity of the next best method with 19% fewer training iterations, and improves performance on the 5-shot MMLU benchmark by 1.9% relative accuracy, while adding negligible wall-clock time during pretraining.
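To make the bandit view of data mixing concrete, the following is a minimal sketch of how sampling probabilities over data groups could be updated online. It uses a generic EXP3-style update as an assumption; the abstract does not specify which bandit algorithm or reward ODM uses, so the class name `Exp3Mixer`, the `exploration` parameter, and the fixed per-domain rewards are all hypothetical stand-ins chosen for illustration.

```python
import math
import random


class Exp3Mixer:
    """Generic EXP3-style bandit over data domains.

    Illustrative only: this is not claimed to be the ODM algorithm.
    Each arm is one data group; the reward is assumed to lie in [0, 1].
    """

    def __init__(self, num_domains, exploration=0.1, seed=0):
        if num_domains < 1:
            raise ValueError("num_domains must be positive")
        if not 0.0 < exploration <= 1.0:
            raise ValueError("exploration must be in (0, 1]")
        self.k = num_domains
        self.gamma = exploration
        # Weights are kept in log space to avoid overflow on long runs.
        self.log_weights = [0.0] * num_domains
        self.rng = random.Random(seed)

    def probabilities(self):
        """Current mixing proportions: softmax of weights plus uniform exploration."""
        top = max(self.log_weights)
        exp_w = [math.exp(w - top) for w in self.log_weights]
        total = sum(exp_w)
        return [
            (1.0 - self.gamma) * w / total + self.gamma / self.k
            for w in exp_w
        ]

    def sample_domain(self):
        """Pick the domain to draw the next training batch from."""
        probs = self.probabilities()
        return self.rng.choices(range(self.k), weights=probs, k=1)[0]

    def update(self, domain, reward):
        """Importance-weighted update for the domain that was just sampled."""
        prob = self.probabilities()[domain]
        estimated_reward = reward / prob
        self.log_weights[domain] += self.gamma * estimated_reward / self.k


def simulate(steps=5000):
    """Toy loop with fixed, made-up rewards standing in for a training signal."""
    mixer = Exp3Mixer(num_domains=3, exploration=0.1, seed=0)
    hypothetical_reward = [0.2, 0.5, 0.8]  # assumption: domain 2 is most useful
    counts = [0, 0, 0]
    for _ in range(steps):
        domain = mixer.sample_domain()
        mixer.update(domain, hypothetical_reward[domain])
        counts[domain] += 1
    return mixer.probabilities(), counts


if __name__ == "__main__":
    probs, counts = simulate()
    print("mixing proportions:", [round(p, 3) for p in probs])
    print("batches per domain:", counts)
```

In a real training loop the reward would come from a signal observed during training rather than a fixed table, and the per-step cost is a handful of arithmetic operations per domain, which is consistent with the abstract's claim of negligible added wall-clock time.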