Modern machine learning pipelines increasingly combine data from diverse and disparate sources, e.g., for pre-training large language models. Yet, finding the optimal data mixture is a challenging open problem. We formalize this data mixing problem as a bi-level objective: the best mixture is the one that leads to the best model for a downstream objective. Unfortunately, this objective is generally intractable. In this paper, we make the observation that the bi-level data mixing objective becomes convex as our model class becomes larger. We develop and study a gradient-based approach for optimizing this convex objective, which we call MixMin, and test it on language modeling and chemistry tasks. MixMin was the only method that uniformly improved the data mixture in all our experiments. With MixMin, we improved the data mixture using less than 0.2% additional compute for a pythia-410M model trained on 8.2B tokens, resulting in a 1-5% relative improvement in negative log-likelihood on PIQA, ARC Easy, SciQ, and OpenWebMath. Crucially, we found that MixMin mixtures for smaller models improved training of larger models, suggesting that MixMin mixtures may be scale-invariant. When mixing bioassay data to train an XGBoost model, we saw improvements to average precision scores of 0.03-0.15.
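To make the convexity observation concrete, the following is a minimal illustrative sketch, not the paper's exact algorithm. It assumes we have, for each data source, a cheap proxy model's per-example probabilities on a downstream validation set (the `probs` array here is hypothetical random data). The negative log-likelihood of the weighted mixture of these models is convex in the mixture weights, so a simple gradient method on the probability simplex can optimize it:

```python
import numpy as np

# Hypothetical setup: k sources, n downstream validation examples,
# probs[i, j] = probability that source i's proxy model assigns to example j.
rng = np.random.default_rng(0)
k, n = 4, 1000
probs = rng.uniform(0.05, 1.0, size=(k, n))

def mixture_nll(w, probs):
    """Negative log-likelihood of the w-weighted mixture of proxy models.

    This is convex in w because log of a nonnegative linear combination
    is concave, and we take its negative mean.
    """
    return -np.mean(np.log(w @ probs))

# Exponentiated-gradient (mirror) descent: the multiplicative update
# followed by renormalization keeps w on the probability simplex.
w = np.full(k, 1.0 / k)          # start from the uniform mixture
lr = 0.5
for _ in range(500):
    mix = w @ probs                     # mixture probability per example
    grad = -(probs / mix).mean(axis=1)  # d NLL / d w_i
    w = w * np.exp(-lr * grad)
    w /= w.sum()

print("optimized weights:", w)
print("uniform NLL:", mixture_nll(np.full(k, 1.0 / k), probs))
print("optimized NLL:", mixture_nll(w, probs))
```

Because the objective is convex, any such first-order scheme finds the globally optimal mixture; the abstract's point is that this tractable convex surrogate emerges from the intractable bi-level problem once the model class is large enough.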