Large language models exhibit exceptional generalization capabilities, largely attributable to training on diversely sourced data. However, conventional practice in integrating such diverse data relies heavily on heuristic schemes and lacks theoretical guidance. This work addresses that limitation by investigating low-cost proxies for data-mixing strategies, with the aim of streamlining data curation and improving training efficiency. Specifically, we propose a unified scaling law, termed $\textbf{BiMix}$, that accurately models the bivariate scaling behavior of both data quantity and mixing proportions. We conduct systematic experiments and provide empirical evidence for the predictive power and fundamental principles of $\textbf{BiMix}$. Notably, our findings reveal that entropy-driven, training-free data mixtures can achieve performance comparable to or better than more resource-intensive methods. We hope these quantitative insights will inform further judicious research and development in cost-effective language modeling.
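To make the idea of an entropy-driven, training-free data mixture concrete, the following is a minimal, hypothetical sketch: it estimates a unigram entropy proxy per domain from sample text and normalizes those estimates into mixing proportions. The function names (`token_entropy`, `entropy_mixture`), the choice of unigram entropy, and the normalization scheme are illustrative assumptions, not the specific proxy or procedure used in this work.

```python
# Hypothetical sketch of an entropy-based, training-free mixing heuristic.
# The entropy proxy and normalization here are assumptions for illustration.
from collections import Counter
import math


def token_entropy(tokens):
    """Shannon entropy (bits/token) of the empirical unigram distribution."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def entropy_mixture(domain_samples):
    """Normalize per-domain entropy estimates into mixing proportions."""
    entropies = {d: token_entropy(toks) for d, toks in domain_samples.items()}
    z = sum(entropies.values())
    return {d: h / z for d, h in entropies.items()}


# Toy usage with whitespace "tokens"; a real pipeline would use a trained
# tokenizer and much larger samples per domain.
samples = {
    "web": "the cat sat on the mat the cat".split(),
    "code": "def f ( x ) : return x + x".split(),
    "books": "it was the best of times it was the worst of times".split(),
}
print(entropy_mixture(samples))
```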