Large language models exhibit exceptional generalization capabilities, largely attributable to training on diversely sourced data. However, conventional practice in integrating such diverse data relies heavily on heuristic schemes and lacks theoretical guidance. This work addresses that limitation by investigating low-cost proxies for data-mixing strategies, with the aim of streamlining data curation and improving training efficiency. Specifically, we propose a unified scaling law, termed $\textbf{BiMix}$, that accurately models the bivariate scaling behavior of both data quantity and mixing proportions. We conduct systematic experiments and provide empirical evidence for the predictive power and fundamental principles of $\textbf{BiMix}$. Notably, our findings reveal that entropy-driven, training-free data mixtures can achieve performance comparable to or better than more resource-intensive methods. We hope these quantitative insights will inform further judicious research and development in cost-effective language modeling.
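To make the idea of an entropy-driven, training-free data mixture concrete, the following is a minimal, hypothetical sketch: it estimates a unigram entropy proxy per domain from sample text and normalizes those estimates into mixing proportions. The function names (`token_entropy`, `entropy_mixture`), the choice of unigram entropy, and the normalization scheme are illustrative assumptions, not the specific proxy or procedure used in this work.

```python
# Hypothetical sketch of an entropy-based, training-free mixing heuristic.
# The entropy proxy and normalization here are assumptions for illustration.
from collections import Counter
import math


def token_entropy(tokens):
    """Shannon entropy (bits/token) of the empirical unigram distribution."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def entropy_mixture(domain_samples):
    """Normalize per-domain entropy estimates into mixing proportions."""
    entropies = {d: token_entropy(toks) for d, toks in domain_samples.items()}
    z = sum(entropies.values())
    return {d: h / z for d, h in entropies.items()}


# Toy usage with whitespace "tokens"; a real pipeline would use a trained
# tokenizer and much larger samples per domain.
samples = {
    "web": "the cat sat on the mat the cat".split(),
    "code": "def f ( x ) : return x + x".split(),
    "books": "it was the best of times it was the worst of times".split(),
}
print(entropy_mixture(samples))
```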