Data mixing -- determining the ratios of data from different domains -- is a first-order concern for training language models (LMs). While existing mixing methods show promise, they fall short when applied during real-world LM development. We present Olmix, a framework that addresses two such challenges. First, the configuration space for developing a mixing method is not well understood -- design choices across existing methods lack justification or consensus and overlook practical issues like data constraints. We conduct a comprehensive empirical study of this space, identifying which design choices lead to a strong mixing method. Second, in practice, the domain set evolves throughout LM development as datasets are added, removed, partitioned, and revised -- a problem setting largely unaddressed by existing works, which assume fixed domains. We study how to efficiently recompute the mixture after the domain set is updated, leveraging information from past mixtures. We introduce mixture reuse, a mechanism that reuses existing ratios and recomputes ratios only for domains affected by the update. Over a sequence of five domain-set updates mirroring real-world LM development, mixture reuse matches the performance of fully recomputing the mixture after each update while using 74% less compute, and improves over training without mixing by 11.6% on downstream tasks.
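The core idea of mixture reuse can be sketched as follows. This is a minimal illustration, not the paper's implementation: we assume mixtures are represented as domain-to-ratio dictionaries, that the ratios for domains unaffected by an update are carried over unchanged, and that some external procedure (not shown) supplies recomputed ratios for the affected domains before renormalization.

```python
def reuse_mixture(old_ratios, affected_domains, recomputed_ratios):
    """Sketch of mixture reuse after a domain-set update.

    old_ratios: {domain: ratio} from the previous mixture (sums to 1).
    affected_domains: domains added, removed, partitioned, or revised.
    recomputed_ratios: freshly computed ratios for the affected domains
        (hypothetical output of a mixing method run only on those domains).
    """
    # Carry over existing ratios for domains the update did not touch.
    merged = {d: r for d, r in old_ratios.items() if d not in affected_domains}
    # Add recomputed ratios for the affected domains only.
    merged.update(recomputed_ratios)
    # Renormalize so the mixture sums to 1 again.
    total = sum(merged.values())
    return {d: r / total for d, r in merged.items()}
```

For example, if a "code" domain is partitioned into two new domains, only those domains' ratios are recomputed; the relative proportions among the untouched domains are preserved after renormalization, which is what avoids a full recomputation.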