Optimizing pretraining data composition is pivotal for LLM generalization. While dynamic mixing outperforms static strategies by capturing evolving training dynamics, current methods fail to reconcile computational efficiency with sample efficiency and structural flexibility for diverse pipelines. We introduce Actor-Critic Online Data Mixing (AC-ODM), which approaches data mixing from a reinforcement learning perspective, using a parameterized policy that we theoretically prove acts as a dynamic linear surrogate maximizing the constructive interference of gradients. For practical flexibility, AC-ODM supports two operational modes: (i) a proxy mode for fixed, pre-prepared corpora, where a policy learned on a small model is transferred to a larger target model; and (ii) a non-proxy mode for direct end-to-end training from scratch without priors. Empirically, AC-ODM significantly outperforms prior methods in convergence speed and downstream accuracy across various architectures. On Pythia-1B, it reaches optimal validation perplexity using up to 66% fewer training steps than competitive baselines and delivers a 27.5% relative improvement in MMLU accuracy and a 2.23× higher pass@1 on HumanEval, while incurring only a 0.4% per-step wall-clock increase and 2% additional memory overhead. Code is available at https://github.com/DANG-ai/AC-ODM.