Curriculum learning, the practice of ordering training data from easy to hard, has improved efficiency across machine learning domains, yet it remains underexplored for language model pretraining. We present the first systematic investigation of curriculum learning in LLM pretraining, training over 200 models on up to 100B tokens across three strategies: vanilla curriculum learning, pacing-based sampling, and interleaved curricula, each guided by six difficulty metrics spanning linguistic and information-theoretic properties. We evaluate performance on eight benchmarks under three realistic scenarios: limited data, unlimited data, and continual training. Our experiments show that curriculum learning consistently accelerates convergence in the early and mid phases of training, reducing the training steps needed to reach baseline performance by $18$-$45\%$. When applied as a warmup strategy before standard random sampling, curriculum learning yields sustained improvements of up to $3.5\%$. We identify compression ratio, lexical diversity (MTLD), and readability (Flesch Reading Ease) as the most effective difficulty signals. Our findings demonstrate that data ordering, which is orthogonal to existing data selection methods, provides a practical mechanism for more efficient LLM pretraining.
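The abstract names compression ratio as one of the most effective difficulty signals. As a minimal sketch of how such a signal could drive easy-to-hard ordering, the snippet below scores each document by its zlib-compressed size over its raw size and sorts on that score; the zlib choice and the easy-means-more-compressible convention are illustrative assumptions, not the paper's exact procedure.

```python
import zlib

def compression_ratio(text: str) -> float:
    """Compressed size divided by raw size. Repetitive (redundant) text
    compresses well and gets a low ratio; denser text gets a high ratio.
    zlib is an illustrative compressor choice."""
    raw = text.encode("utf-8")
    return len(zlib.compress(raw)) / len(raw)

easy = "the cat sat on the mat. " * 20  # highly repetitive, compresses well
hard = "Quantized gauge anomalies obstruct naive lattice discretizations."

# Curriculum ordering under this signal: lower ratio (more redundant,
# treated as easier here) comes first.
ordered = sorted([hard, easy], key=compression_ratio)
```

A real pipeline would compute such scores once per document over the pretraining corpus, then feed the sorted (or pacing-sampled) stream to the trainer.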