Curriculum learning, the practice of ordering training data from easy to hard, has improved efficiency across machine learning domains, yet it remains underexplored for language model pretraining. We present the first systematic investigation of curriculum learning in LLM pretraining, training over 200 models on up to 100B tokens across three strategies: vanilla curriculum learning, pacing-based sampling, and interleaved curricula, each guided by six difficulty metrics spanning linguistic and information-theoretic properties. We evaluate performance on eight benchmarks under three realistic scenarios: limited data, unlimited data, and continual training. Our experiments show that curriculum learning consistently accelerates convergence in the early and mid phases of training, reducing the training steps needed to reach baseline performance by $18$-$45\%$. When applied as a warmup strategy before standard random sampling, curriculum learning yields sustained improvements of up to $3.5\%$. We identify compression ratio, lexical diversity (MTLD), and readability (Flesch Reading Ease) as the most effective difficulty signals. Our findings demonstrate that data ordering, which is orthogonal to existing data selection methods, provides a practical mechanism for more efficient LLM pretraining.
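The abstract names compression ratio as one of the most effective difficulty signals. As a minimal sketch of how such a signal could drive easy-to-hard ordering, the snippet below scores each document by its zlib-compressed size over its raw size and sorts on that score; the zlib choice and the easy-means-more-compressible convention are illustrative assumptions, not the paper's exact procedure.

```python
import zlib

def compression_ratio(text: str) -> float:
    """Compressed size divided by raw size. Repetitive (redundant) text
    compresses well and gets a low ratio; denser text gets a high ratio.
    zlib is an illustrative compressor choice."""
    raw = text.encode("utf-8")
    return len(zlib.compress(raw)) / len(raw)

easy = "the cat sat on the mat. " * 20  # highly repetitive, compresses well
hard = "Quantized gauge anomalies obstruct naive lattice discretizations."

# Curriculum ordering under this signal: lower ratio (more redundant,
# treated as easier here) comes first.
ordered = sorted([hard, easy], key=compression_ratio)
```

A real pipeline would compute such scores once per document over the pretraining corpus, then feed the sorted (or pacing-sampled) stream to the trainer.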