Data curricula have become central to successful LLM training, yet the principles governing optimal data placement remain unclear. We introduce the *training re-evaluation curve (TREC)*, a diagnostic that retrospectively evaluates training batches *using the final model weights*. The TREC characterizes how well a trained model retains training data as a function of *when* that data was encountered during training. Analyzing TRECs for models from 111M to 3.9B parameters, we show that placing high-quality data at low points on the TREC significantly improves performance. Importantly, while a TREC is directly observable only after training, we demonstrate that it can be *predicted in advance* from AdamW's implicit EMA coefficients, enabling proactive curriculum design. By predicting TRECs for published training recipes, we explain prior ablations and reveal suboptimal data placements. Finally, we align high-quality data with TREC minima to improve continual pre-training of a 3.9B-parameter LLM trained on 900B tokens.