Scaling training compute, measured in FLOPs, has long been shown to improve the accuracy of large language models, yet training remains resource-intensive. Prior work shows that increasing test-time compute (TTC), for example through iterative sampling, allows smaller models to rival or surpass much larger ones at lower overall cost. We introduce TTC-aware training, in which an intermediate checkpoint paired with a suitable TTC configuration can match or exceed the accuracy of a fully trained model while requiring substantially fewer training FLOPs. Building on this insight, we propose an early stopping algorithm that jointly selects a checkpoint and a TTC configuration to minimize training compute without sacrificing accuracy. To make this practical, we develop an efficient TTC evaluation method that avoids exhaustive search, and we formalize a break-even bound that identifies when increased inference compute compensates for reduced training compute. Experiments demonstrate up to 92\% reductions in training FLOPs while maintaining, and in some cases substantially improving, accuracy. These results offer a new perspective on balancing training and inference compute in model development, enabling faster deployment cycles and more frequent model refreshes. Code will be publicly released.
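To make the joint selection concrete, the following is a minimal sketch of an early stopping loop over checkpoints and TTC configurations. All names here (`evaluate`, `train_flops`, `infer_flops`, the greedy scan order) are hypothetical illustrations, not the paper's algorithm; in particular, the abstract's efficient TTC evaluation method is not specified, so this sketch simply assumes an accuracy oracle.

```python
def select_checkpoint_and_ttc(checkpoints, ttc_configs, evaluate,
                              target_accuracy, train_flops, infer_flops):
    """Sketch of TTC-aware early stopping (hypothetical interface).

    Scans checkpoints in training order and, for each, searches TTC
    configurations (e.g., number of sampled candidates) for the cheapest
    one whose validation accuracy reaches the target. Returning the first
    successful pair minimizes training FLOPs by construction.
    """
    for step in checkpoints:  # assumed sorted by ascending training FLOPs
        # Try configurations from cheapest to most expensive inference cost.
        for cfg in sorted(ttc_configs, key=lambda c: infer_flops(step, c)):
            acc = evaluate(step, cfg)  # validation accuracy under this TTC config
            if acc >= target_accuracy:
                return {
                    "checkpoint": step,
                    "ttc_config": cfg,
                    "accuracy": acc,
                    "train_flops": train_flops(step),
                    "infer_flops_per_query": infer_flops(step, cfg),
                }
    return None  # no (checkpoint, config) pair reaches the target
```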
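One plausible formalization of the break-even bound, written in hypothetical notation since the abstract does not state the paper's exact formula: stopping early at checkpoint \( t \) with TTC configuration \( k \) is compute-favorable relative to the fully trained model \( (T, k_0) \) when the training FLOPs saved cover the extra inference FLOPs over the deployment window.

\[
  \underbrace{C_{\mathrm{tr}}(T) - C_{\mathrm{tr}}(t)}_{\text{training FLOPs saved}}
  \;\ge\;
  Q \cdot \underbrace{\bigl(c_{\mathrm{inf}}(t, k) - c_{\mathrm{inf}}(T, k_0)\bigr)}_{\text{extra inference FLOPs per query}},
\]

subject to the accuracy constraint \( \mathrm{acc}(t, k) \ge \mathrm{acc}(T, k_0) \), where \( C_{\mathrm{tr}}(t) \) is cumulative training FLOPs at checkpoint \( t \), \( c_{\mathrm{inf}}(t, k) \) is per-query inference FLOPs under configuration \( k \), and \( Q \) is the expected number of queries served. Note that the bound depends on \( Q \): the longer a model is served, the less room there is for inference compute to substitute for training compute.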