Standard training pipelines for large language models (LLMs) are typically unidirectional, progressing from pre-training to post-training. However, the potential for a bidirectional process, in which insights from post-training retroactively improve the pre-trained foundation, remains unexplored. We aim to establish a self-reinforcing flywheel: a cycle in which a reinforcement learning (RL)-tuned model strengthens the base model, which in turn enhances subsequent post-training performance, without requiring a specially trained teacher or reference model. To realize this, we analyze training dynamics and identify the mid-training (annealing) phase as a critical turning point for model capabilities. This phase typically occurs at the end of pre-training and uses high-quality corpora under a rapidly decaying learning rate. Building on this insight, we introduce ReMiT (Reinforcement Learning-Guided Mid-Training). Specifically, ReMiT leverages the reasoning priors of RL-tuned models to dynamically reweight tokens during the mid-training phase, prioritizing those pivotal for reasoning. Empirically, ReMiT achieves an average improvement of 3% on 10 pre-training benchmarks spanning math, code, and general reasoning, and sustains gains of over 2% throughout the post-training pipeline. These results validate an iterative feedback loop that enables continuous, self-reinforcing evolution of LLMs.
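The token-reweighting idea can be illustrated with a minimal sketch. The abstract does not specify how the weights are computed, so everything here is an assumption made for illustration: the rule of upweighting tokens that the RL-tuned model finds more likely than the base model, the `temperature` parameter, and the function names `token_weights` and `reweighted_nll` are all hypothetical and should not be read as ReMiT's actual formulation.

```python
import math


def token_weights(base_logprobs, rl_logprobs, temperature=1.0):
    """Hypothetical weighting rule (an assumption, not taken from the paper).

    Tokens that the RL-tuned model scores higher than the base model get
    larger weights. Weights are a softmax over the log-prob gaps, rescaled
    so that their mean is 1.
    """
    if len(base_logprobs) != len(rl_logprobs):
        raise ValueError("log-prob sequences must have the same length")
    gaps = [(r - b) / temperature for b, r in zip(base_logprobs, rl_logprobs)]
    peak = max(gaps)  # subtract the max for numerical stability
    exps = [math.exp(g - peak) for g in gaps]
    total = sum(exps)
    n = len(exps)
    return [n * e / total for e in exps]


def reweighted_nll(base_logprobs, weights):
    """Token-weighted negative log-likelihood, averaged over tokens."""
    weighted = sum(w * lp for w, lp in zip(weights, base_logprobs))
    return -weighted / len(base_logprobs)


# Toy per-token log-probabilities of the reference tokens (made-up values).
base = [-2.0, -0.5, -3.0]
rl = [-1.0, -0.5, -1.0]
weights = token_weights(base, rl)
loss = reweighted_nll(base, weights)
```

In this toy case the third token, where the RL-tuned model is most confident relative to the base model, receives the largest weight, and the second token, where the two models agree, receives the smallest. When both models assign identical log-probabilities, every weight is 1 and the loss reduces to the ordinary mean negative log-likelihood.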