Large language models are conventionally trained in stages: pretraining on raw text, followed by post-training for instruction following and reasoning. This separation creates a fundamental limitation: many desirable behaviors, such as safety, factuality, overall generation quality, and reasoning ability, are added only at a late stage, even though the patterns learned earlier strongly shape a model's capabilities. To address this, we introduce a new way to pretrain and mid-train models that incorporates these behaviors earlier. We use an existing strong, post-trained model both to rewrite pretraining data and to judge policy model rollouts, thereby applying reinforcement learning earlier in training. Our experiments show that this can yield strong gains in quality, safety, factuality, and reasoning.