Two major sources of training data exist for post-training modern language models: online data (model-generated rollouts) and offline data (human or other-model demonstrations). These two types of data are typically used by Reinforcement Learning (RL) and Supervised Fine-Tuning (SFT), respectively. In this paper, we show that these approaches are not in contradiction, but are instances of a single optimization process. We derive a Unified Policy Gradient Estimator and show that the update rules of a wide spectrum of post-training approaches can be expressed as gradients of a common objective under different data distribution assumptions and bias-variance tradeoffs. The gradient estimator is constructed from four interchangeable parts: a stabilization mask, a reference policy denominator, an advantage estimate, and a likelihood gradient. Motivated by our theoretical findings, we propose Hybrid Post-Training (HPT), an algorithm that dynamically selects between training signals. HPT is designed to yield both effective exploitation of demonstrations and stable exploration, without sacrificing learned reasoning patterns. We provide extensive experiments and ablation studies to verify the effectiveness of our unified theoretical framework and of HPT. Across six mathematical reasoning benchmarks and two out-of-distribution suites, HPT consistently surpasses strong baselines across models of varying scales and families.
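The four-part decomposition described above can be illustrated with a toy sketch. This is a hypothetical instantiation for intuition only: the function name, numeric values, and the specific way SFT and RL are recovered are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def unified_pg_weight(mask, pi_ref, advantage):
    """Per-token scalar that multiplies grad pi_theta(a|s) in a
    sketched unified estimator: stabilization mask * advantage
    estimate / reference policy denominator (illustrative only)."""
    return mask * advantage / pi_ref

# SFT as a special case (assumed instantiation): reference denominator
# equals the current policy, advantage is 1, mask is 1. The weight
# times grad pi_theta then equals grad log pi_theta, i.e. the usual
# supervised log-likelihood gradient on a demonstrated token.
pi_theta = 0.25   # toy current-policy probability of the demo token
grad_pi = 0.1     # toy value of d pi_theta / d theta
w_sft = unified_pg_weight(mask=1.0, pi_ref=pi_theta, advantage=1.0)
sft_grad = w_sft * grad_pi
assert np.isclose(sft_grad, grad_pi / pi_theta)  # = grad log pi_theta

# RL-style case (assumed instantiation): reference denominator is the
# behavior policy that produced the rollout, advantage comes from an
# estimator, and the mask would zero out unstable (e.g. clipped) tokens.
w_rl = unified_pg_weight(mask=1.0, pi_ref=0.2, advantage=0.8)
```

Swapping the four parts in and out is what lets one expression cover both offline (SFT-like) and online (RL-like) updates in this sketch.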