Existing reinforcement learning (RL)-based post-training methods for large language models have advanced rapidly, yet their design has largely been guided by heuristics rather than systematic theoretical principles. This gap limits our understanding of the properties of the gradient estimators and the associated optimization algorithms, thereby constraining opportunities to improve training stability and overall performance. In this work, we provide a unified theoretical framework that characterizes the statistical properties of commonly used policy-gradient estimators under mild assumptions. Our analysis establishes unbiasedness, derives exact variance expressions, and yields an optimization-loss upper bound that enables principled reasoning about learning dynamics. Building on these results, we prove convergence guarantees and derive an adaptive learning-rate schedule governed by the signal-to-noise ratio (SNR) of the gradients. We further show that the variance-optimal baseline is a gradient-weighted estimator, offering a new principle for variance reduction that improves stability beyond existing methods. These insights motivate Optimal Baseline and Learning-Rate Policy Optimization (OBLR-PO), an algorithm that jointly adapts learning rates and baselines in a theoretically grounded manner. Experiments on Qwen3-4B-Base and Qwen3-8B-Base demonstrate consistent gains over existing policy optimization methods, validating that our theoretical contributions translate into practical improvements in large-scale post-training.
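To make the gradient-weighted baseline concrete, the sketch below illustrates the classical variance-optimal baseline for a score-function (REINFORCE-style) gradient estimator, which weights rewards by squared gradient norms rather than averaging them uniformly. This is a minimal illustration of the general principle, not the paper's OBLR-PO algorithm; the function name and the toy numbers are hypothetical.

```python
import numpy as np

def optimal_baseline(grad_sq_norms, rewards):
    """Gradient-weighted baseline minimizing estimator variance.

    Classical result for score-function estimators:
        b* = E[||grad log pi||^2 * R] / E[||grad log pi||^2]
    (Hedged sketch; not the paper's exact formulation.)
    """
    g2 = np.asarray(grad_sq_norms, dtype=float)
    r = np.asarray(rewards, dtype=float)
    return float(np.sum(g2 * r) / np.sum(g2))

# Toy, hypothetical samples with unequal gradient magnitudes:
# the high-gradient sample pulls the baseline toward its reward.
g2 = [4.0, 1.0, 1.0]
r = [1.0, 0.0, 0.0]

b_star = optimal_baseline(g2, r)  # 4.0 / 6.0, i.e. 2/3
b_mean = float(np.mean(r))        # plain mean-reward baseline, 1/3
```

The gap between `b_star` and `b_mean` shows why a gradient-weighted baseline differs from the common mean-reward (group-average) baseline: samples whose log-probability gradients are large contribute more to the estimator's variance, so the variance-optimal baseline weights their rewards more heavily.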