Policy mirror descent (PMD) provides a principled framework for reinforcement learning (RL) by iteratively solving KL-regularized policy improvement subproblems. While this approach has been adopted in training advanced LLMs such as Kimi K1.5/K2, the ideal closed-form PMD updates require reliable partition function estimation, a significant challenge when working with limited rollouts in the vast action spaces of LLMs. We investigate a practical algorithm, termed PMD-mean, that approximates the log-partition term with the mean reward under the sampling policy and performs regression in log-policy space. Specifically, we characterize the population solution of PMD-mean and demonstrate that it implicitly optimizes mirror descent subproblems with an adaptive mixed KL--$\chi^2$ regularizer. This additional $\chi^2$ regularization constrains large probability changes, producing more conservative updates when expected rewards are low and enhancing robustness against finite-sample estimation errors. Experiments on math reasoning tasks show that PMD-mean achieves superior performance with improved stability and time efficiency. These findings deepen our understanding of PMD-mean and illuminate pathways toward principled improvements in RL algorithms for LLMs. Code is available at https://github.com/horizon-rl/OpenKimi.
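The PMD-mean update described above can be sketched in a tabular toy setting. The ideal PMD step is $\log \pi_{k+1} = \log \pi_k + \eta\, r - \log Z$; PMD-mean replaces the intractable log-partition $\log Z$ with $\eta$ times the mean reward under the sampling policy, then fits the new policy by least-squares regression in log-policy space. The sketch below is an illustrative assumption throughout (function names, hyperparameters, and the plain gradient-descent fitting loop are not from the paper, which targets LLM-scale policies):

```python
import numpy as np

def pmd_mean_update(logits, rewards, eta=1.0, lr=0.1, steps=1000):
    """One PMD-mean iteration on a tabular softmax policy (illustrative sketch).

    Regression target: log pi_old + eta * (r - r_bar), where r_bar (the mean
    reward under the sampling policy) stands in for the log-partition term.
    """
    def log_softmax(z):
        z = z - z.max()
        return z - np.log(np.exp(z).sum())

    old_logp = log_softmax(logits)
    pi_old = np.exp(old_logp)
    r_bar = pi_old @ rewards                      # mean reward under sampling policy
    target = old_logp + eta * (rewards - r_bar)   # centered, baseline-subtracted target

    # Fit new logits by minimizing 0.5 * sum((log pi_theta - target)^2).
    theta = logits.astype(float).copy()
    for _ in range(steps):
        logp = log_softmax(theta)
        resid = logp - target                     # residual in log-policy space
        pi = np.exp(logp)
        # Gradient through log-softmax: d logp_i / d theta_j = delta_ij - pi_j
        grad = resid - pi * resid.sum()
        theta -= lr * grad
    return theta

# Usage: starting from a uniform policy over 3 actions, rewarding action 0
# shifts probability mass toward it after one PMD-mean iteration.
new_logits = pmd_mean_update(np.zeros(3), np.array([1.0, 0.0, 0.0]))
```

Because the target subtracts the mean reward, below-average actions get negative log-probability shifts while the update stays bounded, which is the conservative behavior the abstract attributes to the implicit $\chi^2$ term.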