Policy gradient algorithms have been successfully applied to enhance the reasoning capabilities of large language models (LLMs). KL regularization is ubiquitous, yet its design surface (the choice of KL direction, forward vs. reverse; normalization, normalized vs. unnormalized; and estimator, $k_1/k_2/k_3$) is scattered across the literature and often intertwined with off-policy estimation. We ask a focused question: in the off-policy setting, what weighting does each KL variant require so that the surrogate we optimize yields the exact gradient of the intended KL-regularized objective? We answer this with a compact, unified derivation we call the Regularized Policy Gradient (RPG) view. RPG (i) unifies normalized and unnormalized KL variants and shows that the widely used $k_3$ penalty is exactly the unnormalized KL; (ii) specifies conditions under which REINFORCE-style losses with stop-gradient are gradient-equivalent to fully differentiable surrogates; (iii) identifies and corrects an off-policy importance-weighting mismatch in GRPO's KL term; and (iv) introduces RPG-Style Clip, a clipped-importance-sampling step within RPG-REINFORCE that enables stable off-policy policy-gradient training at scale. On mathematical reasoning benchmarks (AIME24, AIME25), RPG-REINFORCE with RPG-Style Clip improves accuracy by up to $+6$ absolute percentage points over DAPO. Extending to an 8K context length, RPG-REINFORCE with RPG-Style Clip achieves 52% accuracy on AIME25, surpassing the official Qwen3-4B-Instruct model (47%). Notably, RPG is a stable and scalable RL algorithm for LLM reasoning, realized via (a) a KL-correct objective, (b) clipped importance sampling, and (c) an iterative reference-policy update scheme. Project Page: https://github.com/complex-reasoning/RPG.
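To make claims (i) and (iii) concrete, here is a worked sketch in our own notation (not necessarily the paper's): write $p = \pi_\theta$, $q = \pi_{\mathrm{ref}}$, and $r(y) = q(y)/p(y)$. The standard $k_3$ estimator, $k_3(y) = r(y) - 1 - \log r(y)$, satisfies

$$\mathbb{E}_{y \sim p}\!\left[\, r(y) - 1 - \log r(y) \,\right] \;=\; \sum_y q(y) \;-\; \sum_y p(y) \;+\; \sum_y p(y) \log \frac{p(y)}{q(y)},$$

which is the unnormalized (generalized) KL divergence and coincides with $D_{\mathrm{KL}}(p\,\|\,q)$ when both measures are normalized, since then $\sum_y q(y) - \sum_y p(y) = 0$. For claim (iii), the likelihood-ratio identity

$$\nabla_\theta\, \mathbb{E}_{y \sim \pi_\theta}\!\left[ f_\theta(y) \right] \;=\; \mathbb{E}_{y \sim \pi_{\mathrm{old}}}\!\left[ \frac{\pi_\theta(y)}{\pi_{\mathrm{old}}(y)} \Big( f_\theta(y)\, \nabla_\theta \log \pi_\theta(y) + \nabla_\theta f_\theta(y) \Big) \right]$$

shows that, off-policy, a per-sample KL penalty inside $f_\theta$ must itself carry the importance weight $\pi_\theta/\pi_{\mathrm{old}}$; dropping that weight on the KL term is the kind of mismatch the abstract refers to.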
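The abstract does not spell out the RPG-Style Clip update, so the following is a minimal, hypothetical PyTorch sketch: it assumes a symmetric PPO-style clip on the behavior-policy ratio, applied as a stop-gradient weight in a token-level REINFORCE loss, with the same weight on a $k_3$ penalty to the reference policy (per claim (iii)). All names (`rpg_reinforce_loss`, `eps`, `beta`) are illustrative, not the paper's.

```python
import torch

def rpg_reinforce_loss(logp, logp_old, logp_ref, adv, eps=0.2, beta=0.01):
    """Hypothetical sketch of a clipped-importance-sampling REINFORCE loss.

    logp:     log pi_theta(y_t) for sampled tokens (requires grad)
    logp_old: log-probs under the behavior (old) policy (no grad)
    logp_ref: log-probs under the reference policy (no grad)
    adv:      per-token advantage estimates (no grad)
    """
    # Importance ratio pi_theta / pi_old, clipped and detached so it acts
    # purely as a stop-gradient weight on the surrogate terms.
    ratio = torch.exp(logp - logp_old)
    w = torch.clamp(ratio, 1.0 - eps, 1.0 + eps).detach()

    # REINFORCE surrogate: gradient is w * A * grad log pi_theta.
    pg_loss = -(w * adv * logp).mean()

    # k3 penalty, r - 1 - log r with r = pi_ref / pi_theta, also carrying
    # the clipped importance weight for off-policy correctness.
    r = torch.exp(logp_ref - logp)
    kl_loss = (w * (r - 1.0 - torch.log(r))).mean()

    return pg_loss + beta * kl_loss

# Toy usage with random per-token tensors standing in for model outputs.
T = 16
logp = torch.randn(T, requires_grad=True)
with torch.no_grad():
    logp_old = logp + 0.1 * torch.randn(T)
    logp_ref = logp + 0.1 * torch.randn(T)
adv = torch.randn(T)
loss = rpg_reinforce_loss(logp, logp_old, logp_ref, adv)
loss.backward()
```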