Reinforcement learning (RL) has become a key component of post-training large language models (LLMs). In practice, LLM RL is often off-policy because of training-inference mismatch and policy staleness, making trust-region control essential for stable optimization. Mainstream methods such as PPO and GRPO approximate this control with a ratio-clipping mechanism, but the importance ratio can be a poor proxy for distributional shift in long-tailed vocabularies. Recent work such as DPPO addresses this mismatch by replacing ratio-based clipping with a divergence-based mask, yielding a trust region defined by the sampled token's absolute probability shift. However, DPPO still relies on a hard mask: once a token crosses the trust-region boundary in a harmful direction, its gradient is discarded rather than corrected. To address this, we propose Divergence Regularized Policy Optimization (DRPO), which replaces the hard mask with a smooth advantage-weighted quadratic regularizer on policy shift. DRPO preserves the same trust-region geometry as DPPO while inducing bounded, continuous gradient weights that attenuate diverging updates and provide corrective signals beyond the boundary. Experiments across model scales, architectures, and precision settings show that DRPO improves the stability and efficiency of LLM RL training.
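To make the contrast between the three mechanisms concrete, the following is a minimal sketch for a single sampled token. It is not the authors' implementation. The abstract does not state the DRPO objective, so the quadratic form used here, A·d − |A|·d²/(2δ) with d = p_new − p_old, is an assumption chosen only because its gradient weight is continuous, bounded, zero at the boundary |d| = δ, and negative (corrective) beyond it. The function names and the thresholds `eps` and `delta` are likewise hypothetical.

```python
def ratio_clip_mask(p_new, p_old, advantage, eps=0.2):
    """PPO/GRPO-style gate: drop the gradient once the importance ratio
    leaves [1 - eps, 1 + eps] in the direction the advantage pushes."""
    ratio = p_new / p_old
    if advantage > 0 and ratio > 1 + eps:
        return 0.0
    if advantage < 0 and ratio < 1 - eps:
        return 0.0
    return 1.0


def divergence_mask(p_new, p_old, advantage, delta=0.1):
    """Hard mask on the sampled token's absolute probability shift
    (in the spirit of the DPPO description; delta is a hypothetical threshold)."""
    shift = p_new - p_old
    if advantage > 0 and shift > delta:
        return 0.0
    if advantage < 0 and shift < -delta:
        return 0.0
    return 1.0


def quadratic_weight(p_new, p_old, advantage, delta=0.1):
    """Gradient weight of the ASSUMED smooth surrogate
        A * d - |A| * d**2 / (2 * delta),   d = p_new - p_old.
    Differentiating w.r.t. d gives A * (1 - sign(A) * d / delta), so the
    multiplier on A is 1 at d = 0, 0 at the boundary, and negative beyond it."""
    sign = 1.0 if advantage >= 0 else -1.0
    shift = p_new - p_old
    return 1.0 - sign * shift / delta


if __name__ == "__main__":
    cases = [
        ("rare token, large ratio, tiny shift", 0.005, 0.001),
        ("common token, small ratio, large shift", 0.71, 0.60),
    ]
    for name, p_new, p_old in cases:
        print(name)
        print("  ratio clip mask :", ratio_clip_mask(p_new, p_old, 1.0))
        print("  divergence mask :", divergence_mask(p_new, p_old, 1.0))
        print("  quadratic weight:", round(quadratic_weight(p_new, p_old, 1.0), 4))
```

In the first case the ratio is about 5, so ratio clipping discards the token even though its probability moved by only 0.004. In the second case the ratio stays inside the clip range while the probability moves by 0.11, which crosses the assumed boundary. There the hard mask returns zero, whereas the assumed quadratic weight is slightly negative and pulls the update back toward the trust region.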