Reinforcement Learning with Verifiable Rewards (RLVR) offers a promising framework for optimizing large language models on reasoning tasks. However, existing RLVR algorithms operate at different granularities and have complementary strengths and limitations. Group Relative Policy Optimization (GRPO) updates the policy with token-level importance ratios, which preserves fine-grained credit assignment but often suffers from high variance and instability. In contrast, Group Sequence Policy Optimization (GSPO) applies a single sequence-level importance ratio across all tokens in a response, which better matches sequence-level rewards but sacrifices token-wise credit assignment. In this paper, we propose Dynamic Hybrid Policy Optimization (DHPO), which bridges GRPO and GSPO within a single clipped surrogate objective. DHPO combines token-level and sequence-level importance ratios through a weighting mechanism, and we explore two mixing variants: averaged mixing and entropy-guided mixing. To further stabilize training, we employ a branch-specific clipping strategy that constrains the token-level and sequence-level ratios within separate trust regions before mixing, preventing outliers in either branch from dominating the update. Across seven challenging mathematical reasoning benchmarks, experiments on both dense and MoE models from the Qwen3 series show that DHPO consistently outperforms GRPO and GSPO. We will release our code upon acceptance of this paper.
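
To make the hybrid ratio and branch-specific clipping concrete, the following is a minimal, hypothetical sketch of how such an objective could be computed; it is not the authors' implementation. The hyperparameter names (eps_tok, eps_seq, alpha), the length-normalized sequence-level ratio, the sigmoid-based entropy-guided weighting, and the PPO-style pessimistic minimum are illustrative assumptions rather than details taken from the paper, and the group-relative advantage computation is assumed to happen upstream as in GRPO.

```python
# Hypothetical sketch of a DHPO-style mixed importance ratio (not the authors' code).
# Names such as eps_tok, eps_seq, and alpha are illustrative assumptions.
import torch


def dhpo_surrogate(logp_new, logp_old, advantages, token_entropy=None,
                   eps_tok=0.2, eps_seq=0.1, alpha=0.5):
    """Clipped surrogate mixing token-level (GRPO-style) and
    sequence-level (GSPO-style) importance ratios.

    logp_new, logp_old: [B, T] per-token log-probabilities (old policy detached)
    advantages:         [B, T] group-relative advantages broadcast per token
    token_entropy:      [B, T] optional per-token entropies for entropy-guided mixing
    """
    log_ratio = logp_new - logp_old                       # [B, T]
    r_tok = torch.exp(log_ratio)                          # token-level ratio
    # Length-normalized sequence-level ratio, broadcast back to every token
    r_seq = torch.exp(log_ratio.mean(dim=-1, keepdim=True)).expand_as(r_tok)

    # Branch-specific clipping: each ratio is confined to its own trust region
    r_tok_c = torch.clamp(r_tok, 1 - eps_tok, 1 + eps_tok)
    r_seq_c = torch.clamp(r_seq, 1 - eps_seq, 1 + eps_seq)

    if token_entropy is None:
        w = alpha                                          # averaged mixing
    else:
        # Entropy-guided mixing: higher-entropy tokens lean on the token branch
        w = torch.sigmoid(token_entropy - token_entropy.mean(dim=-1, keepdim=True))

    r_mix_unclipped = w * r_tok + (1 - w) * r_seq
    r_mix_clipped = w * r_tok_c + (1 - w) * r_seq_c

    # PPO-style pessimism between the unclipped and clipped mixed terms (an assumption)
    surrogate = torch.minimum(r_mix_unclipped * advantages,
                              r_mix_clipped * advantages)
    return -surrogate.mean()                               # loss to minimize
```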

