As a key component of large language model (LLM) post-training, Reinforcement Learning from Verifiable Rewards (RLVR) has substantially improved reasoning performance. However, existing RLVR algorithms exhibit distinct stability issues: GRPO (Group Relative Policy Optimization) often suffers from unstable policy updates, while GSPO (Group Sequence Policy Optimization) can retain high-variance tokens. In GRPO, the importance ratio is computed at the token level, which overemphasizes individual tokens and makes learning sensitive to outliers, potentially causing training collapse. GSPO instead computes a response-level importance ratio, which lowers variance and reduces the accumulation of token-level noise present in GRPO. Nevertheless, our experiments show that GSPO frequently yields a near-zero clipping fraction: extreme token-level ratios can be diluted by other tokens in the same response, so the entire response is retained and updates become unstable. We propose SSPO, which computes importance ratios at the subsentence level, striking a balance between GRPO and GSPO. SSPO alleviates training collapse and excessive variance while avoiding the failure mode in which the clipping mechanism indiscriminately retains entire responses. Moreover, we incorporate subsentence-level entropy into PPO-CLIP to adaptively adjust the clipping bounds: we encourage exploration for high-entropy tokens while tightening the clipping range for low-entropy tokens. Empirically, SSPO achieves an average score of 46.72 across five datasets on the Qwen2.5-1.5B-Math model, outperforming GRPO (43.01) and GSPO (44.42), and attains state-of-the-art results on four datasets. On the Qwen2.5-7B-Math model, SSPO also achieves the highest average score when compared against five baseline methods. These results demonstrate SSPO's effectiveness in RLVR.
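The dilution effect described above can be illustrated with a small numerical sketch. It is not the authors' implementation: the fixed four-token segments, the length-normalized (geometric-mean) aggregation within each segment, the clipping threshold `eps = 0.2`, and all function names are assumptions chosen for illustration. The clipping check is also simplified to flag any ratio outside `[1 - eps, 1 + eps]`, ignoring the sign of the advantage.

```python
import math

def token_ratios(logp_new, logp_old):
    """Token-level importance ratios (GRPO-style): one ratio per token."""
    return [math.exp(n - o) for n, o in zip(logp_new, logp_old)]

def segment_ratios(logp_new, logp_old, boundaries):
    """Length-normalized (geometric-mean) ratio for each segment.

    boundaries: list of (start, end) token index pairs, end exclusive.
    One segment spanning the whole response gives a response-level ratio;
    several shorter segments give hypothetical subsentence-level ratios.
    """
    ratios = []
    for start, end in boundaries:
        diff = sum(n - o for n, o in zip(logp_new[start:end], logp_old[start:end]))
        ratios.append(math.exp(diff / (end - start)))
    return ratios

def is_clipped(ratios, eps=0.2):
    """Simplified check: True where the ratio leaves [1 - eps, 1 + eps]."""
    return [not (1 - eps <= r <= 1 + eps) for r in ratios]

# Toy response of 12 tokens: the policy is unchanged everywhere except
# token 5, whose log-probability rises by 1.2 (token ratio of about 3.3).
logp_old = [-1.0] * 12
logp_new = list(logp_old)
logp_new[5] += 1.2

tok = token_ratios(logp_new, logp_old)
resp = segment_ratios(logp_new, logp_old, [(0, 12)])
sub = segment_ratios(logp_new, logp_old, [(0, 4), (4, 8), (8, 12)])

print("token-level clipped:      ", is_clipped(tok))
print("response-level clipped:   ", is_clipped(resp))
print("subsentence-level clipped:", is_clipped(sub))
```

In this toy case the response-level ratio is about 1.11, so the whole response passes the clipping check even though it contains an extreme token, whereas the assumed subsentence-level ratio for the middle segment is about 1.35 and is clipped on its own while the other two segments are kept.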