Reinforcement learning (RL) has proven effective in strengthening the reasoning capabilities of large language models (LLMs). A widely adopted method, Group Relative Policy Optimization (GRPO), has shown strong empirical results in training recent reasoning models, but it fails to update the policy when all responses within a group are incorrect (i.e., all-negative-sample groups). This limitation highlights a gap between artificial and human intelligence: unlike humans, who can learn from mistakes, GRPO discards these failure signals. We introduce Stepwise Guided Policy Optimization (SGPO), a simple framework that mitigates the all-negative-sample issue by injecting response diversity within groups using a stepwise judge model, which can be trained directly or adapted from existing LLMs. In a simplified setting, we prove that this diversification accelerates GRPO's learning dynamics. We then empirically validate SGPO across model sizes (7B, 14B, 32B), covering both base and distilled variants, in offline and online training on nine reasoning benchmarks. Overall, SGPO improves average performance and is most effective in early and mid-training, when all-negative groups are prevalent; improvements are not uniform across benchmarks and depend on the structure and informativeness of the negative samples. Finally, SGPO does not require the judge model to generate correct solutions, which distinguishes it from knowledge distillation methods.
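The all-negative-sample failure mode follows directly from GRPO's group-relative advantage: when every reward in a group is identical (e.g., all zero), the normalized advantages vanish and the group contributes no gradient. A minimal sketch, assuming the standard group-relative normalization A_i = (r_i - mean(r)) / (std(r) + eps); the function names are illustrative, not from the paper:

```python
# Sketch of GRPO's group-relative advantage computation (assumption:
# standard normalization; not the paper's exact implementation).
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-8):
    """Normalize per-response rewards within a sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# A mixed group (some correct, some incorrect) yields nonzero advantages,
# so the policy gradient is informative.
mixed = grpo_advantages([1.0, 0.0, 0.0, 1.0])

# An all-negative group (every response wrong, reward 0) yields all-zero
# advantages: the group produces no policy update, and the failure
# signals are discarded.
all_negative = grpo_advantages([0.0, 0.0, 0.0, 0.0])
print(all_negative)  # [0.0, 0.0, 0.0, 0.0]
```

SGPO's diversification can be read as breaking this degeneracy: by using a judge model to differentiate responses within an otherwise all-negative group, the rewards are no longer identical and the group regains a nonzero gradient.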