Reinforcement Learning with Verifiable Rewards (RLVR) is a powerful framework for improving the reasoning abilities of Large Language Models (LLMs). However, current methods such as GRPO learn only from problems where the model's responses to the same input differ in correctness, and ignore those where all responses receive the same reward (so-called zero-variance prompts). In this work, we argue that such prompts are not useless but can, in fact, provide meaningful feedback for policy optimization. To this end, we introduce Reinforcement Learning with Zero-Variance Prompts (RL-ZVP), a novel algorithm that extracts learning signals from zero-variance prompts. RL-ZVP directly rewards correctness and penalizes errors even without contrasting responses, modulating feedback with token-level characteristics to preserve informative, nuanced signals. Across six math reasoning benchmarks, RL-ZVP achieves significant improvements of up to 8.61 points in accuracy and 7.77 points in pass rate over GRPO, while consistently outperforming other baselines that filter out zero-variance prompts. These results highlight the untapped potential of learning from zero-variance prompts in RLVR. The project page is available at https://bltnynk.github.io/publications/rl-zvp/.
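To illustrate why zero-variance prompts contribute no learning signal under group-relative normalization, the sketch below computes advantages as (reward - group mean) / (group std + eps), the normalization commonly associated with GRPO. It assumes binary correctness rewards; the function names and the `eps` value are illustrative choices, not taken from the paper, and the sketch does not reproduce RL-ZVP's own advantage formulation, which the abstract does not specify.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r - mean) / (std + eps).

    `eps` is an assumed small constant that avoids division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def is_zero_variance(rewards):
    """True when every response in the group received the same reward."""
    return len(set(rewards)) == 1


# Three hypothetical groups of four sampled responses to one prompt each.
mixed = [1.0, 0.0, 1.0, 0.0]        # responses differ in correctness
all_correct = [1.0, 1.0, 1.0, 1.0]  # zero-variance prompt (all right)
all_wrong = [0.0, 0.0, 0.0, 0.0]    # zero-variance prompt (all wrong)

for name, rewards in [("mixed", mixed),
                      ("all_correct", all_correct),
                      ("all_wrong", all_wrong)]:
    print(name, is_zero_variance(rewards), group_advantages(rewards))
```

For the two zero-variance groups every advantage is exactly zero, so a policy-gradient update weighted by these advantages receives no contribution from those prompts. This is the gap that RL-ZVP is described as addressing.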