Reinforcement learning has become a cornerstone for enhancing the reasoning capabilities of Large Language Models, where group-based approaches such as GRPO have emerged as efficient paradigms that optimize the policy by exploiting performance differences within a sampled group of responses. However, these methods typically rely on absolute numerical rewards, which introduces intrinsic limitations. In verifiable tasks, a group whose responses all receive the same score yields no advantage signal, resulting in sparse supervision; in open-ended scenarios, the unstable score ranges of reward models undermine advantage estimates based on the group mean. To address these limitations, we propose Reinforcement Learning with Relative Rewards (RLRR), a framework that shifts reward shaping from absolute scoring to relative ranking. Complementing this framework, we introduce the Ranking Reward Model, a listwise preference model tailored to group-based optimization that directly generates relative rankings over a group of responses. By transforming raw evaluations into robust relative signals, RLRR mitigates both signal sparsity and reward instability. Experimental results demonstrate that RLRR yields consistent performance improvements over standard group-based baselines across reasoning benchmarks and open-ended generation tasks.
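To make the contrast between absolute and relative signals concrete, the sketch below compares a GRPO-style group advantage (mean/std normalization of raw scores) with a rank-based relative advantage. This is only an illustrative sketch, not the paper's exact RLRR formulation: the `rank_advantages` mapping of ranks to a zero-mean signal is an assumption, and a real Ranking Reward Model would rank responses listwise rather than derive ranks from scalar scores.

```python
import numpy as np

def grpo_advantages(rewards):
    """GRPO-style advantage: normalize absolute rewards by the
    group mean and standard deviation."""
    r = np.asarray(rewards, dtype=float)
    std = r.std()
    if std == 0.0:
        # Identical scores across the group: no learning signal
        # (the sparse-supervision failure mode in verifiable tasks).
        return np.zeros_like(r)
    return (r - r.mean()) / std

def rank_advantages(rewards):
    """Illustrative relative-ranking advantage: replace raw scores
    with their within-group ranks, then center them. Ties get the
    average rank. Assumption for illustration only; in RLRR the
    Ranking Reward Model produces the ranking directly, so it can
    order responses that a scalar scorer rates identically."""
    r = np.asarray(rewards, dtype=float)
    order = r.argsort()
    ranks = np.empty_like(r)
    ranks[order] = np.arange(len(r), dtype=float)
    for v in np.unique(r):          # average ranks of tied scores
        mask = r == v
        ranks[mask] = ranks[mask].mean()
    ranks /= max(len(r) - 1, 1)     # map ranks to [0, 1]
    return ranks - ranks.mean()     # zero-mean relative signal

group = np.array([0.2, 0.5, 0.6, 0.9])   # reward-model scores for one group
distorted = np.exp(5 * group)            # monotone distortion of the score range

print(grpo_advantages(group))            # [-1.4, -0.2, 0.2, 1.4]
print(grpo_advantages(distorted))        # skewed: the top score dominates
print(rank_advantages(group))            # [-0.5, -0.167, 0.167, 0.5]
print(rank_advantages(distorted))        # unchanged: depends only on ordering
```

Because ranks are invariant to any monotone transform of the raw scores, the relative signal is unaffected by the score-range drift that distorts mean-based advantages, which is the robustness property the abstract attributes to RLRR.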