Reinforcement Learning with Verifiable Rewards (RLVR) has recently advanced the capabilities of Large Language Models in complex reasoning tasks by providing explicit rule-based supervision. Among RLVR methods, GRPO and its variants have achieved strong empirical performance. Despite their success, we identify that they suffer from Gradient Misassignment in Positives and Gradient Domination in Negatives, which lead to inefficient and suboptimal policy updates. To address these issues, we propose Rewards as Labels (REAL), a novel framework that revisits verifiable rewards as categorical labels rather than scalar weights, thereby reformulating policy optimization as a classification problem. Building on this, we further introduce anchor logits to enhance policy learning. Our analysis reveals that REAL induces a monotonic and bounded gradient weighting, enabling balanced gradient allocation across rollouts and effectively mitigating the identified mismatches. Extensive experiments on mathematical reasoning benchmarks show that REAL improves training stability and consistently outperforms GRPO and strong variants such as DAPO. On the 1.5B model, REAL improves average Pass@1 over DAPO by 6.7%. These gains further scale to the 7B model, where REAL continues to outperform DAPO and GSPO by 6.2% and 1.7%, respectively. Notably, even with a vanilla binary cross-entropy loss, REAL remains stable and exceeds DAPO by 4.5% on average.
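To make the rewards-as-labels reformulation concrete, the following is a minimal sketch (not the paper's implementation) of optimizing a binary cross-entropy classification loss over rollouts, where the verifier's outcome serves as the class label. The function name, the use of a length-normalized sequence log-probability as the logit, and the toy numbers are all illustrative assumptions.

```python
# Minimal sketch, assuming verifiable rewards in {0, 1} are treated as binary
# class labels and the policy's sequence log-probability acts as a logit.
# Names and shapes are illustrative, not taken from the paper.
import torch
import torch.nn.functional as F


def rewards_as_labels_bce(seq_logprobs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """seq_logprobs: per-rollout (e.g., length-normalized) sequence log-probabilities, shape [G].
    rewards: verifier outcomes in {0, 1} used as classification labels, shape [G]."""
    # Correct rollouts (reward = 1) push their probability up; incorrect ones
    # (reward = 0) push it down -- a classification view of the policy update,
    # in contrast to weighting the log-likelihood by a scalar advantage.
    return F.binary_cross_entropy_with_logits(seq_logprobs, rewards.float())


# Toy usage with fabricated numbers, for illustration only.
if __name__ == "__main__":
    logprobs = torch.tensor([-0.7, -2.1, -1.3], requires_grad=True)  # 3 rollouts in a group
    rewards = torch.tensor([1.0, 0.0, 1.0])                          # verifier outcomes
    loss = rewards_as_labels_bce(logprobs, rewards)
    loss.backward()
    print(loss.item(), logprobs.grad)
```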