Reinforcement Learning with Verifiable Rewards (RLVR) has recently advanced the capabilities of Large Language Models in complex reasoning tasks by providing explicit rule-based supervision. Among RLVR methods, GRPO and its variants have achieved strong empirical performance. Despite their success, we identify that they suffer from Gradient Misassignment in Positives and Gradient Domination in Negatives, which lead to inefficient and suboptimal policy updates. To address these issues, we propose Rewards as Labels (REAL), a novel framework that revisits verifiable rewards as categorical labels rather than scalar weights, thereby reformulating policy optimization as a classification problem. Building on this, we further introduce anchor logits to enhance policy learning. Our analysis reveals that REAL induces a monotonic and bounded gradient weighting, enabling balanced gradient allocation across rollouts and effectively mitigating the identified mismatches. Extensive experiments on mathematical reasoning benchmarks show that REAL improves training stability and consistently outperforms GRPO and strong variants such as DAPO. On the 1.5B model, REAL improves average Pass@1 over DAPO by 6.7%. These gains further scale to the 7B model, where REAL continues to outperform DAPO and GSPO by 6.2% and 1.7%, respectively. Notably, even with a vanilla binary cross-entropy loss, REAL remains stable and exceeds DAPO by 4.5% on average.
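To make the "rewards as labels" idea concrete, the sketch below treats each rollout's binary verifiable reward as a classification target and fits the policy's (length-normalized) sequence probability to it with a vanilla binary cross-entropy loss, rather than using the reward as a scalar weight on the policy gradient. This is a minimal illustration under our own assumptions: the function name, the length-normalized aggregation, and the clamping are ours, and the paper's full method additionally uses anchor logits, which are not shown here.

```python
import torch
import torch.nn.functional as F

def rewards_as_labels_bce(token_logprobs: torch.Tensor,
                          rewards: torch.Tensor,
                          mask: torch.Tensor) -> torch.Tensor:
    """Minimal sketch of the rewards-as-labels view (assumed, not the paper's exact loss).

    token_logprobs: (B, T) log pi(y_t | x, y_<t) for each sampled rollout
    rewards:        (B,)   binary verifiable rewards in {0, 1}
    mask:           (B, T) 1 for generated tokens, 0 for padding
    """
    # Length-normalized sequence log-probability of each rollout.
    seq_logprob = (token_logprobs * mask).sum(-1) / mask.sum(-1).clamp(min=1)
    # Interpret the per-token probability as the rollout's "score" and fit it
    # to the binary reward label with cross-entropy (classification view).
    seq_prob = seq_logprob.exp().clamp(1e-6, 1 - 1e-6)
    return F.binary_cross_entropy(seq_prob, rewards.float())
```

Because the cross-entropy gradient with respect to the score is bounded and monotone in the prediction error, this classification view yields the kind of bounded, balanced per-rollout gradient weighting described above, in contrast to advantage-weighted updates where a few rollouts can dominate.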