Post-training large language models (LLMs) with reinforcement learning, specifically Group Relative Policy Optimization (GRPO), has emerged as a paradigm for enhancing mathematical reasoning. However, standard GRPO relies on scalar correctness rewards that are often non-injective with respect to semantic content: distinct reasoning paths receive identical rewards. This leads to a Diversity-Quality Inconsistency, in which the policy collapses onto a narrow set of dominant modes and ignores equally valid but structurally novel strategies. To bridge this gap, we propose Diversity-aware Reward Adjustment (DRA), a theoretically grounded framework that calibrates the reward signal using the semantic density of sampled groups. By leveraging Submodular Mutual Information (SMI), DRA implements an Inverse Propensity Scoring (IPS) mechanism that de-biases gradient estimation. This creates a repulsive force against redundancy, driving the policy toward better coverage of the high-reward landscape. Our method is plug-and-play and integrates with GRPO variants. Empirical evaluations on five math benchmarks show that DRA-GRPO consistently outperforms strong baselines, achieving an average accuracy of 58.2% on DeepSeek-R1-Distill-Qwen-1.5B with only 7,000 training samples and a cost of $55, highlighting the critical role of diversity calibration in data-efficient alignment. The code is available at https://github.com/xiwenc1/DRA-GRPO.
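The reward calibration described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact SMI instantiation, so everything specific here is an assumption: the density proxy (the sum of non-negative cosine similarities to the other completions in the group), the `1 / (1 + density)` weighting, and the hypothetical function names `diversity_adjusted_rewards` and `group_relative_advantages`. The toy embeddings are hand-written placeholders for the output of a sentence encoder. This is not the authors' implementation; see the linked repository for that.

```python
import numpy as np


def diversity_adjusted_rewards(rewards, embeddings):
    """Down-weight rewards of completions that are redundant within their group.

    Assumption: semantic density is approximated by the summed non-negative
    cosine similarity between a completion and the rest of its group.
    """
    r = np.asarray(rewards, dtype=float)
    e = np.asarray(embeddings, dtype=float)
    # Unit-normalize rows so that dot products are cosine similarities.
    e = e / np.linalg.norm(e, axis=1, keepdims=True)
    sim = e @ e.T
    np.fill_diagonal(sim, 0.0)  # a completion is not redundant with itself
    # Negative similarities are clipped to zero before summing.
    density = np.clip(sim, 0.0, None).sum(axis=1)
    # Inverse-propensity-style weighting: denser regions get smaller rewards.
    return r / (1.0 + density)


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize rewards within one sampled group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)


# Toy group of four completions for one prompt: three correct, one wrong.
# Completions 0 and 1 are near-duplicates; completion 2 is correct but distinct.
rewards = [1.0, 1.0, 1.0, 0.0]
embeddings = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

plain = group_relative_advantages(rewards)
adjusted = diversity_adjusted_rewards(rewards, embeddings)
advantages = group_relative_advantages(adjusted)
```

With scalar correctness rewards alone, the three correct completions receive identical advantages in `plain`. After the adjustment, the structurally distinct completion (index 2) receives a larger advantage than the two near-duplicates, which is the "repulsive force against redundancy" the abstract refers to. Because only the reward vector changes, the same adjustment could in principle be placed in front of any GRPO variant that standardizes rewards within a group.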