Group Relative Policy Optimization (GRPO) has become a standard approach for training mathematical reasoning models; however, its reliance on multiple completions per prompt makes training computationally expensive. Although recent work has reduced the number of training steps required to reach peak performance, overall wall-clock training time often remains unchanged or even increases because each step costs more. We propose MMR-GRPO, which integrates Maximal Marginal Relevance (MMR) to reweight rewards based on completion diversity. Our key insight is that semantically redundant completions contribute limited marginal learning signal; prioritizing diverse solutions yields more informative updates and accelerates convergence. Extensive evaluations across three model sizes (1.5B, 7B, 8B), three GRPO variants, and five mathematical reasoning benchmarks show that MMR-GRPO achieves comparable peak performance while requiring, on average, 47.9% fewer training steps and 70.2% less wall-clock time. These gains are consistent across models, methods, and benchmarks. Our code is released at: https://github.com/WeiKangda/MMR-GRPO.
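To make the reward-reweighting idea concrete, the following is a minimal sketch of how a greedy MMR pass over one prompt's completions might adjust rewards before GRPO's group-relative normalization. It is an illustration under stated assumptions, not the released implementation: the function names (`mmr_reweight`, `group_advantages`), the trade-off parameter `lam`, the use of cosine similarity over precomputed completion embeddings, and the choice to use the MMR score itself as the adjusted reward are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def mmr_score(i, selected, rewards, embeddings, lam):
    """MMR score: reward (relevance) minus similarity to already-selected items."""
    redundancy = max(
        (cosine(embeddings[i], embeddings[j]) for j in selected), default=0.0
    )
    return lam * rewards[i] - (1.0 - lam) * redundancy


def mmr_reweight(rewards, embeddings, lam=0.7):
    """Greedy MMR pass; returns adjusted rewards in the original completion order.

    Assumption: the MMR score at selection time serves as the adjusted reward.
    """
    adjusted = [0.0] * len(rewards)
    selected = []
    remaining = list(range(len(rewards)))
    while remaining:
        best = max(
            remaining,
            key=lambda i: mmr_score(i, selected, rewards, embeddings, lam),
        )
        adjusted[best] = mmr_score(best, selected, rewards, embeddings, lam)
        selected.append(best)
        remaining.remove(best)
    return adjusted


def group_advantages(rewards, eps=1e-8):
    """Standard GRPO-style normalization: (r - mean) / (std + eps) within a group."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


# Toy group: completions 0 and 1 are semantically identical and both correct,
# completion 2 is correct but different, completion 3 is incorrect.
rewards = [1.0, 1.0, 1.0, 0.0]
embeddings = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]

adjusted = mmr_reweight(rewards, embeddings, lam=0.7)
advantages = group_advantages(adjusted)
```

In this toy group, the duplicate correct completion (index 1) ends up with a lower adjusted reward than the two distinct correct completions (indices 0 and 2), so it receives a smaller advantage after normalization. With `lam=1.0` the redundancy term vanishes and the adjusted rewards equal the original rewards.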