Group Relative Policy Optimization (GRPO) has become a standard approach for training mathematical reasoning models; however, its reliance on multiple completions per prompt makes training computationally expensive. Although recent work has reduced the number of training steps required to reach peak performance, overall wall-clock training time often remains unchanged or even increases because each step becomes more expensive. We propose MMR-GRPO, which integrates Maximal Marginal Relevance (MMR) into GRPO to reweight completion rewards according to their semantic diversity. Our key insight is that semantically redundant completions contribute little marginal learning signal; prioritizing diverse solutions yields more informative updates and faster convergence. Extensive evaluations across three model sizes (1.5B, 7B, and 8B), three GRPO variants, and five mathematical reasoning benchmarks show that MMR-GRPO matches peak performance while requiring 47.9% fewer training steps and 70.2% less wall-clock time on average. These gains are consistent across models, methods, and benchmarks. We will release our code, trained models, and experimental protocols.
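The abstract does not spell out the reweighting rule, so the following is only a minimal sketch of the MMR idea it describes, assuming cosine similarity over completion embeddings, the classical MMR selection loop with the scalar reward as the relevance term, and a hypothetical linear rank-decay weighting; `mmr_reweight`, `lam`, and the embedding source are illustrative, not from the paper.

```python
import numpy as np

def mmr_reweight(rewards, embeddings, lam=0.7):
    """Rank completions by an MMR-style score and scale each reward by a
    rank-based weight, so diverse, high-reward completions dominate the update.

    rewards:    (n,) per-completion scalar rewards from the GRPO group
    embeddings: (n, d) semantic embeddings of the n completions
    lam:        relevance/diversity trade-off (classical MMR lambda)
    """
    rewards = np.asarray(rewards, dtype=float)
    n = len(rewards)
    # Cosine similarity between completion embeddings.
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = unit @ unit.T

    selected, remaining = [], list(range(n))
    weights = np.zeros(n)
    for rank in range(n):
        # Classical MMR score: relevance (here, the reward) minus the
        # maximum similarity to completions selected so far.
        scores = [
            lam * rewards[i]
            - (1 - lam) * (max(sim[i, j] for j in selected) if selected else 0.0)
            for i in remaining
        ]
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
        # Hypothetical weighting: linear decay in MMR rank, so redundant
        # completions (selected late) contribute a damped reward.
        weights[best] = 1.0 - rank / n
    return rewards * weights

# Toy usage: completion 1 is a near-duplicate of completion 0, so its
# reward should be damped relative to the distinct correct completion 3.
rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 16))
emb[1] = emb[0] + 0.01 * rng.normal(size=16)
print(mmr_reweight(np.array([1.0, 1.0, 0.0, 1.0]), emb))
```

One design note on this sketch: damping redundant completions' rewards rather than discarding the completions keeps the full group available for GRPO's group-relative baseline, while still shifting the learning signal toward diverse solutions.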