Large Language Models (LLMs) have shown promise in solving complex mathematical problems, yet they still fall short of producing accurate and consistent solutions. Reinforcement Learning (RL) is a framework for aligning these models with task-specific rewards, improving overall quality and reliability. Group Relative Policy Optimization (GRPO) is an efficient, value-function-free alternative to Proximal Policy Optimization (PPO) that leverages group-relative reward normalization. We introduce Iterative Group Relative Policy Optimization (iGRPO), a two-stage extension of GRPO that adds dynamic self-conditioning through model-generated drafts. In Stage 1, iGRPO samples multiple exploratory drafts and selects the highest-reward draft using the same scalar reward signal used for optimization. In Stage 2, it appends this best draft to the original prompt and applies a GRPO-style update on draft-conditioned refinements, training the policy to improve beyond its strongest prior attempt. Under matched rollout budgets, iGRPO consistently outperforms GRPO across base models (e.g., Nemotron-H-8B-Base-8K and DeepSeek-R1 Distilled), validating its effectiveness on diverse reasoning benchmarks. Moreover, applying iGRPO to OpenReasoning-Nemotron-7B trained on AceReason-Math achieves new state-of-the-art results of 85.62\% and 79.64\% on AIME24 and AIME25, respectively. Ablations further show that the refinement wrapper generalizes beyond GRPO variants, benefits from a generative judge, and alters learning dynamics by delaying entropy collapse. These results underscore the potential of iterative, self-feedback-based RL for advancing verifiable mathematical reasoning.
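The abstract only outlines the two-stage procedure, so the following is a minimal sketch of one iGRPO training step under stated assumptions: the helper names `sample`, `reward_fn`, and `update` are hypothetical placeholders for the policy's sampler, the scalar verifiable-reward function, and a GRPO-style clipped policy-gradient update; they are not the authors' implementation.

```python
# Minimal sketch of one iGRPO step (illustrative; helper names are assumptions).
import statistics
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Rollout:
    text: str
    reward: float


def igrpo_step(
    prompt: str,
    sample: Callable[[str, int], List[str]],    # draws n completions for a prompt (hypothetical)
    reward_fn: Callable[[str, str], float],     # scalar reward shared by both stages (hypothetical)
    update: Callable[[str, List[Rollout], List[float]], None],  # GRPO-style update (hypothetical)
    num_drafts: int = 4,
    group_size: int = 8,
) -> None:
    # Stage 1: sample exploratory drafts and keep the highest-reward one.
    drafts = [Rollout(t, reward_fn(prompt, t)) for t in sample(prompt, num_drafts)]
    best_draft = max(drafts, key=lambda r: r.reward)

    # Stage 2: append the best draft to the prompt and sample draft-conditioned refinements.
    conditioned_prompt = f"{prompt}\n\nPrevious attempt:\n{best_draft.text}\n\nImprove on it:"
    refinements = [
        Rollout(t, reward_fn(prompt, t)) for t in sample(conditioned_prompt, group_size)
    ]

    # Group-relative advantages: normalize rewards within the refinement group.
    rewards = [r.reward for r in refinements]
    mean, std = statistics.mean(rewards), statistics.pstdev(rewards) or 1.0
    advantages = [(r - mean) / std for r in rewards]

    # Apply the GRPO-style update on the draft-conditioned refinements.
    update(conditioned_prompt, refinements, advantages)
```

Note that the refinements are still scored against the original prompt's reward, reflecting the abstract's statement that both stages use the same scalar reward signal used for optimization.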