Reinforcement Learning with Verifiable Rewards (RLVR) offers a robust mechanism for enhancing mathematical reasoning in large models. However, we identify a systematic lack of emphasis on more challenging questions in existing methods from both algorithmic and data perspectives, despite their importance for refining underdeveloped capabilities. Algorithmically, the widely used Group Relative Policy Optimization (GRPO) suffers from an implicit imbalance in which the magnitude of policy updates is lower for harder questions. Data-wise, augmentation approaches primarily rephrase questions to enhance diversity without systematically increasing intrinsic difficulty. To address these issues, we propose MathForge, a dual-perspective framework that improves mathematical reasoning by targeting harder questions from both angles; it comprises a Difficulty-Aware Group Policy Optimization (DGPO) algorithm and a Multi-Aspect Question Reformulation (MQR) strategy. Specifically, DGPO first rectifies the implicit imbalance in GRPO via difficulty-balanced group advantage estimation, and further prioritizes harder questions through difficulty-aware question-level weighting. Meanwhile, MQR reformulates questions across multiple aspects to increase difficulty while preserving the original gold answer. Overall, MathForge forms a synergistic loop: MQR expands the data frontier, and DGPO effectively learns from the augmented data. Extensive experiments show that MathForge significantly outperforms existing methods on various mathematical reasoning tasks. The code and augmented data are available at https://github.com/AMAP-ML/MathForge.
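To make the algorithmic idea concrete, the following is a minimal illustrative sketch, not the paper's exact formulation: standard GRPO normalizes rollout rewards within each question's group, and a difficulty-aware variant could rescale those group advantages by an increasing function of the question's estimated difficulty (here, the group failure rate). The function names and the `alpha` knob are hypothetical, introduced only for illustration.

```python
import numpy as np

def grpo_advantages(rewards):
    """Standard GRPO-style advantage: normalize verifiable rewards
    within a group of rollouts sampled for the same question."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-6)

def difficulty_weighted_advantages(rewards, alpha=1.0):
    """Illustrative difficulty-aware reweighting (an assumption, not the
    paper's DGPO): up-weight the group advantage for harder questions,
    using the group's failure rate as a difficulty proxy."""
    r = np.asarray(rewards, dtype=float)
    adv = grpo_advantages(r)
    difficulty = 1.0 - r.mean()        # fraction of failed rollouts
    weight = 1.0 + alpha * difficulty  # hypothetical weighting scheme
    return weight * adv
```

With binary verifiable rewards, a group for an easy question (e.g. 3 of 4 rollouts correct) receives a smaller weight than a group for a hard question (1 of 4 correct), so the hard question contributes larger-magnitude advantages to the policy update.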