Reinforcement learning algorithms such as group-relative policy optimization (GRPO) have demonstrated strong potential for improving the mathematical reasoning capabilities of large language models. However, prior work has consistently observed an entropy-collapse phenomenon during reinforcement post-training, characterized by a monotonic decrease in policy entropy that ultimately leads to training instability and collapse. As a result, most existing approaches restrict training to short horizons (typically 5-20 epochs), limiting sustained exploration and hindering further policy improvement. In addition, nearly all prior work relies on a single, fixed reasoning prompt or template throughout training. In this work, we introduce prompt augmentation, a training strategy that instructs the model to generate reasoning traces under diverse templates and formats, thereby increasing rollout diversity. We show that, without a KL regularization term, prompt augmentation enables stable scaling of training duration on a fixed dataset and allows the model to tolerate low-entropy regimes without premature collapse. Empirically, a Qwen2.5-Math-1.5B model trained with prompt augmentation on the MATH Level 3-5 dataset achieves state-of-the-art performance, reaching 44.5% per-benchmark accuracy and 51.3% per-question accuracy on standard mathematical reasoning benchmarks, including AIME24, AMC, MATH500, Minerva, and OlympiadBench. The code and model checkpoints are available at https://github.com/wenquanlu/prompt-augmentation-GRPO.