Recent advances in reinforcement learning (RL) using numerical rewards have significantly enhanced the complex reasoning capabilities of large language models (LLMs). However, we identify three fundamental limitations of purely numerical feedback: performance plateaus, ineffective spontaneous self-reflection, and persistent failures. We show that plateaued RL models can successfully refine failed solutions when given natural language critiques. Motivated by this, we propose Critique-GRPO, an online RL framework that integrates both natural language and numerical feedback for policy optimization. This approach enables LLMs to learn simultaneously from initial responses and critique-guided refinements, effectively internalizing the exploration benefits of both stages. Extensive experiments show that Critique-GRPO outperforms all compared supervised and RL-based fine-tuning methods, achieving average Pass@1 improvements of approximately +15.0-21.6% on various Qwen models and +7.3% on Llama-3.2-3B-Instruct across eight challenging reasoning tasks. Notably, Critique-GRPO facilitates effective self-improvement through self-critiquing, achieving substantial gains over GRPO, e.g., +16.7% Pass@1 improvement on AIME 2024.
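To make the mechanism described above concrete, the toy sketch below illustrates the core idea of grouping initial responses together with critique-guided refinements and scoring both with group-relative (GRPO-style) advantages. This is a minimal illustration, not the paper's implementation: `generate`, `critique`, `refine`, and `reward` are hypothetical stand-ins for the policy model, a critique generator, critique-conditioned regeneration, and a correctness-based reward.

```python
# Minimal sketch of the Critique-GRPO idea (illustrative only, not the authors' code).
# Assumptions: generate/critique/refine/reward are hypothetical stand-ins for an LLM
# policy, a natural-language critic, critique-guided regeneration, and a 0/1 reward.
import random
from statistics import mean, pstdev


def generate(prompt):            # hypothetical: sample an initial response from the policy
    return f"answer_{random.randint(0, 3)}"


def critique(prompt, response):  # hypothetical: natural-language feedback on the response
    return f"The response '{response}' may miscompute the final step."


def refine(prompt, response, feedback):  # hypothetical: critique-guided regeneration
    return f"answer_{random.randint(0, 3)}"


def reward(prompt, response):    # hypothetical numerical reward (e.g., answer correctness)
    return 1.0 if response == "answer_0" else 0.0


def grpo_style_advantages(rewards, eps=1e-6):
    """Group-relative advantages as in GRPO: normalize rewards by the group mean/std."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def collect_group(prompt, group_size=4):
    """Build one training group from initial responses AND critique-guided refinements."""
    initial = [generate(prompt) for _ in range(group_size)]
    refined = [refine(prompt, r, critique(prompt, r)) for r in initial]
    responses = initial + refined
    advantages = grpo_style_advantages([reward(prompt, r) for r in responses])
    # Each (response, advantage) pair would then drive a clipped policy-gradient update,
    # so the policy learns jointly from its own attempts and their refinements.
    return list(zip(responses, advantages))


if __name__ == "__main__":
    for resp, adv in collect_group("Solve: 2 + 2 * 3 = ?"):
        print(f"{resp}: advantage = {adv:+.2f}")
```

The sketch only shows how a mixed group of initial and refined responses could share one group-relative baseline; the actual optimization objective, critique source, and shaping terms follow the paper.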