Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful framework for enhancing the reasoning capabilities of large language models (LLMs). However, existing approaches such as Group Relative Policy Optimization (GRPO) and its variants, while effective on reasoning benchmarks, struggle with agentic tasks that require iterative decision-making. We introduce MURPHY, a multi-turn RLVR framework that incorporates execution feedback directly into training, extending GRPO to optimize over multi-turn trajectories in which models iteratively refine their solutions. MURPHY combines a feedback-conditioned rollout tree with trajectory-level credit assignment, and uses pruning to reduce the cost of multi-turn optimization. Evaluations on code generation benchmarks with two model families show that MURPHY consistently improves multi-iteration performance, achieving up to an 8% absolute gain in pass@1 over compute-matched GRPO baselines and outperforming the prior leading method that incorporates multi-turn execution feedback.
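To make the abstract's description concrete, the following is a minimal, illustrative sketch of a feedback-conditioned multi-turn rollout with group-relative, trajectory-level credit assignment. It is not the paper's implementation: the `generate` and `execute_tests` functions are hypothetical stand-ins for the policy model and the code executor, and the two-turn refinement loop and group size are assumed for illustration.

```python
# Sketch: feedback-conditioned multi-turn rollouts with GRPO-style,
# trajectory-level credit assignment. Model and executor are stubbed out;
# only the rollout / advantage logic is shown.
import random
from dataclasses import dataclass, field


@dataclass
class Trajectory:
    turns: list = field(default_factory=list)  # (attempt, feedback) pairs
    reward: float = 0.0                        # trajectory-level reward


def generate(prompt: str, feedback: str | None = None) -> str:
    """Stand-in for the policy: returns a candidate solution, optionally
    conditioned on execution feedback from the previous turn."""
    return f"solution(prompt={prompt!r}, feedback={feedback!r}, seed={random.random():.3f})"


def execute_tests(solution: str) -> tuple[bool, str]:
    """Stand-in for the executor: returns (passed, feedback_message)."""
    passed = random.random() < 0.3
    return passed, "all tests passed" if passed else "test_3 failed: wrong output"


def rollout_group(prompt: str, group_size: int = 4, max_turns: int = 2) -> list[Trajectory]:
    """Sample a group of multi-turn trajectories. Failed attempts are refined in
    later turns conditioned on execution feedback; passing branches stop early
    (a simple form of pruning)."""
    group = []
    for _ in range(group_size):
        traj = Trajectory()
        feedback = None
        for _turn in range(max_turns):
            attempt = generate(prompt, feedback)
            passed, feedback = execute_tests(attempt)
            traj.turns.append((attempt, feedback))
            if passed:
                traj.reward = 1.0
                break
        group.append(traj)
    return group


def group_relative_advantages(group: list[Trajectory]) -> list[float]:
    """GRPO-style credit assignment: normalize trajectory rewards within the
    group; every token in a trajectory shares its trajectory-level advantage."""
    rewards = [t.reward for t in group]
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5 or 1.0
    return [(r - mean) / std for r in rewards]


if __name__ == "__main__":
    trajectories = rollout_group("write a function that reverses a list")
    for traj, adv in zip(trajectories, group_relative_advantages(trajectories)):
        print(f"turns={len(traj.turns)} reward={traj.reward} advantage={adv:+.2f}")
```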