Reinforcement Learning with Verifiable Rewards (RLVR) is widely used to improve reasoning across multiple domains, yet outcome-only scalar rewards are often sparse and uninformative, especially on failed samples, where they merely indicate failure without revealing why the reasoning failed. In this paper, we investigate how to leverage richer verbal feedback to guide RLVR training on failed samples, and how to convert such feedback into a trainable learning signal. Specifically, we propose a multi-turn, feedback-guided reinforcement learning framework built on three mechanisms: (1) dynamic multi-turn regeneration guided by feedback, triggered only on failed samples; (2) two complementary learning signals for within-turn and cross-turn optimization; and (3) structured injection of feedback into the model's reasoning process. Trained on a sample of OpenR1-Math, the approach outperforms supervised fine-tuning and RLVR baselines in-domain and generalizes well out-of-domain.
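To make the three mechanisms concrete, here is a minimal Python sketch of the feedback-guided regeneration loop. All names (`rollout`, `generate`, `verify`, `explain_failure`, `cross_turn_bonus`) are hypothetical placeholders chosen for illustration, and the cross-turn signal shown is only one plausible instantiation; this is a sketch of the structure described above, not the paper's actual implementation.

```python
"""Minimal sketch of multi-turn, feedback-guided regeneration (hypothetical API)."""
from dataclasses import dataclass
from typing import Callable


@dataclass
class Turn:
    prompt: str
    response: str
    reward: float  # outcome-only verifiable reward: 1.0 if correct, else 0.0


def rollout(
    problem: str,
    generate: Callable[[str], str],              # policy: prompt -> response
    verify: Callable[[str, str], bool],          # verifier: (problem, response) -> correct?
    explain_failure: Callable[[str, str], str],  # critic: verbal feedback on a failure
    max_turns: int = 3,
) -> list[Turn]:
    """Regenerate only on failure, injecting structured feedback into the next prompt."""
    turns: list[Turn] = []
    prompt = problem
    for _ in range(max_turns):
        response = generate(prompt)
        reward = 1.0 if verify(problem, response) else 0.0
        turns.append(Turn(prompt, response, reward))
        if reward > 0.0:
            break  # success: no further regeneration is triggered
        # Failure: convert verbal feedback into structured context for the next turn.
        feedback = explain_failure(problem, response)
        prompt = f"{problem}\n\n[Feedback on previous attempt]\n{feedback}"
    return turns


def cross_turn_bonus(turns: list[Turn]) -> float:
    """One plausible cross-turn signal: reward improvement across the trajectory
    (final reward minus first), complementing the per-turn (within-turn) rewards."""
    return turns[-1].reward - turns[0].reward


# Toy usage with stub callables standing in for the policy, verifier, and critic.
if __name__ == "__main__":
    attempts = iter(["4", "5"])  # the second attempt is correct
    traj = rollout(
        problem="What is 2 + 3?",
        generate=lambda p: next(attempts),
        verify=lambda q, r: r.strip() == "5",
        explain_failure=lambda q, r: f"'{r}' is wrong; re-check the addition.",
    )
    print([t.reward for t in traj], cross_turn_bonus(traj))  # [0.0, 1.0] 1.0
```

In this sketch the loop terminates as soon as a turn is verified correct, so successful samples incur no extra generation cost; only failed samples trigger the feedback and regeneration path, mirroring mechanism (1) above.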