R$^3$L: Reflect-then-Retry Reinforcement Learning with Language-Guided Exploration, Pivotal Credit, and Positive Amplification

Reinforcement learning drives recent advances in LLM reasoning and agentic capabilities, yet current approaches struggle with both exploration and exploitation. Exploration suffers from low success rates on difficult tasks and the high cost of repeated rollouts from scratch. Exploitation suffers from coarse credit assignment and training instability: trajectory-level rewards penalize valid prefixes for later errors, and failure-dominated groups overwhelm the few positive signals, leaving optimization without constructive direction. To address these issues, we propose R$^3$L, Reflect-then-Retry Reinforcement Learning with Language-Guided Exploration, Pivotal Credit, and Positive Amplification. To synthesize high-quality trajectories, R$^3$L shifts from stochastic sampling to active synthesis via reflect-then-retry: it leverages language feedback to diagnose errors, transforms failed attempts into successful ones, and reduces rollout costs by restarting from identified failure points. With errors diagnosed and localized, Pivotal Credit Assignment updates only the diverging suffix, where contrastive signals exist, and excludes the shared prefix from the gradient update. Because failures dominate on difficult tasks and reflect-then-retry produces off-policy data, both of which risk training instability, Positive Amplification upweights successful trajectories so that positive signals guide the optimization process. Experiments on agentic and reasoning tasks demonstrate 5\% to 52\% relative improvements over baselines while maintaining training stability. Our code is released at https://github.com/shiweijiezero/R3L.
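To make the two exploitation ideas concrete, the following is a minimal sketch, not the released implementation. It assumes that the failed attempt and its retry share a token prefix, that the pivot can be approximated as the first differing token (the abstract instead locates it through reflection), and that the loss is a simple policy-gradient surrogate. All function names and the amplification factor of 2.0 are hypothetical choices made for illustration.

```python
from typing import List, Sequence


def divergence_index(failed: Sequence[int], retried: Sequence[int]) -> int:
    """Return the first position where the two token sequences differ."""
    limit = min(len(failed), len(retried))
    for i in range(limit):
        if failed[i] != retried[i]:
            return i
    return limit


def suffix_mask(length: int, pivot: int) -> List[float]:
    """1.0 for tokens at or after the pivot, 0.0 for the shared prefix."""
    return [1.0 if i >= pivot else 0.0 for i in range(length)]


def trajectory_weight(reward: float, amplification: float = 2.0) -> float:
    """Upweight successes (reward > 0); failures keep weight 1.0.

    The factor 2.0 is an assumed value, not one taken from the paper.
    """
    return amplification if reward > 0 else 1.0


def masked_weighted_loss(
    logprobs: Sequence[float],
    advantage: float,
    mask: Sequence[float],
    weight: float,
) -> float:
    """Policy-gradient-style surrogate averaged over unmasked tokens only."""
    active = sum(mask)
    if active == 0:
        return 0.0
    total = sum(-advantage * lp * m for lp, m in zip(logprobs, mask))
    return weight * total / active


# Hypothetical token ids: the retry diverges from the failed attempt at index 3.
failed = [5, 8, 13, 21, 34]
retried = [5, 8, 13, 55, 89, 144]

pivot = divergence_index(failed, retried)
mask = suffix_mask(len(retried), pivot)
weight = trajectory_weight(reward=1.0)

# Per-token log-probabilities of the retry under the current policy (made up).
logprobs = [-0.1, -0.2, -0.3, -0.5, -0.5, -0.5]
loss = masked_weighted_loss(logprobs, advantage=1.0, mask=mask, weight=weight)
```

In this sketch the three shared-prefix tokens contribute nothing to the loss, so only the diverging suffix would receive gradient, and the successful retry counts twice as much as a failed trajectory would.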
