R$^3$L: Reflect-then-Retry Reinforcement Learning with Language-Guided Exploration, Pivotal Credit, and Positive Amplification

Reinforcement learning drives recent advances in LLM reasoning and agentic capabilities, yet current approaches struggle with both exploration and exploitation. Exploration suffers from low success rates on difficult tasks and the high cost of repeated rollouts from scratch. Exploitation suffers from coarse credit assignment and training instability: trajectory-level rewards penalize valid prefixes for later errors, and failure-dominated groups overwhelm the few positive signals, leaving optimization without constructive direction. To address these issues, we propose R$^3$L, Reflect-then-Retry Reinforcement Learning with Language-Guided Exploration, Pivotal Credit, and Positive Amplification. To synthesize high-quality trajectories, R$^3$L shifts from stochastic sampling to active synthesis via reflect-then-retry: it uses language feedback to diagnose errors, transforms failed attempts into successful ones, and reduces rollout costs by restarting from the identified failure points. With errors diagnosed and localized, Pivotal Credit Assignment updates only the diverging suffix, where contrastive signals exist, and excludes the shared prefix from the gradient update. Because failures dominate on difficult tasks and reflect-then-retry produces off-policy data, both of which risk training instability, Positive Amplification upweights successful trajectories so that positive signals guide the optimization process. Experiments on agentic and reasoning tasks demonstrate 5\% to 52\% relative improvements over baselines while maintaining training stability. Our code is released at https://github.com/shiweijiezero/R3L.
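
To make the two exploitation-side ideas concrete, the following is a minimal sketch of how a diverging-suffix mask and a positive-sample upweighting could be combined into per-token loss weights. It is not taken from the released code: the function names, the token-level longest-common-prefix rule for locating the pivot, the scalar per-trajectory advantages, and the amplification factor `alpha` are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def shared_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
    """Count the leading tokens that two trajectories have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def amplify(advantage: float, alpha: float) -> float:
    """Assumed rule: scale positive advantages by alpha, leave others unchanged."""
    return alpha * advantage if advantage > 0 else advantage


def suffix_weights(length: int, pivot: int, weight: float) -> List[float]:
    """Zero weight on the shared prefix, `weight` on the diverging suffix."""
    return [weight if i >= pivot else 0.0 for i in range(length)]


def token_weights(
    failed: Sequence[int],
    retried: Sequence[int],
    adv_failed: float,
    adv_retried: float,
    alpha: float = 2.0,  # hypothetical amplification factor
) -> Tuple[int, List[float], List[float]]:
    """Per-token loss weights for a (failed attempt, retry) trajectory pair."""
    pivot = shared_prefix_len(failed, retried)
    w_failed = suffix_weights(len(failed), pivot, amplify(adv_failed, alpha))
    w_retried = suffix_weights(len(retried), pivot, amplify(adv_retried, alpha))
    return pivot, w_failed, w_retried


# Toy token ids: the two trajectories agree on the first three tokens.
pivot, w_failed, w_retried = token_weights(
    failed=[1, 2, 3, 9, 9],
    retried=[1, 2, 3, 4, 5, 6],
    adv_failed=-1.0,
    adv_retried=1.0,
)
```

In this sketch the shared prefix receives zero weight in both trajectories, so only tokens after the pivot would contribute to a policy-gradient loss, and the successful retry contributes with a larger magnitude than the failed attempt. How the method actually locates the pivot and sets the amplification is defined by the paper and its repository, not by this example.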
