Large language models are increasingly deployed as autonomous agents that must plan, act, and recover from mistakes through long-horizon interaction with environments that provide rich feedback. However, prevailing outcome-driven post-training methods (e.g., RL with verifiable rewards) primarily optimize final success signals, leaving that feedback underutilized. Consequently, they often lead to distribution sharpening: the policy becomes better at reproducing a narrow set of already-successful behaviors, while failing to develop the feedback-grounded agency needed to expand problem-solving capacity (e.g., Pass@k) in long-horizon settings. To address this, we propose LEAFE (Learning Feedback-Grounded Agency from Reflective Experience), a framework that internalizes recovery agency from reflective experience. Specifically, during exploration, the agent summarizes environment feedback into actionable experience, backtracks to earlier decision points, and explores alternative branches with revised actions. We then distill these experience-guided corrections into the model through supervised fine-tuning, enabling the policy to recover more effectively in future interactions. Across a diverse set of interactive coding and agentic tasks under fixed interaction budgets, LEAFE consistently improves Pass@1 over the base model and achieves higher Pass@k than outcome-driven baselines (GRPO) and experience-based methods such as Early Experience, with gains of up to 14% on Pass@128.
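The exploration loop described above (summarize feedback into actionable experience, backtrack to a decision point, retry an alternative branch, and collect the corrected trajectory for supervised fine-tuning) can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: `ToyEnv`, `summarize_feedback`, `revise_action`, and `explore_with_backtracking` are all hypothetical names, and the toy environment stands in for a real interactive task.

```python
# Minimal sketch of the LEAFE-style explore/backtrack/retry loop.
# All names here are illustrative stand-ins, not the paper's actual code.

def summarize_feedback(feedback):
    # Condense raw environment feedback into an actionable experience hint.
    return f"avoid:{feedback}"

def revise_action(action, experience):
    # Propose an alternative action informed by accumulated experience.
    banned = {e.split(":", 1)[1] for e in experience}
    for candidate in ["a", "b", "c"]:
        if candidate != action and candidate not in banned:
            return candidate
    return action

class ToyEnv:
    """Toy environment in which only action 'c' succeeds."""
    def step(self, action):
        ok = action == "c"
        feedback = "" if ok else action  # failed action is echoed back
        return ok, feedback

def explore_with_backtracking(env, first_action="a", max_retries=3):
    """Try an action; on failure, summarize the feedback, backtrack to
    the same decision point, and retry with a revised action. Returns
    the trajectory (for later SFT distillation) and the experience log."""
    experience, trace = [], []
    action = first_action
    for _ in range(max_retries):
        ok, feedback = env.step(action)
        trace.append((action, ok))
        if ok:
            break
        experience.append(summarize_feedback(feedback))
        action = revise_action(action, experience)
    return trace, experience
```

In this sketch the agent fails with "a", records the experience, backtracks, fails with "b", and finally succeeds with "c"; the resulting corrected trajectory is the kind of experience-guided trace that LEAFE distills into the policy via supervised fine-tuning.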