Standard reinforcement learning (RL) for large language model (LLM) agents typically optimizes extrinsic rewards, prioritizing isolated task completion over continual adaptation. Consequently, agents often converge to suboptimal policies due to limited exploration. Furthermore, accumulated experience remains implicitly trapped within model parameters, which hinders its explicit reuse for guiding future decisions. Inspired by human retrospective self-improvement, we introduce RetroAgent, an online RL framework that trains agents to master complex interactive environments not only by solving tasks but also by evolving under the joint guidance of extrinsic task rewards and retrospective dual intrinsic feedback. Specifically, RetroAgent employs a hindsight self-reflection mechanism that generates two complementary signals: (1) intrinsic numerical feedback, which rewards promising exploration by tracking real-time incremental subtask progress relative to prior attempts; and (2) intrinsic language feedback, which enables explicit experience reuse by distilling reusable lessons into a memory buffer for subsequent decision-making. To leverage these textual experiences effectively, we propose Similarity & Utility-Aware Upper Confidence Bound (SimUtil-UCB), a retrieval strategy that balances relevance, historical utility, and exploration. Extensive experiments across four challenging agentic tasks show that RetroAgent achieves new state-of-the-art (SOTA) performance. Notably, it surpasses Group Relative Policy Optimization (GRPO) baselines by 18.3% on ALFWorld, 15.4% on WebShop, 27.1% on Sokoban, and 8.9% on MineSweeper, while exhibiting strong test-time adaptation and out-of-distribution generalization.
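To make the idea behind SimUtil-UCB more concrete, the following is a minimal sketch of how a retrieval score might combine relevance, historical utility, and an exploration bonus. The abstract does not give the actual formula, so everything here is an assumption for illustration: the additive form of the score, the weights `alpha`, `beta`, and `c`, the `Lesson` record, and the helper names `cosine`, `simutil_ucb_score`, `retrieve`, and `update` are all hypothetical. The exploration term borrows the shape of the classic UCB1 bonus, smoothed so that never-retrieved lessons get a finite score.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Lesson:
    """Hypothetical memory-buffer entry: a distilled lesson plus usage statistics."""
    text: str
    embedding: Sequence[float]
    uses: int = 0                # how many times this lesson has been retrieved
    total_utility: float = 0.0   # sum of utilities observed after retrieval


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def simutil_ucb_score(lesson: Lesson, query: Sequence[float], total_uses: int,
                      alpha: float = 1.0, beta: float = 1.0, c: float = 1.0) -> float:
    """Assumed score: weighted relevance + mean utility + UCB-style bonus."""
    relevance = cosine(lesson.embedding, query)
    mean_utility = lesson.total_utility / lesson.uses if lesson.uses else 0.0
    # Smoothed UCB1-style bonus: large for rarely retrieved lessons.
    bonus = c * math.sqrt(math.log(total_uses + 1) / (lesson.uses + 1))
    return alpha * relevance + beta * mean_utility + bonus


def retrieve(buffer: List[Lesson], query: Sequence[float], k: int = 1,
             **weights: float) -> List[Lesson]:
    """Return the k highest-scoring lessons for the query."""
    total_uses = sum(lesson.uses for lesson in buffer)
    ranked = sorted(
        buffer,
        key=lambda lesson: simutil_ucb_score(lesson, query, total_uses, **weights),
        reverse=True,
    )
    return ranked[:k]


def update(lesson: Lesson, utility: float) -> None:
    """Record the utility observed after a retrieved lesson was used."""
    lesson.uses += 1
    lesson.total_utility += utility
```

In this sketch, a larger `c` shifts retrieval toward lessons that have rarely been tried, while larger `alpha` or `beta` favor lessons that are similar to the current query or that helped in the past. How utility is measured (for example, from episode outcomes) is left open here, since the abstract does not specify it.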