Critique-guided reinforcement learning (RL) has emerged as a powerful paradigm for training LLM agents by augmenting sparse outcome rewards with natural-language feedback. However, current methods often rely on static or offline critic models that fail to adapt as the policy evolves: in on-policy RL, the agent's error patterns shift over time, so a stationary critic grows stale and its feedback loses utility. To address this, we introduce ECHO (Evolving Critic for Hindsight-Guided Optimization), a framework that jointly optimizes the policy and the critic in a synchronized co-evolutionary loop. ECHO uses a cascaded rollout mechanism: the critic generates multiple diagnoses of an initial trajectory, and the policy then produces a refined rollout for each diagnosis, enabling group-structured advantage estimation. To counter learning plateaus, a saturation-aware gain-shaping objective rewards the critic for inducing incremental improvements even in already high-performing trajectories. Dual-track GRPO updates keep the critic's feedback synchronized with the evolving policy. Experiments show that ECHO yields more stable training and higher long-horizon task success in open-world environments.
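To make the loop described above concrete, here is a minimal Python sketch of one ECHO co-evolution step. It is an illustration under assumed interfaces, not the paper's implementation: `policy_rollout`, `critic_diagnose`, `grpo_update`, the saturation threshold `tau`, and the headroom-rescaling form of the gain shaping are all hypothetical names and choices introduced for exposition.

```python
# Hypothetical sketch of one ECHO co-evolution step. All interfaces
# (policy_rollout, critic_diagnose, grpo_update) and the exact shaping
# form are illustrative assumptions, not the paper's actual API.
from dataclasses import dataclass
from typing import Callable, List
import statistics


@dataclass
class Trajectory:
    actions: List[str]
    reward: float  # sparse outcome reward, assumed to lie in [0, 1]


def gain_shaping(r_before: float, r_after: float, tau: float = 0.8) -> float:
    """Saturation-aware gain shaping (assumed form): the critic is rewarded
    for the improvement it induces, rescaled by the remaining headroom when
    the initial trajectory already performs above the threshold tau, so that
    small gains near saturation still yield a usable learning signal."""
    gain = r_after - r_before
    if r_before >= tau:
        gain = gain / max(1.0 - r_before, 1e-6)
    return gain


def echo_step(policy_rollout: Callable[..., Trajectory],
              critic_diagnose: Callable[[Trajectory], List[str]],
              grpo_update: Callable[[List[float]], None],
              num_diagnoses: int = 4) -> None:
    """One cascaded rollout: initial trajectory -> multiple critic
    diagnoses -> one refined rollout per diagnosis -> dual-track updates."""
    initial = policy_rollout()
    critiques = critic_diagnose(initial)[:num_diagnoses]

    # Group-structured rollouts: one policy refinement per critique.
    refined = [policy_rollout(critique=c) for c in critiques]
    rewards = [t.reward for t in refined]

    # Policy track: group-relative advantages (GRPO-style mean baseline).
    baseline = statistics.mean(rewards)
    grpo_update([r - baseline for r in rewards])

    # Critic track: shaped gains over the initial trajectory, also
    # normalized within the group before the second GRPO update.
    critic_rewards = [gain_shaping(initial.reward, r) for r in rewards]
    critic_baseline = statistics.mean(critic_rewards)
    grpo_update([r - critic_baseline for r in critic_rewards])
```

The two `grpo_update` calls correspond to the dual-track updates: the policy is scored by group-relative outcome rewards, while the critic is scored by the shaped improvement its diagnoses induce, which is what keeps the two models co-evolving in sync.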