LLM-based agents trained with reinforcement learning optimize step-wise action prediction but lack metacognitive awareness of task progress, a gap that hinders scaling to long-horizon tasks. A pilot study reveals that online progress prompting hurts performance whereas retrospective demonstrations help, yet this capability cannot emerge from outcome-reward training alone. We present RePro (Retrospective Progress-Aware Training), a framework that trains agents to self-generate progress signals via a forward-then-reflect rollout paradigm: the agent executes actions online, then retrospectively reassesses its step-wise progress given the completed trajectory and known outcome. RePro begins with a Retrospection Warmup that teaches the reflection format from a minimal set of external demonstrations, then continues training with RePro-PO, whose composite reward elicits self-generated progress signals without continuous external supervision. Experiments on WebShop, ALFWorld, and Sokoban show that RePro improves the performance of Qwen-family models, with absolute success-rate gains of up to $12\%$.
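The forward-then-reflect rollout described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: `Step`, `forward_then_reflect`, `make_counter_env`, and `linear_reflect` are hypothetical names, the counter environment is a toy stand-in for WebShop, ALFWorld, or Sokoban, and the linear progress rule is a placeholder for the reflection that, in RePro, the agent would generate itself. The composite reward of RePro-PO is not specified in the abstract and is not modeled here.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Step:
    observation: str
    action: str
    progress: Optional[float] = None  # filled in only during retrospection


def forward_then_reflect(
    reset: Callable[[], str],
    step: Callable[[str], Tuple[str, bool, bool]],
    act: Callable[[str], str],
    reflect: Callable[[List[Step], bool], List[float]],
    max_steps: int = 10,
) -> Tuple[List[Step], bool]:
    """Run one episode online, then label each step with progress in hindsight."""
    trajectory: List[Step] = []
    obs = reset()
    success = False

    # Forward phase: act online with no progress signal available.
    for _ in range(max_steps):
        action = act(obs)
        next_obs, done, success = step(action)
        trajectory.append(Step(obs, action))
        obs = next_obs
        if done:
            break

    # Reflect phase: reassess every step given the full trajectory and outcome.
    for s, p in zip(trajectory, reflect(trajectory, success)):
        s.progress = p
    return trajectory, success


def make_counter_env(target: int = 3):
    """Toy environment (assumption): succeed once a counter reaches `target`."""
    state = {"count": 0}

    def reset() -> str:
        state["count"] = 0
        return "count=0"

    def step(action: str) -> Tuple[str, bool, bool]:
        if action == "inc":
            state["count"] += 1
        done = state["count"] >= target
        return f"count={state['count']}", done, done

    return reset, step


def linear_reflect(trajectory: List[Step], success: bool) -> List[float]:
    """Placeholder reflector: linear progress on success, zero on failure."""
    n = len(trajectory)
    return [(i + 1) / n if success else 0.0 for i in range(n)]
```

Calling `forward_then_reflect(*make_counter_env(3), lambda obs: "inc", linear_reflect)` returns a three-step trajectory whose `progress` fields are populated only after the episode ends, which mirrors the separation between online execution and retrospective assessment.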