Robotic manipulation benefits from foundation models that describe goals, but today's agents still lack a principled way to learn from their own mistakes. We ask whether natural language can serve as feedback, an error-reasoning signal that helps embodied agents diagnose what went wrong and correct course. We introduce LAGEA (Language Guided Embodied Agents), a framework that turns episodic, schema-constrained reflections from a vision language model (VLM) into temporally grounded guidance for reinforcement learning. LAGEA summarizes each attempt in concise language, localizes the decisive moments in the trajectory, aligns feedback with visual state in a shared representation, and converts goal progress and feedback agreement into bounded, step-wise shaping rewards whose influence is modulated by an adaptive, failure-aware coefficient. This design yields dense signals early, when exploration needs direction, and gracefully recedes as competence grows. On the Meta-World MT10 and Robotic Fetch embodied manipulation benchmarks, LAGEA improves average success over state-of-the-art (SOTA) methods by 9.0% on random goals, 5.3% on fixed goals, and 17.0% on Fetch tasks, while converging faster. These results support our hypothesis: language, when structured and grounded in time, is an effective mechanism for teaching robots to self-reflect on mistakes and make better choices.
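The shaping mechanism described above can be illustrated with a minimal sketch. All names, signatures, and functional forms here are assumptions for illustration only, not the paper's actual implementation: we take a per-step goal-progress delta and a feedback-agreement score, bound their combination, and scale it by a failure-aware coefficient that is large while recent attempts fail and recedes as competence grows.

```python
import math

def shaping_reward(progress_delta, feedback_agreement, failure_rate,
                   bound=1.0, sensitivity=5.0):
    """Hypothetical sketch of LAGEA-style reward shaping (names and forms
    assumed, not taken from the paper).

    progress_delta:     per-step change in goal progress (assumed signal)
    feedback_agreement: agreement between VLM feedback and visual state
                        in the shared representation (assumed signal)
    failure_rate:       fraction of recent episodes that failed, in [0, 1]
    """
    # Failure-aware coefficient: near 1 when most recent episodes fail,
    # decaying toward 0 as the agent succeeds reliably.
    coeff = 1.0 - math.exp(-sensitivity * failure_rate)
    # Squash the combined signal with tanh so the shaping term stays
    # bounded in [-bound, bound] at every step.
    raw = progress_delta + feedback_agreement
    return coeff * bound * math.tanh(raw)
```

Under these assumptions, the shaping term vanishes once the failure rate reaches zero, so the policy's asymptotic objective reduces to the environment reward alone.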