Embodied Instruction Following (EIF) is a crucial task in embodied learning, requiring agents to interact with their environment through egocentric observations to fulfill natural language instructions. Recent work has increasingly employed large language models (LLMs) within a framework-centric approach to improve performance on embodied learning tasks, including EIF. Despite these efforts, there is no unified understanding of how the various components, ranging from visual perception to action execution, affect task performance. To address this gap, we introduce OPEx, a comprehensive framework that delineates the core components essential for solving embodied learning tasks: Observer, Planner, and Executor. Through extensive evaluations, we provide a deep analysis of how each component influences EIF task performance. Furthermore, we deploy a multi-agent dialogue strategy on a TextWorld counterpart, which further enhances task performance. Our findings reveal that LLM-centric design markedly improves EIF outcomes, identify visual perception and low-level action execution as critical bottlenecks, and demonstrate that augmenting LLMs with a multi-agent framework further elevates performance.
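To make the Observer, Planner, and Executor decomposition concrete, the following is a minimal illustrative sketch of how such a loop could be wired together. It is not the OPEx implementation: the class names `ToyObserver`, `ToyPlanner`, and `ToyExecutor`, the `run_episode` helper, the dictionary-based observation, and the rule-based planner (standing in for an LLM call) are all hypothetical and chosen only to keep the example runnable.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ToyObserver:
    """Hypothetical stand-in for perception: raw observation -> text."""

    def describe(self, raw_observation: dict) -> str:
        visible = ", ".join(sorted(raw_observation["visible_objects"]))
        return f"You see: {visible}."


@dataclass
class ToyPlanner:
    """Hypothetical stand-in for an LLM planner: instruction -> subgoals."""

    def plan(self, instruction: str, description: str) -> list[str]:
        # A real planner would query an LLM; a fixed rule keeps this runnable.
        # Assumption: the target object is the last word of the instruction.
        target = instruction.split()[-1]
        if target in description:
            return [f"goto {target}", f"pickup {target}"]
        return ["explore"]


@dataclass
class ToyExecutor:
    """Hypothetical stand-in for low-level control: records executed subgoals."""

    log: list[str] = field(default_factory=list)

    def execute(self, subgoal: str) -> bool:
        self.log.append(subgoal)
        return True  # Assume every action succeeds in this toy setting.


def run_episode(instruction: str, raw_observation: dict) -> list[str]:
    """Run one observe -> plan -> execute pass and return the executed subgoals."""
    observer, planner, executor = ToyObserver(), ToyPlanner(), ToyExecutor()
    description = observer.describe(raw_observation)
    for subgoal in planner.plan(instruction, description):
        if not executor.execute(subgoal):
            break  # Stop at the first failed subgoal.
    return executor.log


if __name__ == "__main__":
    print(run_episode("pick up the mug", {"visible_objects": ["table", "mug"]}))
```

Keeping the three roles behind separate interfaces is what would allow each one to be swapped or ablated independently, which is the kind of per-component analysis the abstract describes.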