Embodied Instruction Following (EIF) is the task of executing natural language instructions by navigating and interacting with objects in interactive environments. A key challenge in EIF is compositional task planning, which is typically addressed through supervised learning or few-shot in-context learning, both of which rely on labeled data. To avoid this reliance, we introduce the Socratic Planner, a self-QA-based zero-shot planning method that infers an appropriate plan without any further training. The Socratic Planner first prompts a Large Language Model (LLM) to ask and answer its own questions about the task, and the resulting question-answer pairs help it generate a sequence of subgoals. While executing the subgoals, an embodied agent may encounter unexpected situations, such as unforeseen obstacles. The Socratic Planner then adjusts its plan based on dense visual feedback through a visually-grounded re-planning mechanism. Experiments show that the Socratic Planner outperforms current state-of-the-art planning models on the ALFRED benchmark across all metrics, and that it excels particularly in long-horizon tasks that demand complex inference. We further demonstrate its real-world applicability by deploying it on a physical robot for long-horizon tasks.
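The two stages described above (self-QA followed by subgoal generation, then re-planning when a subgoal fails) can be illustrated with a minimal sketch. This is not the authors' implementation: the prompt wording, the function names `socratic_plan` and `run_with_replanning`, and the callables `llm`, `execute`, and `describe_scene` are all hypothetical stand-ins for an LLM, a low-level controller, and a visual feedback module that returns a text description.

```python
from typing import Callable, List

# Hypothetical interfaces (assumptions, not from the paper):
LLM = Callable[[str], str]          # prompt in, text out
Executor = Callable[[str], bool]    # runs one subgoal, returns success
SceneDescriber = Callable[[], str]  # visual feedback rendered as text


def _lines(text: str) -> List[str]:
    """Split an LLM reply into non-empty, stripped lines."""
    return [ln.strip() for ln in text.splitlines() if ln.strip()]


def socratic_plan(instruction: str, llm: LLM) -> List[str]:
    """Zero-shot planning: the LLM questions itself, then emits subgoals."""
    questions = _lines(llm(f"List questions needed to carry out: {instruction}"))
    # Answer each self-generated question with the same LLM.
    qa = [(q, llm(f"Instruction: {instruction}\nQuestion: {q}\nAnswer:"))
          for q in questions]
    context = "\n".join(f"Q: {q}\nA: {a}" for q, a in qa)
    return _lines(llm(f"Instruction: {instruction}\n{context}\n"
                      "Subgoals, one per line:"))


def run_with_replanning(instruction: str, llm: LLM, execute: Executor,
                        describe_scene: SceneDescriber,
                        max_replans: int = 3) -> List[str]:
    """Execute subgoals; on failure, revise the remaining plan from feedback."""
    plan = socratic_plan(instruction, llm)
    done: List[str] = []
    replans = 0
    while plan:
        subgoal = plan.pop(0)
        if execute(subgoal):
            done.append(subgoal)
            continue
        if replans >= max_replans:
            break  # give up rather than loop forever
        replans += 1
        # Replace the remaining plan with one conditioned on the observation.
        plan = _lines(llm(f"Instruction: {instruction}\n"
                          f"Completed: {done}\nFailed: {subgoal}\n"
                          f"Observation: {describe_scene()}\n"
                          "Remaining subgoals, one per line:"))
    return done
```

In this sketch the re-planning step discards the remaining subgoals and asks the LLM for a fresh list; whether the actual method revises the plan this way is not stated in the abstract, so treat that choice as an assumption.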