Embodied Instruction Following (EIF) is the task of executing natural language instructions by navigating and interacting with objects in 3D environments. One of the primary challenges in EIF is compositional task planning, which is typically addressed through supervised or in-context learning on labeled data. To this end, we introduce the Socratic Planner, the first zero-shot planning method that performs inference without any training data. The Socratic Planner first decomposes an instruction into substructural information about the task through self-questioning and answering, and then translates that information into a high-level plan, i.e., a sequence of subgoals. Subgoals are executed sequentially, with our visually grounded re-planning mechanism adjusting the plan dynamically based on dense visual feedback. We also introduce RelaxedHLP, an evaluation metric for high-level plans, to enable a more comprehensive evaluation. Experiments demonstrate the effectiveness of the Socratic Planner, which achieves competitive performance on both zero-shot and few-shot task planning in the ALFRED benchmark and particularly excels in tasks requiring higher-dimensional inference. Additionally, precise adjustments to the plan were achieved by incorporating visual information from the environment.
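As a rough illustration of the execution loop described above (sequential subgoals with re-planning driven by feedback), the following minimal sketch may help. It is not the authors' implementation: the `Subgoal` tuple format, the `execute` and `replan` callables, and the `max_replans` limit are all hypothetical names introduced here, and the sketch re-plans only when a subgoal fails, which is a simplifying assumption rather than something the abstract specifies.

```python
from typing import Callable, List, Tuple

# Hypothetical subgoal format: (action, target object), e.g. ("PickupObject", "Mug").
Subgoal = Tuple[str, str]


def run_with_replanning(
    plan: List[Subgoal],
    execute: Callable[[Subgoal], Tuple[bool, str]],
    replan: Callable[[List[Subgoal], str], List[Subgoal]],
    max_replans: int = 3,
) -> List[Subgoal]:
    """Execute subgoals in order and return the ones that succeeded.

    `execute` is assumed to return (success, feedback), where feedback is a
    text description of what the agent observed. `replan` is assumed to take
    the remaining subgoals plus that feedback and return a revised remainder.
    """
    done: List[Subgoal] = []
    remaining = list(plan)
    replans = 0
    while remaining:
        subgoal = remaining[0]
        success, feedback = execute(subgoal)
        if success:
            # Subgoal completed: record it and move to the next one.
            done.append(subgoal)
            remaining.pop(0)
        elif replans < max_replans:
            # Subgoal failed: revise the remaining plan using the feedback.
            remaining = replan(remaining, feedback)
            replans += 1
        else:
            # Re-planning budget exhausted: stop early.
            break
    return done
```

In this sketch the planner and the visual feedback model are passed in as plain callables, so the loop itself stays independent of any particular language model or simulator.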