Enabling embodied agents to carry out complex natural-language instructions from humans is crucial for autonomous household-service systems. Conventional methods can only complete instructions in known environments, where every interactive object is provided to the agent in advance; deployed directly in unknown environments, they usually generate infeasible plans that manipulate nonexistent objects. In contrast, we propose an embodied instruction following (EIF) method for complex tasks in unknown environments, in which the agent efficiently explores the scene and generates feasible plans that use only the objects actually present to accomplish abstract instructions. Specifically, we build a hierarchical EIF framework consisting of a high-level task planner and a low-level exploration controller, both based on multimodal large language models. We then construct a semantic representation map of the scene with dynamic region attention to represent the known visual clues, so that the goals of task planning and scene exploration are aligned with the human instruction. The task planner generates feasible step-by-step plans for accomplishing the human goal according to the task completion process and the known visual clues. The exploration controller predicts the optimal navigation or object interaction policy from the generated step-wise plans and the known visual clues. Experimental results show that our method achieves a 45.09% success rate on 204 complex human instructions, such as making breakfast and tidying rooms, in large house-level scenes.
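The planner/controller interplay described in the abstract can be illustrated with a minimal sketch. It is a toy under stated assumptions, not the authors' implementation: the multimodal large language models are replaced by hand-written rules, the semantic map is a plain dictionary without dynamic region attention, and every name here (`SemanticMap`, `plan_next_step`, `control`, `run_episode`) is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class SemanticMap:
    """Toy stand-in for the semantic map: object name -> region where it was seen."""
    objects: dict = field(default_factory=dict)

    def update(self, region, seen):
        for name in seen:
            self.objects[name] = region


def plan_next_step(required, done, smap):
    """Rule-based stand-in for the high-level planner (assumption, not an MLLM).

    Returns ("interact", obj) only for objects already on the map, so the plan
    never manipulates an object that has not been observed.
    """
    for obj in required:
        if obj not in done:
            kind = "interact" if obj in smap.objects else "explore"
            return (kind, obj)
    return None  # every required object has been handled


def control(step, smap, unexplored):
    """Rule-based stand-in for the low-level exploration controller."""
    kind, obj = step
    if kind == "interact":
        return ("interact", smap.objects[obj], obj)
    if unexplored:
        return ("navigate", unexplored[0], None)
    return ("fail", None, obj)  # nothing left to explore, object not found


def run_episode(required, scene, max_steps=20):
    """Alternate planning and control until the task finishes or fails."""
    smap = SemanticMap()
    unexplored = list(scene)
    done, log = [], []
    for _ in range(max_steps):
        step = plan_next_step(required, done, smap)
        if step is None:
            return True, log
        action, region, obj = control(step, smap, unexplored)
        if action == "fail":
            return False, log
        log.append((action, region, obj))
        if action == "navigate":
            unexplored.remove(region)
            smap.update(region, scene[region])  # observe the region's objects
        else:
            done.append(obj)
    return False, log


if __name__ == "__main__":
    scene = {"living_room": ["sofa"], "kitchen": ["bread", "toaster"]}
    print(run_episode(["bread", "toaster"], scene))
    print(run_episode(["milk"], scene))
```

In this toy, an instruction that needs an object absent from the scene ends in failure after exploration is exhausted, rather than producing an interaction with a nonexistent object, which is the failure mode the abstract attributes to methods designed for known environments.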