Robotic agents that perform domestic chores following natural language directives must master the complex job of navigating an environment and interacting with the objects in it. The tasks given to such agents are often composite and thus challenging, as completing them requires reasoning about multiple subtasks, e.g., bring a cup of coffee. To address this challenge, we propose a divide-and-conquer approach: we break the task into multiple subgoals and attend to them individually for better navigation and interaction. We call the resulting agent the Multi-level Compositional Reasoning Agent (MCR-Agent). Specifically, we learn a three-level action policy. At the highest level, a high-level policy composition controller infers, from the language instructions, a sequence of human-interpretable subgoals to be executed. At the middle level, a master policy discriminatively controls the agent's navigation by alternating between a navigation policy and various independent interaction policies. Finally, at the lowest level, we infer manipulation actions with the corresponding object masks using the appropriate interaction policy. Our approach not only generates human-interpretable subgoals but also achieves a 2.03% absolute gain over comparable state-of-the-art methods in the efficiency metric (PLWSR on the unseen set), without using rule-based planning or a semantic spatial memory.
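The three-level control flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: every name here (`plan_subgoals`, `master_policy`, `navigation_policy`, `INTERACTION_POLICIES`, the subgoal labels, and the action and mask strings) is a hypothetical stand-in, and the learned components are replaced by hard-coded stubs so that only the dispatch structure is visible.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Step:
    subgoal: str         # subgoal chosen at the highest level
    action: str          # low-level action emitted for that subgoal
    mask: Optional[str]  # placeholder for an object mask; None when navigating


def plan_subgoals(instruction: str) -> List[str]:
    """Highest level: stand-in for the policy composition controller.

    Assumption: a fixed subgoal sequence is returned for illustration;
    in the paper this sequence is inferred from the language instruction.
    """
    return ["GOTO", "PICKUP", "GOTO", "PUT"]


def navigation_policy(observation: str) -> str:
    """Stand-in navigation policy; the action string is illustrative."""
    return "MoveAhead"


def make_interaction_policy(action: str, target: str) -> Callable[[str], Tuple[str, str]]:
    """Build a stub interaction policy returning (action, object mask)."""
    def policy(observation: str) -> Tuple[str, str]:
        return action, f"mask_of_{target}"
    return policy


# Lowest level: one independent interaction policy per interaction subgoal.
INTERACTION_POLICIES: Dict[str, Callable[[str], Tuple[str, str]]] = {
    "PICKUP": make_interaction_policy("PickupObject", "cup"),
    "PUT": make_interaction_policy("PutObject", "table"),
}


def master_policy(subgoal: str, observation: str) -> Step:
    """Middle level: route control to navigation or to an interaction policy."""
    if subgoal == "GOTO":
        return Step(subgoal, navigation_policy(observation), None)
    action, mask = INTERACTION_POLICIES[subgoal](observation)
    return Step(subgoal, action, mask)


def run_episode(instruction: str, observation: str = "frame") -> List[Step]:
    """Execute each planned subgoal once and record the resulting steps."""
    return [master_policy(sg, observation) for sg in plan_subgoals(instruction)]


if __name__ == "__main__":
    for step in run_episode("bring a cup of coffee"):
        print(step)
```

In this sketch each subgoal yields a single step; a real agent would presumably keep invoking the selected policy until the subgoal is completed, which the stub omits for brevity.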