Robotic agents that perform domestic chores by following natural language directives must master the complex job of navigating an environment and interacting with the objects in it. The tasks given to such agents are often composite, e.g., "bring a cup of coffee", and are therefore challenging, since completing them requires reasoning about multiple subtasks. To address this challenge, we propose a divide-and-conquer approach that breaks a task into multiple subgoals and attends to them individually for better navigation and interaction. We call it the Multi-level Compositional Reasoning Agent (MCR-Agent). Specifically, we learn a three-level action policy. At the highest level, a high-level policy composition controller infers a sequence of human-interpretable subgoals to be executed based on the language instructions. At the middle level, a master policy discriminatively controls the agent's navigation by alternating between a navigation policy and various independent interaction policies. Finally, at the lowest level, the appropriate interaction policy infers manipulation actions together with the corresponding object masks. Our approach not only generates human-interpretable subgoals but also achieves a 2.03% absolute gain over comparable state-of-the-art methods on the efficiency metric (PLWSR on the unseen set), without using rule-based planning or a semantic spatial memory.
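The three-level structure described in the abstract can be illustrated with a toy control loop. This is only a minimal sketch under strong assumptions, not the authors' implementation: every name here (`compose_subgoals`, `navigation_policy`, `INTERACTION_POLICIES`, `run_agent`, the one-dimensional `WORLD`) is hypothetical, the learned policies are replaced by hand-written stand-ins purely to show the control flow, and the object "mask" is reduced to an object name.

```python
from typing import Callable, Dict, List, Optional, Tuple

Subgoal = Tuple[str, str]            # (interaction type, target object)
Action = Tuple[str, Optional[str]]   # (low-level action, object-mask placeholder)

# Toy 1-D world (assumption): object name -> position along a corridor.
WORLD: Dict[str, int] = {"cup": 2, "coffee_machine": 5}


def compose_subgoals(instruction: str) -> List[Subgoal]:
    """Highest level: stand-in for the policy composition controller.

    In the paper this sequence is inferred from language; here it is a
    hard-coded lookup so the example stays self-contained.
    """
    plans: Dict[str, List[Subgoal]] = {
        "bring a cup of coffee": [("pickup", "cup"), ("put", "coffee_machine")],
    }
    return plans[instruction]


def navigation_policy(position: int, target: int) -> Action:
    """Stand-in navigation policy: step toward the target, no mask needed."""
    return ("move_right" if target > position else "move_left", None)


def pickup_policy(obj: str) -> Action:
    """Lowest level: manipulation action plus a mask placeholder."""
    return ("pickup", obj)


def put_policy(obj: str) -> Action:
    """Lowest level: manipulation action plus a mask placeholder."""
    return ("put", obj)


# One independent interaction policy per subgoal type (assumption).
INTERACTION_POLICIES: Dict[str, Callable[[str], Action]] = {
    "pickup": pickup_policy,
    "put": put_policy,
}


def run_agent(instruction: str, start: int = 0) -> List[Action]:
    """Middle level: a stand-in master policy that alternates between
    navigation and the interaction policy matching the current subgoal."""
    position = start
    trajectory: List[Action] = []
    for kind, obj in compose_subgoals(instruction):
        target = WORLD[obj]
        # Navigate until the agent reaches the object for this subgoal.
        while position != target:
            action = navigation_policy(position, target)
            position += 1 if action[0] == "move_right" else -1
            trajectory.append(action)
        # Hand control to the interaction policy for this subgoal type.
        trajectory.append(INTERACTION_POLICIES[kind](obj))
    return trajectory


if __name__ == "__main__":
    for step in run_agent("bring a cup of coffee"):
        print(step)
```

The point of the sketch is the separation of concerns: the subgoal sequence is fixed before acting, the middle loop only decides which policy is in control, and each interaction policy is responsible for its own action and mask. In the actual agent all three levels are learned rather than scripted.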