Embodied Concept Learner: Self-supervised Learning of Concepts and Mapping through Instruction Following

Humans, even at a very early age, can learn visual concepts and understand geometry and layout through active interaction with the environment, and can generalize compositions of those concepts to complete tasks described in natural language in novel scenes. To mimic this capability, we propose the Embodied Concept Learner (ECL) in an interactive 3D environment. Specifically, a robot agent can ground visual concepts, build semantic maps, and plan actions to complete tasks by learning purely from human demonstrations and language instructions, without access to ground-truth semantic or depth supervision from simulation. ECL consists of: (i) an instruction parser that translates natural language into executable programs; (ii) an embodied concept learner that grounds visual concepts based on language descriptions; (iii) a map constructor that estimates depth and constructs semantic maps by leveraging the learned concepts; and (iv) a program executor with deterministic policies to execute each program. ECL's modular design brings several benefits. First, it enables the robotic agent to learn semantics and depth without supervision, much as infants do, e.g., grounding concepts through active interaction and perceiving depth from disparities when moving forward. Second, ECL is fully transparent and step-by-step interpretable in long-term planning. Third, ECL benefits embodied instruction following (EIF), outperforming previous work on the ALFRED benchmark when semantic labels are not provided. In addition, the learned concepts can be reused for other downstream tasks, such as reasoning about object states. Project page: http://ecl.csail.mit.edu/
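To make the four-module decomposition concrete, the following is a minimal Python sketch of how a parsed program might flow into a deterministic executor that consults a semantic map. It is an illustration under stated assumptions, not the ECL implementation: the names `Step`, `PHRASES`, `parse_instruction`, `SemanticMap`, and `execute` are all hypothetical, the parser is a keyword table rather than a learned model, and the concept learner and depth estimation are replaced by map entries filled in by hand.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Step:
    """One program step: an operation applied to a named object concept."""
    op: str      # hypothetical operation name, e.g. "goto" or "pickup"
    target: str  # object concept the operation refers to


# Hypothetical phrase-to-operation table; a real parser would be learned.
PHRASES = {"go to the": "goto", "pick up the": "pickup"}


def parse_instruction(text):
    """Toy instruction parser: split on commas and match leading phrases."""
    steps = []
    for clause in text.lower().split(","):
        clause = clause.strip()
        for phrase, op in PHRASES.items():
            if clause.startswith(phrase):
                steps.append(Step(op, clause[len(phrase):].strip()))
    return steps


@dataclass
class SemanticMap:
    """Toy semantic map: concept name -> 2D grid cell.

    In the paper this is built from grounded concepts and estimated depth;
    here the entries are supplied by hand (an assumption of this sketch).
    """
    cells: dict = field(default_factory=dict)

    def add(self, concept, xy):
        self.cells[concept] = xy

    def locate(self, concept):
        return self.cells.get(concept)


def execute(program, semantic_map):
    """Deterministic executor: look up each target and log the action taken."""
    log = []
    for step in program:
        pos = semantic_map.locate(step.target)
        if pos is None:
            # Stop at the first target that is missing from the map.
            log.append(f"{step.op}({step.target}): not found on map")
            break
        log.append(f"{step.op}({step.target}) at {pos}")
    return log


if __name__ == "__main__":
    world = SemanticMap()
    world.add("table", (2, 3))
    world.add("mug", (2, 4))
    for line in execute(parse_instruction("Go to the table, pick up the mug"), world):
        print(line)
```

Because every step is an explicit `Step` and every action is logged, the trace can be inspected step by step, which is the kind of transparency the modular design is meant to provide.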
