Relying on multi-modal observations, embodied robots can perform a variety of manipulation tasks in unstructured real-world environments. However, most language-conditioned behavior-cloning agents still face two long-standing challenges, namely 3D scene representation and human-level task learning, when adapting to new sequential tasks in practical scenarios. We investigate these challenges with NBAgent, a pioneering language-conditioned Never-ending Behavior-cloning Agent for embodied robots. It continually learns observation knowledge of novel 3D scene semantics from skill-shared attributes and robot manipulation skills from skill-specific attributes. Specifically, we propose a skill-shared semantic rendering module and a skill-shared representation distillation module to effectively learn 3D scene semantics from skill-shared attributes, thereby addressing the neglect of 3D scene representation. Meanwhile, we establish a skill-specific evolving planner that decouples manipulation knowledge, continually embedding novel skill-specific knowledge from latent and low-rank spaces, as humans do. Finally, we design a never-ending embodied robot manipulation benchmark, and extensive experiments demonstrate the strong performance of our method. Visual results, code, and dataset are provided at: https://neragent.github.io.