Improving the generalization of general-purpose robotic manipulation agents in the real world has long been a significant challenge. Existing approaches often rely on collecting large-scale robot data, such as the RT-1 dataset, which is costly and time-consuming. Moreover, because such data lacks sufficient diversity, these approaches typically exhibit limited capability in open-domain scenarios with novel objects and diverse environments. In this paper, we propose a novel paradigm that conditions robot manipulation tasks on language-reasoned segmentation masks generated by internet-scale foundation models. By integrating the mask modality, which carries semantic, geometric, and temporal-correlation priors derived from vision foundation models, into an end-to-end policy model, our approach robustly perceives object pose and enables sample-efficient generalization, including to new object instances, semantic categories, and unseen backgrounds. We first introduce a series of foundation models to ground natural-language demands across multiple tasks. Second, we develop a two-stream 2D policy model based on imitation learning, which processes raw images and object masks in a local-global perception manner to predict robot actions. Extensive real-world experiments on a Franka Emika robot arm demonstrate the effectiveness of the proposed paradigm and policy architecture. Demos can be found in our submitted video, and more comprehensive ones can be found in link1 or link2.
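The two-stream idea above can be illustrated with a minimal sketch: one stream encodes the raw (global) image, the other encodes the language-grounded object mask (local), and the fused features feed an action head. All layer sizes, the linear encoders, and the 7-DoF action dimension are illustrative placeholders, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, w):
    """Flatten an image batch and project it with a linear layer + ReLU.
    Stand-in for a real visual encoder (e.g. a CNN backbone)."""
    h = x.reshape(x.shape[0], -1) @ w
    return np.maximum(h, 0.0)

# Hypothetical dimensions: batch of 2, 64x64 inputs, 7-DoF action output.
B, H, W = 2, 64, 64
w_rgb  = rng.normal(scale=0.01, size=(H * W * 3, 32))  # global (raw image) stream
w_mask = rng.normal(scale=0.01, size=(H * W * 1, 32))  # local (object mask) stream
w_head = rng.normal(scale=0.01, size=(64, 7))          # fused action head

rgb  = rng.random((B, H, W, 3))                    # raw camera observation
mask = (rng.random((B, H, W, 1)) > 0.5).astype(float)  # segmentation mask input

# Local-global fusion: concatenate the two streams' features, then predict
# an action vector (e.g. end-effector pose delta plus gripper command).
feat = np.concatenate([encode(rgb, w_rgb), encode(mask, w_mask)], axis=1)
action = feat @ w_head
print(action.shape)  # (2, 7)
```

In an imitation-learning setup, a network of this shape would be trained to regress demonstrated actions; the mask stream is what injects the foundation-model priors into the policy.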