Improving the generalization capabilities of general-purpose robotic agents has long been a significant challenge for the research community. Existing approaches often rely on collecting large-scale real-world robotic data, such as the RT-1 dataset. However, these approaches typically suffer from low efficiency, which limits their capability in open-domain scenarios with new objects and diverse backgrounds. In this paper, we propose a novel paradigm that leverages language-grounded segmentation masks generated by state-of-the-art foundation models to address a wide range of pick-and-place robot manipulation tasks in everyday scenarios. By integrating the precise semantics and geometries conveyed by the masks into our multi-view policy model, our approach can perceive accurate object poses and enables sample-efficient learning. In addition, this design facilitates effective generalization to grasping new objects whose shapes are similar to those observed during training. Our approach consists of two distinct steps. First, we introduce a series of foundation models to accurately ground natural language demands across multiple tasks. Second, we develop a Multi-modal Multi-view Policy Model that takes RGB images, semantic masks, and robot proprioception states as inputs and jointly predicts precise and executable robot actions. Extensive real-world experiments conducted on a Franka Emika robot arm validate the effectiveness of the proposed paradigm. Real-world demos are available on YouTube (https://www.youtube.com/watch?v=1m9wNzfp_4E) and Bilibili (https://www.bilibili.com/video/BV178411Z7H2/).
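The two steps described in the abstract (grounding a language instruction into per-view segmentation masks, then feeding RGB images, masks, and proprioception to a policy) can be illustrated with a minimal data-flow sketch. This is not the authors' implementation: every name below (`Observation`, `stack_rgb_and_mask`, `build_observation`, `toy_ground_fn`) is hypothetical, the grounding function is a toy stand-in for the foundation models, the 7-element proprioception vector is an assumed layout, and the policy network itself is omitted. The sketch only shows how a mask might be attached to each camera view as an extra channel before the policy consumes it.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

# Hypothetical type aliases: plain nested lists keep the sketch dependency-free.
Image = List[List[List[float]]]  # H x W x 3 RGB values in [0, 1]
Mask = List[List[int]]           # H x W binary mask (1 = target object)


@dataclass
class Observation:
    """Assumed policy input: per-view RGB+mask images plus robot state."""
    views: Dict[str, List[List[List[float]]]]  # view name -> H x W x 4
    proprio: List[float]


def stack_rgb_and_mask(rgb: Image, mask: Mask) -> List[List[List[float]]]:
    """Append the mask as a fourth channel to every RGB pixel."""
    if len(rgb) != len(mask) or any(len(r) != len(m) for r, m in zip(rgb, mask)):
        raise ValueError("RGB image and mask must share height and width")
    return [
        [list(pixel) + [float(m)] for pixel, m in zip(rgb_row, mask_row)]
        for rgb_row, mask_row in zip(rgb, mask)
    ]


def build_observation(
    instruction: str,
    rgb_views: Dict[str, Image],
    proprio: Sequence[float],
    ground_fn: Callable[[str, Image], Mask],
) -> Observation:
    """Step 1: ground the instruction in each view. Step 2: assemble policy input."""
    views = {}
    for name, rgb in rgb_views.items():
        mask = ground_fn(instruction, rgb)  # stand-in for the foundation models
        views[name] = stack_rgb_and_mask(rgb, mask)
    return Observation(views=views, proprio=list(proprio))


def toy_ground_fn(instruction: str, rgb: Image) -> Mask:
    """Toy grounding: ignores the instruction and marks strongly red pixels."""
    return [[1 if pixel[0] > 0.5 else 0 for pixel in row] for row in rgb]


if __name__ == "__main__":
    rgb = [[[0.9, 0.1, 0.1], [0.2, 0.2, 0.2]]]  # a 1 x 2 image
    obs = build_observation(
        "pick up the red block",
        {"front": rgb, "wrist": rgb},
        [0.0] * 7,  # assumed 7-element proprioception vector
        toy_ground_fn,
    )
    print(sorted(obs.views), len(obs.proprio))
```

Passing the grounding step in as a function keeps the policy input format independent of which segmentation model produces the masks, which is one plausible way to read the separation between the two steps.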