Pave the Way to Grasp Anything: Transferring Foundation Models for Universal Pick-Place Robots

Improving the generalization capabilities of general-purpose robotic agents has long been a significant challenge actively pursued by research communities. Existing approaches often rely on collecting large-scale real-world robotic data, such as the RT-1 dataset. However, these approaches typically suffer from low efficiency, which limits their capability in open-domain scenarios with new objects and diverse backgrounds. In this paper, we propose a novel paradigm that leverages language-grounded segmentation masks generated by state-of-the-art foundation models to address a wide range of pick-and-place robot manipulation tasks in everyday scenarios. By integrating the precise semantics and geometries conveyed by the masks into our multi-view policy model, our approach can perceive accurate object poses and enable sample-efficient learning. In addition, this design facilitates generalization to grasping new objects whose shapes are similar to those observed during training. Our approach consists of two distinct steps. First, we introduce a series of foundation models to accurately ground natural language demands across multiple tasks. Second, we develop a Multi-modal Multi-view Policy Model that takes inputs such as RGB images, semantic masks, and robot proprioception states and jointly predicts precise and executable robot actions. Extensive real-world experiments conducted on a Franka Emika robot arm validate the effectiveness of our proposed paradigm. Real-world demos are available on YouTube (https://www.youtube.com/watch?v=1m9wNzfp_4E) and Bilibili (https://www.bilibili.com/video/BV178411Z7H2/).
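The two-step data flow described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not the authors' implementation: the names `Observation`, `ground_instruction`, `mask_centroid`, `policy`, and `run_pipeline` are hypothetical, the grounding step is replaced by a simple intensity threshold instead of real foundation models, and the learned policy network is replaced by a hand-written fusion of mask centroids and proprioception.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Image = List[List[float]]  # H x W intensity grid, a stand-in for an RGB frame
Mask = List[List[int]]     # H x W binary segmentation mask


@dataclass
class Observation:
    images: Dict[str, Image]     # one image per camera view
    proprioception: List[float]  # robot state; last entry is the gripper state here


def ground_instruction(instruction: str, image: Image) -> Mask:
    """Step 1 stand-in (hypothetical): map an instruction and an image to a mask.

    A real system would query foundation models; this stub ignores the
    instruction and thresholds pixel intensity.
    """
    return [[1 if px > 0.5 else 0 for px in row] for row in image]


def mask_centroid(mask: Mask) -> Tuple[float, float]:
    """Return the (row, col) centroid of the mask foreground, scaled to [0, 1]."""
    height, width = len(mask), len(mask[0])
    cells = [(r, c) for r in range(height) for c in range(width) if mask[r][c]]
    if not cells:
        return (0.5, 0.5)  # no object found: fall back to the image centre
    mean_r = sum(r for r, _ in cells) / len(cells)
    mean_c = sum(c for _, c in cells) / len(cells)
    return (mean_r / ((height - 1) or 1), mean_c / ((width - 1) or 1))


def policy(obs: Observation, masks: Dict[str, Mask]) -> List[float]:
    """Step 2 stand-in (hypothetical): fuse per-view masks with proprioception.

    The real model is a learned network; this placeholder averages the mask
    centroids across views and passes the gripper state through.
    """
    centroids = [mask_centroid(masks[view]) for view in sorted(masks)]
    target_row = sum(c[0] for c in centroids) / len(centroids)
    target_col = sum(c[1] for c in centroids) / len(centroids)
    return [target_row, target_col, obs.proprioception[-1]]


def run_pipeline(instruction: str, obs: Observation) -> List[float]:
    """Ground the instruction in every view, then predict an action."""
    masks = {view: ground_instruction(instruction, img)
             for view, img in obs.images.items()}
    return policy(obs, masks)
```

The point of the sketch is only the interface: masks are computed per camera view from the language instruction and then handed to the policy alongside the raw observation, so the policy never has to interpret the instruction text itself.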

