Pave the Way to Grasp Anything: Transferring Foundation Models for Universal Pick-Place Robots

Improving the generalization capabilities of general-purpose robotic agents has long been a significant challenge actively pursued by research communities. Existing approaches often rely on collecting large-scale real-world robotic data, such as the RT-1 dataset. However, these approaches typically suffer from low efficiency, which limits their capability in open-domain scenarios with new objects and diverse backgrounds. In this paper, we propose a novel paradigm that leverages language-grounded segmentation masks generated by state-of-the-art foundation models to address a wide range of pick-and-place robot manipulation tasks in everyday scenarios. By integrating the precise semantics and geometries conveyed by the masks into our multi-view policy model, our approach can perceive accurate object poses and enable sample-efficient learning. In addition, this design facilitates effective generalization to grasping new objects whose shapes are similar to those observed during training. Our approach consists of two distinct steps. First, we introduce a series of foundation models to accurately ground natural language demands across multiple tasks. Second, we develop a Multi-modal Multi-view Policy Model that takes RGB images, semantic masks, and robot proprioception states as inputs and jointly predicts precise and executable robot actions. Extensive real-world experiments conducted on a Franka Emika robot arm validate the effectiveness of our proposed paradigm. Real-world demos are available on YouTube (https://www.youtube.com/watch?v=1m9wNzfp_4E) and Bilibili (https://www.bilibili.com/video/BV178411Z7H2/).
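
To make the second step more concrete, the following is a minimal sketch of how per-view RGB images, language-grounded masks, and proprioception might be assembled into a single policy input. It is an illustration only, not the authors' implementation: the abstract does not say how the modalities are fused, so the mask-as-fourth-channel fusion, the names `Observation`, `fuse_view`, and `build_policy_input`, and the 7-value proprioception vector are all assumptions. The masks are taken as given here; in the paper they come from foundation models, which this sketch does not call.

```python
from dataclasses import dataclass
from typing import Dict, List

Image = List[List[List[float]]]  # H x W x 3 RGB values
Mask = List[List[int]]           # H x W binary mask (1 = target object)


@dataclass
class Observation:
    """Hypothetical container for what the policy sees at one time step."""
    rgb_views: List[Image]   # one RGB image per camera view
    mask_views: List[Mask]   # one language-grounded mask per camera view
    proprio: List[float]     # robot proprioception state (layout is assumed)


def fuse_view(rgb: Image, mask: Mask) -> List[List[List[float]]]:
    """Append the mask as a fourth channel to every pixel (H x W x 4).

    Channel concatenation is an assumed fusion scheme, chosen for simplicity.
    """
    same_height = len(rgb) == len(mask)
    same_width = all(len(r) == len(m) for r, m in zip(rgb, mask))
    if not (same_height and same_width):
        raise ValueError("RGB image and mask must share height and width")
    return [
        [pixel + [float(m)] for pixel, m in zip(rgb_row, mask_row)]
        for rgb_row, mask_row in zip(rgb, mask)
    ]


def build_policy_input(obs: Observation) -> Dict[str, list]:
    """Collect fused views and proprioception into one policy input."""
    if len(obs.rgb_views) != len(obs.mask_views):
        raise ValueError("Each RGB view needs exactly one mask")
    return {
        "views": [fuse_view(rgb, mask)
                  for rgb, mask in zip(obs.rgb_views, obs.mask_views)],
        "proprio": list(obs.proprio),
    }


# Toy usage: two camera views of a 1 x 2 image, target object at the right pixel.
rgb = [[[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]]
mask = [[0, 1]]
obs = Observation(rgb_views=[rgb, rgb], mask_views=[mask, mask],
                  proprio=[0.0] * 7)  # 7 values is an illustrative choice
inputs = build_policy_input(obs)
```

Under these assumptions, each pixel carries both appearance and the "is this the requested object" signal, which is one plausible way for a policy to use the semantics and geometry that the masks convey. A learned policy network would then map `inputs` to robot actions; that part is outside the scope of this sketch.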

