We focus on the task of language-conditioned grasping in clutter, in which a robot must grasp a target object specified by a language instruction. Previous works decouple this task into two stages: visual grounding to localize the target object, followed by grasp generation for that object. However, these works require object labels or visual attributes for grounding, which demands handcrafted rules in the planner and restricts the range of admissible language instructions. In this paper, we propose to jointly model vision, language, and action with an object-centric representation. Our method accommodates more flexible language instructions and is not limited by visual grounding errors. Moreover, by leveraging the powerful priors of a pre-trained multi-modal model and a pre-trained grasp model, sample efficiency is effectively improved and the sim-to-real gap is alleviated without additional data for transfer. A series of experiments in simulation and the real world shows that our method achieves a higher task success rate with fewer motions under more flexible language instructions. Our method also generalizes better to scenarios with unseen objects and language instructions. Our code is available at https://github.com/xukechun/Vision-Language-Grasping