We focus on the task of language-conditioned grasping in clutter, in which a robot must grasp a target object specified by a language instruction. Previous works treat visual grounding, which localizes the target object, and grasp generation for that object as separate steps. However, these works require object labels or visual attributes for grounding, which calls for handcrafted rules in the planner and restricts the range of language instructions. In this paper, we propose to jointly model vision, language, and action with an object-centric representation. Our method handles more flexible language instructions and is not limited by visual grounding error. In addition, by utilizing the powerful priors from the pre-trained multi-modal model and grasp model, our method improves sample efficiency and relieves the sim2real problem without additional data for transfer. Experiments in simulation and the real world indicate that our method achieves a higher task success rate with fewer motions under more flexible language instructions. Moreover, our method generalizes better to scenarios with unseen objects and language instructions. Our code is available at https://github.com/xukechun/Vision-Language-Grasping