Towards Open-World Grasping with Large Vision-Language Models

The ability to grasp objects in the wild from open-ended language instructions constitutes a fundamental challenge in robotics. An open-world grasping system should be able to combine high-level contextual reasoning with low-level physical-geometric reasoning in order to be applicable in arbitrary scenarios. Recent works exploit the web-scale knowledge inherent in large language models (LLMs) to plan and reason in robotic contexts, but rely on external vision and action models to ground such knowledge in the environment and to parameterize actuation. This setup suffers from two major bottlenecks: a) the LLM's reasoning capacity is constrained by the quality of visual grounding, and b) LLMs lack low-level spatial understanding of the world, which is essential for grasping in contact-rich scenarios. In this work, we demonstrate that modern vision-language models (VLMs) are capable of tackling these limitations, as they are implicitly grounded and can jointly reason about semantics and geometry. We propose OWG, an open-world grasping pipeline that combines VLMs with segmentation and grasp synthesis models to unlock grounded world understanding in three stages: open-ended referring segmentation, grounded grasp planning, and grasp ranking via contact reasoning, all of which can be applied zero-shot via suitable visual prompting mechanisms. We conduct an extensive evaluation on cluttered indoor scene datasets to showcase OWG's robustness in grounding from open-ended language, as well as open-world robotic grasping experiments in both simulation and hardware that demonstrate superior performance compared to previous supervised and zero-shot LLM-based methods.
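To make the three stages named in the abstract concrete, the following is a minimal sketch of how such a pipeline could be wired together. It is not the authors' implementation: every name here (`Grasp`, `open_world_grasp`, and the five stage callables) is hypothetical, the stage signatures are assumptions, and the interpretation of "grounded grasp planning" as choosing which object to grasp next is also an assumption. In a real system the `refer`, `plan` and `rank` stages would be backed by a visually prompted VLM, while `segment` and `synthesize` would be backed by segmentation and grasp synthesis models; here they are plain callables so the control flow can be run with toy stand-ins.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Grasp:
    """Hypothetical grasp candidate; the fields are illustrative assumptions."""
    object_id: int
    pose: Tuple[float, float, float]  # assumed planar pose: (x, y, angle)
    score: float                      # toy stand-in for a ranking signal


def open_world_grasp(
    image: Any,
    instruction: str,
    segment: Callable[[Any], List[int]],
    refer: Callable[[Any, List[int], str], int],
    plan: Callable[[Any, List[int], int], int],
    synthesize: Callable[[Any, int], List[Grasp]],
    rank: Callable[[Any, List[Grasp]], List[Grasp]],
) -> Optional[Grasp]:
    """Run the three stages in order and return the top-ranked grasp, if any."""
    # Stage 1: open-ended referring segmentation.
    # Segment the scene, then pick the object the instruction refers to.
    object_ids = segment(image)
    target_id = refer(image, object_ids, instruction)

    # Stage 2: grounded grasp planning (assumed here to mean choosing
    # which object to grasp next, which may differ from the target).
    next_id = plan(image, object_ids, target_id)

    # Stage 3: grasp ranking. Propose candidates for the chosen object,
    # then order them from best to worst.
    candidates = synthesize(image, next_id)
    ranked = rank(image, candidates)
    return ranked[0] if ranked else None


# Toy stand-ins so the sketch runs end to end without any models.
def toy_segment(image: Any) -> List[int]:
    return [0, 1, 2]


def toy_refer(image: Any, ids: List[int], text: str) -> int:
    return 2


def toy_plan(image: Any, ids: List[int], target: int) -> int:
    return target


def toy_synthesize(image: Any, object_id: int) -> List[Grasp]:
    return [
        Grasp(object_id, (0.0, 0.0, 0.0), 0.2),
        Grasp(object_id, (0.1, 0.0, 1.57), 0.9),
    ]


def toy_rank(image: Any, grasps: List[Grasp]) -> List[Grasp]:
    return sorted(grasps, key=lambda g: g.score, reverse=True)


best = open_world_grasp(
    None, "pick the mug",
    toy_segment, toy_refer, toy_plan, toy_synthesize, toy_rank,
)
print(best)
```

Passing the stages in as callables keeps the sketch independent of any particular model or library, which matches the abstract's description of a pipeline that combines separately trained components rather than a single end-to-end network.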

