Performing robotic grasping from a cluttered bin based on human instructions is a challenging task, as it requires understanding both the nuances of free-form language and the spatial relationships between objects. Vision-Language Models (VLMs) trained on web-scale data, such as GPT-4o, have demonstrated remarkable reasoning capabilities across both text and images. But can they truly be used for this task in a zero-shot setting? And what are their limitations? In this paper, we explore these research questions via the free-form language-based robotic grasping task, and propose a novel method, FreeGrasp, that leverages the pre-trained VLMs' world knowledge to reason about human instructions and object spatial arrangements. Our method represents all detected objects as keypoints and annotates the image with a mark at each keypoint, aiming to facilitate GPT-4o's zero-shot spatial reasoning. This allows our method to determine whether a requested object is directly graspable or whether other objects must be grasped and removed first. Since no existing dataset is specifically designed for this task, we introduce FreeGraspData, a synthetic dataset that extends the MetaGraspNetV2 dataset with human-annotated instructions and ground-truth grasping sequences. We conduct extensive analyses on FreeGraspData and validate our method in the real world with a gripper-equipped robotic arm, demonstrating state-of-the-art performance in grasp reasoning and execution. Project website: https://tev-fbk.github.io/FreeGrasp/.
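The mark-based prompting idea described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the keypoint coordinates, the prompt wording, and the `parse_reply` helper are all hypothetical, and the real system additionally draws the marks onto the image before querying GPT-4o.

```python
def build_prompt(instruction, keypoints):
    """Compose a zero-shot query listing one mark per detected object.

    keypoints: list of (x, y) image coordinates, one per detected object.
    In the actual pipeline these marks are also drawn on the image; here
    we show only the textual side of the prompt (hypothetical wording).
    """
    marks = "\n".join(
        f"Mark {i}: object at pixel ({x}, {y})"
        for i, (x, y) in enumerate(keypoints)
    )
    return (
        f"Instruction: {instruction}\n"
        f"Annotated object marks:\n{marks}\n"
        "Which mark should be grasped first? If the requested object is "
        "blocked, answer with the mark of the object to remove first."
    )

def parse_reply(reply):
    """Extract the chosen mark index from a (hypothetical) VLM reply."""
    for token in reply.split():
        stripped = token.strip(".,")
        if stripped.isdigit():
            return int(stripped)
    return None

# Example: two detected objects; the VLM decides which mark to grasp first.
prompt = build_prompt("pick up the red mug", [(120, 80), (200, 150)])
chosen = parse_reply("Grasp mark 1 first, it occludes the mug.")
```

The key design choice is that the VLM never outputs coordinates; it only selects among the annotated marks, which sidesteps VLMs' known weakness at precise pixel-level localization.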