Performing robotic grasping from a cluttered bin based on human instructions is a challenging task, as it requires understanding both the nuances of free-form language and the spatial relationships between objects. Vision-Language Models (VLMs) trained on web-scale data, such as GPT-4o, have demonstrated remarkable reasoning capabilities across both text and images. But can they truly be used for this task in a zero-shot setting? And what are their limitations? In this paper, we explore these research questions via the free-form language-based robotic grasping task, and propose a novel method, FreeGrasp, that leverages the pre-trained VLMs' world knowledge to reason about human instructions and object spatial arrangements. Our method represents all detected objects as keypoints and annotates the image with a mark at each keypoint, aiming to facilitate GPT-4o's zero-shot spatial reasoning. This allows our method to determine whether a requested object is directly graspable or whether other objects must be grasped and removed first. Since no existing dataset is specifically designed for this task, we introduce FreeGraspData, a synthetic dataset that extends the MetaGraspNetV2 dataset with human-annotated instructions and ground-truth grasping sequences. We conduct extensive analyses on FreeGraspData and validate our method in the real world with a gripper-equipped robotic arm, demonstrating state-of-the-art performance in grasp reasoning and execution. Project website: https://tev-fbk.github.io/FreeGrasp/.
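The mark-based prompting idea described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the keypoint coordinates, the prompt wording, and the `parse_reply` helper are all hypothetical, and the real system additionally draws the marks onto the image before querying GPT-4o.

```python
def build_prompt(instruction, keypoints):
    """Compose a zero-shot query listing one mark per detected object.

    keypoints: list of (x, y) image coordinates, one per detected object.
    In the actual pipeline these marks are also drawn on the image; here
    we show only the textual side of the prompt (hypothetical wording).
    """
    marks = "\n".join(
        f"Mark {i}: object at pixel ({x}, {y})"
        for i, (x, y) in enumerate(keypoints)
    )
    return (
        f"Instruction: {instruction}\n"
        f"Annotated object marks:\n{marks}\n"
        "Which mark should be grasped first? If the requested object is "
        "blocked, answer with the mark of the object to remove first."
    )

def parse_reply(reply):
    """Extract the chosen mark index from a (hypothetical) VLM reply."""
    for token in reply.split():
        stripped = token.strip(".,")
        if stripped.isdigit():
            return int(stripped)
    return None

# Example: two detected objects; the VLM decides which mark to grasp first.
prompt = build_prompt("pick up the red mug", [(120, 80), (200, 150)])
chosen = parse_reply("Grasp mark 1 first, it occludes the mug.")
```

The key design choice is that the VLM never outputs coordinates; it only selects among the annotated marks, which sidesteps VLMs' known weakness at precise pixel-level localization.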