Task-oriented grasping (TOG), the problem of synthesizing grasps on an object that are configurationally compatible with the downstream manipulation task, is the first milestone towards tool manipulation. Analogous to the joint activation of the two brain regions responsible for semantic and geometric reasoning during cognitive processes, modeling the complex relationship between objects, tasks, and grasps requires rich prior knowledge about both objects and tasks. Existing methods typically confine this prior knowledge to a closed set and therefore cannot generalize to novel objects and tasks outside the training set. To address this limitation, we propose FoundationGrasp, a foundation model-based TOG framework that leverages the open-ended knowledge in foundation models to learn generalizable TOG skills. Comprehensive experiments on the contributed Language and Vision Augmented TaskGrasp (LaViA-TaskGrasp) dataset demonstrate the superiority of FoundationGrasp over existing methods when generalizing to novel object instances, object classes, and tasks outside the training set. Furthermore, the effectiveness of FoundationGrasp is validated in real-robot grasping and manipulation experiments on a 7-DoF robotic arm. Our code, data, appendix, and video are publicly available at https://sites.google.com/view/foundationgrasp.