Task-oriented grasping (TOG), which refers to synthesizing grasps on an object that are configurationally compatible with the downstream manipulation task, is the first milestone towards tool manipulation. Analogous to how the brain activates two distinct regions for semantic and geometric reasoning during cognition, modeling the intricate relationship between objects, tasks, and grasps necessitates rich semantic and geometric prior knowledge about these elements. Existing methods typically restrict this prior knowledge to a closed-set scope, limiting their generalization to novel objects and tasks outside the training set. To address this limitation, we propose FoundationGrasp, a foundation model-based TOG framework that leverages the open-ended knowledge in foundation models to learn generalizable TOG skills. Extensive experiments are conducted on the contributed Language and Vision Augmented TaskGrasp (LaViA-TaskGrasp) dataset, demonstrating the superiority of FoundationGrasp over existing methods when generalizing to novel object instances, object classes, and tasks outside the training set. Furthermore, the effectiveness of FoundationGrasp is validated in real-robot grasping and manipulation experiments on a 7-DoF robotic arm. Our code, data, appendix, and video are publicly available at https://sites.google.com/view/foundationgrasp.