Task-oriented grasping, which involves grasping specific parts of objects according to their functions, is crucial for developing advanced robotic systems capable of performing complex tasks in dynamic environments. In this paper, we propose a training-free framework that incorporates both semantic and geometric priors for zero-shot task-oriented grasp generation. The proposed framework, SegGrasp, first leverages vision-language models such as GLIP for coarse segmentation. It then uses fine-grained geometric information from convex decomposition to improve segmentation quality through a fusion policy named GeoFusion. Effective grasp poses are then generated by a grasping network from the improved segmentation. We conducted experiments on both a segmentation benchmark and real-world robot grasping. The experimental results show that SegGrasp surpasses the baseline by more than 15\% in both grasp and segmentation performance.