In this paper, we propose Lan-grasp, a novel approach to semantically appropriate grasping. We use foundation models to give the robot a deeper understanding of objects: which part is the right place to grasp and which parts should be avoided. This allows the robot to grasp and use objects in a more meaningful and safer manner. Lan-grasp combines a Large Language Model, a Vision Language Model, and a traditional grasp planner to generate grasps that reflect this semantic understanding. We first prompt the Large Language Model to name the object part that is appropriate for grasping. Next, the Vision Language Model identifies the corresponding part in the object image. Finally, we generate grasp proposals in the region proposed by the Vision Language Model. Building on foundation models yields a zero-shot grasping method that can handle a wide range of objects without further training or fine-tuning. We evaluate our method in real-world experiments on a custom object dataset. We also present the results of a survey in which participants were asked to choose the object part appropriate for grasping. The results show that participants consistently rank the grasps generated by our method higher than those generated by a conventional grasp planner and a recent semantic grasping approach. In addition, we propose a Visual Chain-of-Thought feedback loop to assess grasp feasibility in complex scenarios. This mechanism enables dynamic reasoning and generates alternative grasp strategies when needed, which supports safer and more effective grasping outcomes.
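The three-stage pipeline and the feasibility feedback loop described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: every name here (`Grasp`, `plan_semantic_grasp`, and the callables `ask_llm`, `locate_part`, `propose_grasps`, `is_feasible`) is hypothetical, the prompt wording is invented, and filtering planner proposals by a bounding box is only one plausible way to restrict grasps to the region returned by the Vision Language Model. The `is_feasible` callable stands in for the Visual Chain-of-Thought check.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# All names below are hypothetical placeholders, not the authors' API.
BBox = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels


@dataclass
class Grasp:
    x: float      # grasp centre in image coordinates
    y: float
    score: float  # quality estimate reported by the grasp planner


def inside(grasp: Grasp, box: BBox) -> bool:
    """Return True if the grasp centre lies within the bounding box."""
    x0, y0, x1, y1 = box
    return x0 <= grasp.x <= x1 and y0 <= grasp.y <= y1


def plan_semantic_grasp(
    object_name: str,
    image: object,
    ask_llm: Callable[[str], str],
    locate_part: Callable[[object, str], Optional[BBox]],
    propose_grasps: Callable[[object], List[Grasp]],
    is_feasible: Callable[[object, Grasp], bool],
    max_attempts: int = 3,
) -> Optional[Grasp]:
    """Pick a grasp on the part the LLM recommends, retrying on failure."""
    rejected: List[str] = []
    for _ in range(max_attempts):
        # Step 1: ask the language model which part to grasp.
        prompt = f"Which part of a {object_name} should a robot grasp?"
        if rejected:
            prompt += " Do not suggest: " + ", ".join(rejected) + "."
        part = ask_llm(prompt).strip().lower()

        # Step 2: ground that part in the image.
        box = locate_part(image, f"{object_name} {part}")
        if box is not None:
            # Step 3: keep only planner proposals inside the grounded region,
            # best score first.
            candidates = [g for g in propose_grasps(image) if inside(g, box)]
            candidates.sort(key=lambda g: g.score, reverse=True)
            # Feedback: accept the first grasp that passes the feasibility check.
            for grasp in candidates:
                if is_feasible(image, grasp):
                    return grasp
        # Nothing worked for this part, so ask for an alternative next time.
        rejected.append(part)
    return None
```

Passing the models in as callables keeps the sketch independent of any particular Large Language Model, Vision Language Model, or grasp planner; in a real system each callable would wrap the corresponding model or planner call.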