Lan-grasp: Using Large Language Models for Semantic Object Grasping and Placement

We propose Lan-grasp, a novel approach to semantically appropriate object grasping and placement. We use foundation models to give the robot a semantic understanding of object geometry, enabling it to identify where to grasp an object, which parts to avoid, and the natural pose for placing it, so that objects are grasped and used in a more meaningful and safe manner. Lan-grasp combines a Large Language Model (LLM), a Vision-Language Model (VLM), and a traditional grasp planner to generate grasps that reflect a deeper semantic understanding of the objects. Building on foundation models yields a zero-shot grasping method that handles a wide range of objects without further training or fine-tuning. We also propose a method for safely putting down a grasped object; the core idea is to rotate the object upright using a pretrained generative model and the reasoning capabilities of a VLM. We evaluate our method in real-world experiments on a custom object dataset and present the results of a survey in which participants were asked to choose the object part appropriate for grasping. The results show that participants consistently rank the grasps generated by our method higher than those generated by a conventional grasp planner and by a recent semantic grasping approach. In addition, we propose a Visual Chain-of-Thought feedback loop that assesses grasp feasibility in complex scenarios. This mechanism enables dynamic reasoning and generates alternative grasp strategies when needed, for safer and more effective grasping outcomes.
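
To make the combination of LLM, VLM, and grasp planner more concrete, the following is a minimal sketch of how such a pipeline could be wired together. It is not the implementation described in this paper. All names (`Grasp`, `inside`, `select_semantic_grasp`, `ask_llm`, `ground_part`) are hypothetical, the LLM and VLM are replaced by stub functions, and the sketch assumes that the VLM localizes the chosen part as a 2D bounding box in the image and that the planner's candidate grasps can be projected into that image.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Hypothetical box format: (x_min, y_min, x_max, y_max) in image pixels.
Box = Tuple[float, float, float, float]


@dataclass
class Grasp:
    x: float      # grasp centre projected into the image (assumption)
    y: float
    score: float  # planner quality score, higher is better (assumption)


def inside(box: Box, grasp: Grasp) -> bool:
    """Return True if the grasp centre lies within the bounding box."""
    x_min, y_min, x_max, y_max = box
    return x_min <= grasp.x <= x_max and y_min <= grasp.y <= y_max


def select_semantic_grasp(
    object_name: str,
    ask_llm: Callable[[str], str],
    ground_part: Callable[[str, str], Optional[Box]],
    candidates: List[Grasp],
) -> Optional[Grasp]:
    """Pick the best-scoring planner grasp on the part named by the LLM."""
    # 1. Ask the language model which part should be grasped.
    prompt = (
        f"Which part of a {object_name} should a robot grasp? "
        "Answer with one part name."
    )
    part = ask_llm(prompt).strip().lower()

    # 2. Ask the vision-language model to localize that part in the image.
    box = ground_part(object_name, part)
    if box is None:
        return None  # part not found, no semantic grasp available

    # 3. Keep only the planner's candidates that fall on that part.
    feasible = [g for g in candidates if inside(box, g)]
    return max(feasible, key=lambda g: g.score, default=None)


# Stub models standing in for a real LLM and VLM (illustration only).
def fake_llm(prompt: str) -> str:
    return "Handle"


def fake_vlm(object_name: str, part: str) -> Optional[Box]:
    return (0.0, 0.0, 50.0, 20.0) if part == "handle" else None


candidates = [Grasp(10, 10, 0.6), Grasp(80, 10, 0.9), Grasp(30, 5, 0.7)]
best = select_semantic_grasp("knife", fake_llm, fake_vlm, candidates)
print(best)
```

In this toy setup the highest-scoring candidate lies outside the handle box and is therefore discarded, which illustrates why a purely geometric planner score is not enough for semantic grasping. A feasibility check such as the Visual Chain-of-Thought feedback loop could, under the same assumptions, be added as a further filter on the selected grasp.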
