Flexible instruction-guided 6-DoF grasping is a significant yet challenging task for real-world robotic systems. Existing methods exploit the contextual understanding capabilities of large language models (LLMs) to establish mappings between expressions and targets, allowing robots to comprehend users' intentions in instructions. However, LLMs' knowledge of objects' physical properties remains underexplored despite its close relevance to grasping. In this work, we propose GraspCoT, a 6-DoF grasp detection framework that integrates a Chain-of-Thought (CoT) reasoning mechanism oriented toward physical properties, guided by auxiliary question-answering (QA) tasks. Specifically, we design a set of QA templates to enable hierarchical reasoning in three stages: target parsing, physical property analysis, and grasp action selection. Moreover, GraspCoT presents a unified multimodal LLM architecture that encodes multi-view observations of 3D scenes into 3D-aware visual tokens and then jointly embeds these visual tokens with CoT-derived textual tokens within the LLM to generate grasp pose predictions. Furthermore, we present IntentGrasp, a large-scale benchmark that fills the gap in public datasets for multi-object grasp detection under diverse and indirect verbal commands. Extensive experiments on IntentGrasp demonstrate the superiority of our method, and additional validation in real-world robotic applications confirms its practicality. Code and data will be released.
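To make the three-stage hierarchical reasoning concrete, the sketch below shows one plausible way to chain the QA templates into a pipeline. All names here (`STAGE_TEMPLATES`, `query_llm`, `run_cot`) are hypothetical illustrations under our own assumptions, not the released GraspCoT API.

```python
# A minimal sketch of QA-guided three-stage CoT reasoning: target parsing,
# physical property analysis, and grasp action selection. The template
# wording and function names are assumptions for illustration only.

from dataclasses import dataclass

STAGE_TEMPLATES = {
    "target_parsing": (
        "Instruction: {instruction}\n"
        "Q: Which object(s) in the scene does the user want grasped?"
    ),
    "property_analysis": (
        "Target: {target}\n"
        "Q: Which physical properties of the target (material, fragility, "
        "weight, shape) are relevant to grasping it safely?"
    ),
    "action_selection": (
        "Target: {target}\nProperties: {properties}\n"
        "Q: Which grasp region and approach direction suit these properties?"
    ),
}


@dataclass
class CoTResult:
    target: str
    properties: str
    action: str


def query_llm(prompt: str) -> str:
    """Placeholder for a call to the underlying multimodal LLM."""
    return "<llm answer to: %s>" % prompt.splitlines()[-1]


def run_cot(instruction: str) -> CoTResult:
    # Stage 1: resolve the (possibly indirect) instruction to target objects.
    target = query_llm(STAGE_TEMPLATES["target_parsing"].format(instruction=instruction))
    # Stage 2: reason about grasp-relevant physical properties of the target.
    properties = query_llm(STAGE_TEMPLATES["property_analysis"].format(target=target))
    # Stage 3: select a grasp action consistent with those properties.
    action = query_llm(
        STAGE_TEMPLATES["action_selection"].format(target=target, properties=properties)
    )
    return CoTResult(target, properties, action)


print(run_cot("I need something to cut this tape with."))
```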
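The joint embedding of 3D-aware visual tokens and CoT-derived textual tokens can likewise be sketched as a single token sequence processed by a transformer backbone with a grasp pose head. The code below is a minimal PyTorch sketch under our own assumptions (token dimensions, learnable grasp queries, and a 3-translation-plus-6D-rotation pose parameterization), not the paper's exact architecture.

```python
# A minimal sketch of joint visual-textual embedding for grasp pose
# prediction: 3D-aware visual tokens and CoT-derived text tokens are
# concatenated with learnable grasp queries, processed by a transformer
# standing in for the LLM backbone, and decoded into 6-DoF poses.
# All dimensions and the pose head are illustrative assumptions.

import torch
import torch.nn as nn


class GraspPoseDecoder(nn.Module):
    def __init__(self, d_model: int = 512, n_layers: int = 4, n_queries: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Learnable grasp queries, one per candidate grasp (assumed design).
        self.queries = nn.Parameter(torch.randn(n_queries, d_model))
        # Head: 3 translation + 6D rotation, a common continuous pose encoding.
        self.pose_head = nn.Linear(d_model, 9)

    def forward(self, visual_tokens: torch.Tensor, text_tokens: torch.Tensor):
        b = visual_tokens.size(0)
        queries = self.queries.unsqueeze(0).expand(b, -1, -1)
        # Jointly embed visual, textual, and query tokens in one sequence.
        tokens = torch.cat([visual_tokens, text_tokens, queries], dim=1)
        out = self.backbone(tokens)
        # Decode grasp poses from the query positions only.
        return self.pose_head(out[:, -queries.size(1):])


model = GraspPoseDecoder()
vis = torch.randn(2, 128, 512)   # 3D-aware visual tokens from multi-view input
txt = torch.randn(2, 32, 512)    # CoT-derived textual tokens
print(model(vis, txt).shape)     # torch.Size([2, 8, 9])
```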