Enabling robotic systems to understand human language and execute grasping actions is a pivotal challenge in robotics. In target-oriented grasping, prior work matches human textual commands with images of target objects. However, these methods struggle to understand complex or flexible instructions. Moreover, they cannot autonomously assess the feasibility of an instruction, so they blindly execute grasping tasks even when no target object is present. In this paper, we introduce a combined model called QwenGrasp, which couples a large vision language model with a 6-DoF grasp network. By leveraging a pre-trained large vision language model, our approach can operate in open-world environments with natural human language, accepting complex and flexible instructions. Furthermore, the specialized grasp network ensures the effectiveness of the generated grasp pose. A series of experiments conducted in real-world environments shows that our method exhibits a superior ability to comprehend human intent. Additionally, when given erroneous instructions, our approach can suspend task execution and provide feedback to the human, improving safety.
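The control flow described in the abstract (locate the target from the instruction, choose a grasp on it, or suspend and report back when the instruction cannot be satisfied) can be pictured with a minimal sketch. This is not the authors' implementation. The names `Grasp`, `Decision`, and `select_grasp` are hypothetical, and the sketch assumes that the vision language model returns either a pixel bounding box for the target or nothing, and that each grasp candidate carries the image projection of its center and a quality score. Filtering candidates by the bounding box is likewise an illustrative assumption.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# (x_min, y_min, x_max, y_max) in pixels; assumed output format of the language model stage.
Box = Tuple[float, float, float, float]


@dataclass
class Grasp:
    pixel: Tuple[float, float]  # assumed: image projection of the grasp center
    score: float                # assumed: quality score from the grasp network
    pose: Tuple[float, ...]     # placeholder for a 6-DoF pose


@dataclass
class Decision:
    execute: bool
    grasp: Optional[Grasp]
    feedback: str


def select_grasp(target_box: Optional[Box], candidates: List[Grasp]) -> Decision:
    """Pick the best-scoring grasp on the target, or suspend with feedback."""
    # No target found for the instruction: do not grasp blindly.
    if target_box is None:
        return Decision(False, None, "No object matching the instruction was found; task suspended.")

    x0, y0, x1, y1 = target_box
    # Keep only candidates whose projected center lies inside the target box.
    on_target = [
        g for g in candidates
        if x0 <= g.pixel[0] <= x1 and y0 <= g.pixel[1] <= y1
    ]
    if not on_target:
        return Decision(False, None, "Target located, but no grasp candidate lies on it; task suspended.")

    best = max(on_target, key=lambda g: g.score)
    return Decision(True, best, "Executing grasp on the requested object.")
```

Returning an explicit `Decision` with a feedback message, rather than raising or silently returning nothing, mirrors the behavior the abstract emphasizes: an infeasible instruction yields a report to the human instead of a grasp attempt.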