Language-controlled, target-oriented grasping in unstructured scenes is essential for intelligent robot arms. A pivotal challenge is enabling the arm to understand human language and execute the corresponding grasping actions. In this paper, we propose QwenGrasp, a combined model that couples a large vision-language model with a 6-DoF grasp neural network. Given a textual instruction, QwenGrasp performs a 6-DoF grasp on the target object. We design a complete experiment with instructions covering six dimensions to test QwenGrasp in different cases. The results show that QwenGrasp has a superior ability to comprehend human intention. Even when given vague instructions with descriptive words, or instructions containing direction information, it grasps the target object accurately. When QwenGrasp receives an instruction that is infeasible or irrelevant to the grasping task, it suspends task execution and provides appropriate feedback to the human, which improves safety. In conclusion, by leveraging the power of a large vision-language model, QwenGrasp can be applied in open-language environments to perform target-oriented grasping from freely input instructions.
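As a rough illustration of how a vision-language model and a 6-DoF grasp network might be combined, the following minimal sketch routes an instruction through a grounding step and then selects a grasp on the grounded target. It is not the QwenGrasp implementation. The names `Grasp`, `Result`, `target_oriented_grasp`, and the `ground` callback are hypothetical, and the assumptions that the vision-language model returns a 2D bounding box (or nothing for infeasible or irrelevant instructions) and that grasp candidates are filtered by that box are ours, not statements from the paper.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Assumed grounding output: (x_min, y_min, x_max, y_max) in image pixels.
Box = Tuple[float, float, float, float]


@dataclass
class Grasp:
    """A 6-DoF grasp candidate, reduced to what this sketch needs."""
    pixel: Tuple[float, float]  # projection of the grasp centre into the image
    pose: Tuple[float, ...]     # hypothetical layout: (x, y, z, roll, pitch, yaw)
    score: float                # quality predicted by the grasp network


@dataclass
class Result:
    executed: bool
    grasp: Optional[Grasp]
    feedback: str


def inside(box: Box, pixel: Tuple[float, float]) -> bool:
    """Return True if the pixel lies within the bounding box (inclusive)."""
    x_min, y_min, x_max, y_max = box
    return x_min <= pixel[0] <= x_max and y_min <= pixel[1] <= y_max


def target_oriented_grasp(
    instruction: str,
    ground: Callable[[str], Optional[Box]],  # stand-in for the vision-language model
    candidates: List[Grasp],                 # stand-in for the grasp network output
) -> Result:
    """Pick the best-scoring grasp on the grounded target, or suspend the task."""
    box = ground(instruction)
    if box is None:
        # Assumed behaviour: no target means the instruction is infeasible or irrelevant.
        return Result(False, None, "No graspable target for this instruction; task suspended.")
    on_target = [g for g in candidates if inside(box, g.pixel)]
    if not on_target:
        return Result(False, None, "Target located, but no grasp candidate lies on it; task suspended.")
    best = max(on_target, key=lambda g: g.score)
    return Result(True, best, "Executing grasp on the target object.")
```

In this sketch the suspend-and-feedback path is an ordinary return value rather than an exception, so a caller can relay the `feedback` string to the human without special error handling. In a real system the `ground` callback would wrap a vision-language model call and `candidates` would come from a grasp network run on the scene.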