Robotic grasping faces new challenges in human-robot interaction scenarios. We consider the task in which a robot grasps a target object designated by a human's language directives. The robot must not only locate the target from vision-and-language information, but also predict reasonable grasp pose candidates under various views and postures. In this work, we propose a novel interactive grasp policy, named Visual-Lingual-Grasp (VL-Grasp), to grasp the target specified by human language. First, we build a new, challenging visual grounding dataset to provide functional training data for robotic interactive perception in indoor environments. Second, we propose a 6-DoF interactive grasp policy that combines visual grounding with 6-DoF grasp pose detection to extend the generality of interactive grasping. Third, we design a grasp pose filter module to enhance the performance of the policy. Experiments demonstrate the effectiveness and extensibility of VL-Grasp in the real world. VL-Grasp achieves a success rate of 72.5% across different indoor scenes. The code and dataset are available at https://github.com/luyh20/VL-Grasp.