In human communication, gestures are often preferred over, or used to complement, verbal expression because they offer better spatial reference. A finger-pointing gesture conveys vital information about a point of interest in the environment. In human-robot interaction, pointing lets a user easily direct a robot to a target location, for example in search and rescue or factory assistance. State-of-the-art approaches to visual pointing estimation often rely on depth cameras, are limited to indoor environments, and provide only discrete predictions among a limited set of targets. In this paper, we explore learning models that enable robots to understand pointing directives in various indoor and outdoor environments using only a single RGB camera. We propose a novel framework that includes a dedicated model termed PointingNet. PointingNet first recognizes the occurrence of pointing and then approximates the position and direction of the index finger. The model relies on a novel segmentation model for masking any lifted arm. While state-of-the-art human pose estimation models yield a poor pointing-angle estimation error of 28°, PointingNet exhibits a mean error of less than 2°. From the pointing information, the target is computed, after which the robot's motion is planned and executed. The framework is evaluated on two robotic systems and yields accurate target reaching.
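The abstract states that the target is computed from the estimated position and direction of the index finger, but it does not say how. One common way to do this, assumed here purely for illustration and not taken from the paper, is to treat the finger as a ray and intersect it with a flat ground plane. The function name `pointing_target_on_ground`, the coordinate convention (z axis pointing up, ground at `ground_z`), and the parameter names are all hypothetical.

```python
import numpy as np


def pointing_target_on_ground(finger_pos, finger_dir, ground_z=0.0):
    """Intersect a pointing ray with the horizontal plane z = ground_z.

    Illustrative sketch only: the flat-ground assumption and the
    z-up coordinate frame are assumptions, not the paper's method.

    finger_pos: 3D position of the index fingertip (x, y, z).
    finger_dir: 3D pointing direction (need not be normalized).
    Returns the 3D target point, or None if the ray never hits the plane.
    """
    p = np.asarray(finger_pos, dtype=float)
    d = np.asarray(finger_dir, dtype=float)

    norm = np.linalg.norm(d)
    if norm == 0.0:
        raise ValueError("finger_dir must be a non-zero vector")
    d = d / norm

    # A ray parallel to the ground never intersects it.
    if abs(d[2]) < 1e-9:
        return None

    # Solve p_z + t * d_z = ground_z for the ray parameter t.
    t = (ground_z - p[2]) / d[2]

    # A non-positive t means the plane lies behind the fingertip.
    if t <= 0.0:
        return None

    return p + t * d


if __name__ == "__main__":
    # Fingertip 1 m above the ground, pointing forward and down at 45 degrees.
    print(pointing_target_on_ground([0.0, 0.0, 1.0], [1.0, 0.0, -1.0]))
```

Under this assumption, a small angular error in the estimated direction translates into a target offset that grows with the distance along the ray, which is why the pointing-angle error reported in the abstract matters for accurate target reaching.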