Hand gestures play a significant role in human interaction, conveying non-verbal intentions, thoughts, and commands. In Human-Robot Interaction (HRI), hand gestures offer a similarly efficient medium for conveying clear and rapid directives to a robotic agent. However, state-of-the-art vision-based methods for gesture recognition have been shown to be effective only up to a user-camera distance of seven meters. Such a short range limits practical HRI with, for example, service robots, search and rescue robots, and drones. In this work, we address the Ultra-Range Gesture Recognition (URGR) problem in the context of HRI, aiming for a recognition distance of up to 25 meters. We propose a novel deep-learning framework for URGR that uses only a simple RGB camera. First, a novel super-resolution model termed HQ-Net enhances the low-resolution image of the user. Then, a novel URGR classifier termed Graph Vision Transformer (GViT) takes the enhanced image as input. GViT combines the benefits of a Graph Convolutional Network (GCN) and a modified Vision Transformer (ViT). Evaluated on diverse test data, the proposed framework achieves a high recognition rate of 98.1%. The framework also outperforms human recognition at ultra-range distances. Using the framework, we analyze and demonstrate the performance of an autonomous quadruped robot directed by human gestures in complex ultra-range indoor and outdoor environments.
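The two-stage structure described above (enhance the low-resolution image of the user, then classify the enhanced image) can be sketched as a simple function composition. This is only an illustration of the data flow, not the authors' implementation: `upscale_nearest` is a nearest-neighbour stand-in for HQ-Net, `brightest_channel` is a hypothetical placeholder for GViT, and the names `recognize`, `crop`, and the upscaling factor of 4 are assumptions introduced here.

```python
from typing import Callable

import numpy as np


def upscale_nearest(image: np.ndarray, factor: int) -> np.ndarray:
    """Stand-in for the super-resolution stage (NOT HQ-Net).

    Repeats every pixel `factor` times along height and width.
    """
    return np.repeat(np.repeat(image, factor, axis=0), factor, axis=1)


def brightest_channel(image: np.ndarray) -> int:
    """Hypothetical placeholder classifier (NOT GViT).

    Returns the index of the colour channel with the largest mean value,
    only so that the pipeline has something to output.
    """
    return int(image.mean(axis=(0, 1)).argmax())


def recognize(
    crop: np.ndarray,
    enhance: Callable[[np.ndarray], np.ndarray],
    classify: Callable[[np.ndarray], int],
) -> int:
    """Two-stage pipeline: enhance the crop first, then classify the result."""
    return classify(enhance(crop))


# A tiny synthetic crop of the user (8 x 6 pixels, RGB), assumed for illustration.
crop = np.zeros((8, 6, 3), dtype=np.uint8)
crop[..., 1] = 200  # make the green channel dominant

enhanced = upscale_nearest(crop, 4)
label = recognize(crop, lambda im: upscale_nearest(im, 4), brightest_channel)
```

Passing the two stages as callables keeps the pipeline independent of any particular model, so a learned super-resolution network and a learned classifier could be substituted without changing `recognize`.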