Ultra-Range Gesture Recognition using a Web-Camera in Human-Robot Interaction

Hand gestures play a significant role in human interaction, conveying non-verbal intentions, thoughts and commands. In Human-Robot Interaction (HRI), hand gestures offer a similar and efficient medium for conveying clear and rapid directives to a robotic agent. However, state-of-the-art vision-based methods for gesture recognition have been shown to be effective only up to a user-camera distance of seven meters. Such a short range limits practical HRI with, for example, service robots, search and rescue robots and drones. In this work, we address the Ultra-Range Gesture Recognition (URGR) problem by aiming for a recognition distance of up to 25 meters in the context of HRI. We propose the URGR framework, a novel deep-learning approach that uses solely a simple RGB camera. Gesture inference is based on a single image. First, a novel super-resolution model termed High-Quality Network (HQ-Net) uses a set of self-attention and convolutional layers to enhance the low-resolution image of the user. Then, a novel URGR classifier termed Graph Vision Transformer (GViT) takes the enhanced image as input. GViT combines the benefits of a Graph Convolutional Network (GCN) and a modified Vision Transformer (ViT). Evaluation of the proposed framework over diverse test data yields a high recognition rate of 98.1%. The framework also exhibits superior performance compared to human recognition at ultra-range distances. With the framework, we analyze and demonstrate the performance of an autonomous quadruped robot directed by human gestures in complex ultra-range indoor and outdoor environments, achieving a 96% recognition rate on average.
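To give a rough sense of why the image of the user becomes low-resolution at ultra-range, the sketch below applies the standard pinhole-camera approximation (projected width = focal length in pixels × object width / distance). The hand width of 0.10 m and the focal length of 1000 px are illustrative assumptions, not values taken from this work, and the function name `projected_width_px` is hypothetical.

```python
def projected_width_px(object_width_m: float,
                       distance_m: float,
                       focal_length_px: float) -> float:
    """Pinhole-camera approximation of the image width, in pixels,
    of an object facing the camera at a given distance."""
    if distance_m <= 0:
        raise ValueError("distance_m must be positive")
    return focal_length_px * object_width_m / distance_m


# Assumed, illustrative values (not from the paper).
HAND_WIDTH_M = 0.10        # approximate width of an adult hand
FOCAL_LENGTH_PX = 1000.0   # focal length expressed in pixels

if __name__ == "__main__":
    # Compare a close-range case with the 7 m and 25 m distances
    # mentioned in the abstract.
    for distance_m in (1.0, 7.0, 25.0):
        width = projected_width_px(HAND_WIDTH_M, distance_m, FOCAL_LENGTH_PX)
        print(f"{distance_m:5.1f} m -> {width:6.1f} px")
```

Under these assumed values, the hand spans roughly 14 pixels at 7 meters and only about 4 pixels at 25 meters, which suggests why an enhancement step before classification may be needed.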
