The teleoperation of robotic hands is limited by the high cost of the depth cameras and sensor gloves commonly used to estimate relative hand joint positions (XYZ). We present THETA, a novel, cost-effective approach that uses three webcams for triangulation-based tracking to approximate the relative joint angles (theta) of human fingers. We also introduce a modified DexHand, a low-cost robotic hand from TheRobotStudio, to demonstrate THETA's real-time application. Data collection covered 40 distinct hand gestures captured by three 640x480 webcams arranged at 120-degree intervals, yielding over 48,000 RGB images. Ground-truth joint angles were determined manually by measuring the midpoints of the MCP, PIP, and DIP finger joints. Captured RGB frames were processed with a DeepLabV3 segmentation model using a ResNet-50 backbone for multi-scale hand segmentation. The segmented images were then HSV-filtered and fed into THETA's architecture, which consists of a MobileNetV2-based CNN classifier optimized for hierarchical spatial feature extraction and a 9-channel input tensor encoding multi-perspective hand representations. The classification model maps segmented hand views to discrete joint angles, achieving 97.18% accuracy, 98.72% recall, an F1 score of 0.9274, and a precision of 0.8906. During real-time inference, THETA captures simultaneous frames from all three cameras, segments the hand regions, filters them, and compiles a 9-channel tensor for classification. Joint-angle predictions are relayed over a serial connection to an Arduino, enabling the DexHand to replicate hand movements. Future research will increase dataset diversity, integrate wrist tracking, and apply computer vision techniques such as OpenAI-Vision. THETA offers the potential for cost-effective, user-friendly teleoperation in medical, linguistic, and manufacturing applications.
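The 9-channel input tensor described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, normalization, and channel ordering are assumptions, and the segmentation and HSV-filtering stages are stubbed out with dummy frames standing in for the three processed camera views.

```python
import numpy as np

NUM_CAMS = 3     # webcams arranged at 120-degree intervals
H, W = 480, 640  # 640x480 capture resolution

def build_input_tensor(views):
    """Stack three segmented, HSV-filtered 3-channel views into a single
    9-channel, channels-first tensor for the CNN classifier.

    views: list of three (H, W, 3) uint8 arrays, one per camera.
    """
    assert len(views) == NUM_CAMS
    # Scale each view to [0, 1] and move channels first: (3, H, W).
    chans = [v.astype(np.float32).transpose(2, 0, 1) / 255.0 for v in views]
    # Concatenate along the channel axis -> (9, H, W).
    return np.concatenate(chans, axis=0)

# Dummy frames in place of real segmented, HSV-filtered captures:
frames = [np.zeros((H, W, 3), dtype=np.uint8) for _ in range(NUM_CAMS)]
tensor = build_input_tensor(frames)
print(tensor.shape)  # (9, 480, 640)
```

Stacking the three perspectives along the channel axis lets a single forward pass of the MobileNetV2-based classifier see all viewpoints at once, rather than running one inference per camera.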
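The serial relay of joint-angle predictions to the Arduino can be sketched as below. The framing format, joint count, and baud rate are all assumptions for illustration; the abstract specifies only that predictions travel over serial.

```python
import struct

NUM_JOINTS = 15  # assumed: MCP, PIP, and DIP joints for five fingers

def encode_angles(angles_deg):
    """Pack discrete joint-angle predictions into a framed byte message:
    a 0xFF start byte followed by one unsigned byte per joint, clipped to
    0-180 degrees (the range a typical Arduino servo library expects)."""
    assert len(angles_deg) == NUM_JOINTS
    clipped = [max(0, min(180, int(a))) for a in angles_deg]
    return struct.pack(f"B{NUM_JOINTS}B", 0xFF, *clipped)

msg = encode_angles([90] * NUM_JOINTS)
print(len(msg))  # 16 bytes: 1 header byte + 15 joint angles
# With pyserial (assumed transport, port name illustrative):
# serial.Serial("/dev/ttyACM0", 115200).write(msg)
```

A fixed start byte gives the Arduino-side reader a cheap way to resynchronize if bytes are dropped mid-frame.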