Sign language data processing is complex and poses many challenges. Current approaches to ASL recognition aim to translate RGB sign language videos, via pose information, into English-based ID glosses that uniquely identify ASL signs. This paper proposes SignX, a novel framework for continuous sign language recognition in a compact, pose-rich latent space. First, we construct a unified latent representation that encodes heterogeneous pose formats (SMPLer-X, DWPose, Mediapipe, PrimeDepth, and Sapiens Segmentation) into a compact, information-dense space. Second, we train a ViT-based Video2Pose module to extract this latent representation directly from raw video. Finally, we develop a temporal modeling and sequence refinement method that operates entirely in this latent space. This multi-stage design achieves end-to-end sign language recognition while significantly reducing computational cost. Experimental results demonstrate that SignX achieves state-of-the-art accuracy on continuous sign language recognition.
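To make the multi-stage design concrete, the following is a minimal PyTorch sketch of how such a pipeline could be wired together. It is not the authors' implementation: the module names (`PoseFusionEncoder`, `TemporalRecognizer`), the feature dimensions, the simple linear fusion, and the CTC-style gloss head are all assumptions for illustration.

```python
import torch
import torch.nn as nn

class PoseFusionEncoder(nn.Module):
    """Hypothetical encoder fusing heterogeneous per-frame pose features
    (e.g., SMPLer-X, DWPose, Mediapipe streams) into one compact latent
    vector per frame. Feature dimensions are assumptions."""
    def __init__(self, feat_dims, latent_dim=256):
        super().__init__()
        # One projection per input pose format, mapped to a shared width.
        self.projections = nn.ModuleList(
            nn.Linear(d, latent_dim) for d in feat_dims
        )
        self.fuse = nn.Linear(latent_dim * len(feat_dims), latent_dim)

    def forward(self, features):
        # features: list of (B, T, d_i) tensors, one per pose format.
        projected = [proj(f) for proj, f in zip(self.projections, features)]
        return self.fuse(torch.cat(projected, dim=-1))  # (B, T, latent_dim)

class TemporalRecognizer(nn.Module):
    """Hypothetical temporal model over the latent sequence with a
    CTC-style gloss classifier; the actual SignX refinement may differ."""
    def __init__(self, latent_dim=256, num_glosses=1000, num_layers=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=latent_dim, nhead=8, batch_first=True
        )
        self.temporal = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.classifier = nn.Linear(latent_dim, num_glosses + 1)  # +1 CTC blank

    def forward(self, latents):
        # latents: (B, T, latent_dim), e.g., from a Video2Pose-style stage.
        return self.classifier(self.temporal(latents))  # (B, T, glosses+1)

# Usage sketch: fuse three assumed pose streams, then predict gloss logits.
feats = [torch.randn(2, 64, d) for d in (99, 134, 75)]  # assumed dims
latents = PoseFusionEncoder([99, 134, 75])(feats)
logits = TemporalRecognizer()(latents)
```

In this sketch the per-frame latent plays the role of the paper's unified pose representation: downstream temporal modeling sees only the compact latent sequence, never the raw video or the individual pose formats, which is what allows the reduced computational cost claimed above.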