We present SignAvatars, the first large-scale, multi-prompt 3D sign language (SL) motion dataset designed to bridge the communication gap for Deaf and hard-of-hearing individuals. Although research on digital communication has grown exponentially, most existing communication technologies primarily cater to spoken or written languages rather than SL, the essential communication method for Deaf and hard-of-hearing communities. Existing SL datasets, dictionaries, and sign language production (SLP) methods are typically limited to 2D, as annotating 3D models and avatars for SL is usually an entirely manual, labor-intensive process conducted by SL experts that often results in unnatural avatars. In response to these challenges, we compile and curate the SignAvatars dataset, which comprises 70,000 videos from 153 signers, totaling 8.34 million frames, covering both isolated signs and continuous, co-articulated signs, with multiple prompts including HamNoSys, spoken language, and words. To yield 3D holistic annotations, including meshes and biomechanically valid poses of the body, hands, and face, as well as 2D and 3D keypoints, we introduce an automated annotation pipeline that operates on our large corpus of SL videos. SignAvatars facilitates various tasks such as 3D sign language recognition (SLR) and novel 3D SLP from diverse inputs including text scripts, individual words, and HamNoSys notation. To evaluate the potential of SignAvatars, we further propose a unified benchmark for 3D SL holistic motion production. We believe this work is a significant step toward bringing the digital world to the Deaf and hard-of-hearing communities, as well as to people who interact with them.