SignAvatars: A Large-scale 3D Sign Language Holistic Motion Dataset and Benchmark

In this paper, we present SignAvatars, the first large-scale multi-prompt 3D sign language (SL) motion dataset, designed to bridge the communication gap for hearing-impaired individuals. While research on digital communication has grown exponentially, most existing communication technologies cater to spoken or written languages rather than SL, the essential communication method for hearing-impaired communities. Existing SL datasets, dictionaries, and sign language production (SLP) methods are typically limited to 2D, because annotating 3D models and avatars for SL is usually an entirely manual, labor-intensive process carried out by SL experts, and it often results in unnatural avatars. In response to these challenges, we compile and curate the SignAvatars dataset, which comprises 70,000 videos from 153 signers, totaling 8.34 million frames, and covers both isolated signs and continuous, co-articulated signs, with multiple prompts including HamNoSys, spoken language, and words. To yield 3D holistic annotations, including meshes and biomechanically valid poses of the body, hands, and face, as well as 2D and 3D keypoints, we introduce an automated annotation pipeline that operates on our large corpus of SL videos. SignAvatars facilitates various tasks, such as 3D sign language recognition (SLR) and the novel task of 3D SL production (SLP) from diverse inputs such as text scripts, individual words, and HamNoSys notation. To evaluate the potential of SignAvatars, we further propose a unified benchmark for 3D SL holistic motion production. We believe that this work is a significant step towards bringing the digital world to hearing-impaired communities. Our project page is at https://signavatars.github.io/
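
To give a rough sense of scale, the headline figures quoted above can be turned into simple averages. The sketch below is only back-of-the-envelope arithmetic on the rounded numbers from the abstract; the constant and function names are illustrative, and the actual per-video and per-signer distributions are not stated here and may be far from uniform.

```python
# Headline figures as quoted in the abstract (the frame count is rounded).
NUM_VIDEOS = 70_000
NUM_SIGNERS = 153
NUM_FRAMES = 8_340_000  # "8.34 million frames"


def dataset_averages(num_videos, num_signers, num_frames):
    """Return (frames per video, videos per signer) as plain averages."""
    return num_frames / num_videos, num_videos / num_signers


frames_per_video, videos_per_signer = dataset_averages(
    NUM_VIDEOS, NUM_SIGNERS, NUM_FRAMES
)
print(f"about {frames_per_video:.0f} frames per video")    # about 119
print(f"about {videos_per_signer:.0f} videos per signer")  # about 458
```

These are averages only. The abstract gives no frame rate, so the figures cannot be converted into clip durations without a further assumption.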
