Sign Languages (SL) serve as the predominant mode of communication for the Deaf and Hard of Hearing communities. Deep learning has enabled numerous methods for SL recognition and translation to achieve remarkable results. However, Sign Language Production (SLP) remains a challenge for the computer vision community, as the generated motions must be both realistic and semantically precise. Most SLP methods rely on 2D data, which limits the level of realism they can attain. In this work, we propose a diffusion-based SLP model trained on a curated large-scale dataset of 4D signing avatars and their corresponding text transcripts. The proposed method can generate dynamic sequences of 3D avatars from an unconstrained domain of discourse using a diffusion process built on a novel, anatomically informed graph neural network defined on the SMPL-X body skeleton. Through a series of quantitative and qualitative experiments, we show that the proposed method considerably outperforms previous SLP methods. We believe that this work presents an important and necessary step towards realistic neural sign avatars and towards bridging the communication gap between Deaf and hearing communities. The code, method, and generated data will be made publicly available.