Over the past decade, the field of Sign Language Production (SLP) has lacked a large-scale, pre-trained deep learning model for continuous American Sign Language (ASL) production. This limitation hampers communication for individuals with disabilities who rely on ASL. To address this issue, we undertook secondary development and use of How2Sign, one of the largest publicly available ASL datasets. Despite its significance, prior sign language researchers have not used this corpus effectively because of the intricacies of American Sign Language Production (ASLP). To conduct large-scale ASLP, we propose SignDiff, a dual-condition diffusion pre-training model, built on the latest work in related fields, that can generate human sign language speakers from a skeleton pose. SignDiff includes a novel Frame Reinforcement Network called FR-Net, similar to dense human pose estimation work, which enhances the correspondence between text lexical symbols and sign language dense pose frames and reduces the occurrence of multiple-finger artifacts in the diffusion model. In addition, our ASLP method proposes two new improved modules and a new loss function to improve the accuracy and quality of sign language skeletal poses and to enhance the model's ability to train on large-scale data. We propose the first baseline for ASL production and report BLEU-4 scores of 17.19 and 12.85 on the How2Sign dev and test sets, respectively. We also evaluated our model on PHOENIX14T, the previous mainstream dataset, where the main experiments achieved state-of-the-art (SOTA) results. In addition, our image quality exceeds all previous results by 10 percentage points on the SSIM metric. Finally, we conducted ablation studies and qualitative evaluations for discussion.