SignDiff: Learning Diffusion Models for American Sign Language Production

Over the past decade, the field of Sign Language Production (SLP) has lacked a large-scale, pre-trained deep learning model for continuous American Sign Language (ASL) production. This limitation hampers communication for individuals with disabilities who rely on ASL. To address this issue, we further developed and used How2Sign, one of the largest publicly available ASL datasets. Despite its significance, prior researchers in the field have not used this corpus effectively because of the intricacies of American Sign Language Production (ASLP). To conduct large-scale ASLP, we propose SignDiff, a dual-condition diffusion pre-training model, built on recent work in related fields, that generates human sign language speakers from a skeleton pose. SignDiff includes a novel Frame Reinforcement Network, FR-Net, similar to dense human pose estimation work, which strengthens the correspondence between text lexical symbols and sign language dense pose frames and reduces the occurrence of multiple fingers in the diffusion model's output. In addition, our ASLP method introduces two improved modules and a new loss function to improve the accuracy and quality of sign language skeletal poses and to enhance the model's ability to train on large-scale data. We propose the first baseline for ASL production and report BLEU-4 scores of 17.19 and 12.85 on the How2Sign dev and test sets, respectively. We also evaluated our model on PHOENIX14T, the previous mainstream dataset, where the main experiments achieved state-of-the-art (SOTA) results. In addition, our image quality exceeds all previous results by 10 percentage points on the SSIM metric. Finally, we conducted ablation studies and qualitative evaluations for discussion.
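For readers unfamiliar with the SSIM metric cited above, the following is a minimal sketch of the standard SSIM formula applied once over a whole image (a single global window). It is not the authors' evaluation code: the function name `global_ssim`, the flat-list image representation, and the 8-bit `data_range` default are assumptions made for illustration. Library implementations usually average SSIM over local windows instead.

```python
def global_ssim(x, y, data_range=255.0):
    """Single-window SSIM between two equally sized grayscale images.

    x and y are flat sequences of pixel intensities (hypothetical
    representation chosen for this sketch). Returns a value in [-1, 1],
    where 1 means the images are identical.
    """
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("images must be non-empty and the same size")

    n = len(x)
    # Standard stabilizing constants from the SSIM definition.
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2

    mu_x = sum(x) / n
    mu_y = sum(y) / n
    # Population variances and covariance over the whole image.
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov_xy = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n

    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


if __name__ == "__main__":
    reference = [0, 50, 100, 150, 200, 250]
    generated = [10, 40, 110, 140, 210, 240]
    print(global_ssim(reference, reference))  # identical images
    print(global_ssim(reference, generated))  # slightly perturbed image
```

If SSIM is reported on a 0 to 1 scale, a gain of 10 percentage points would correspond to a difference of 0.10 between two such scores; whether the paper reports it this way is an assumption here.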

