Sign Language Production (SLP) is a challenging task, given the limited resources available and the inherent diversity within sign data. As a result, previous works have suffered from regression to the mean, producing under-articulated and incomprehensible signing. In this paper, we propose using dictionary examples and a learnt codebook of facial expressions to create expressive sign language sequences. However, simply concatenating signs and adding the face creates robotic and unnatural sequences. To address this, we present a 7-step approach to effectively stitch sequences together. First, by normalizing each sign into a canonical pose, cropping, and stitching, we create a continuous sequence. Then, by applying filtering in the frequency domain and resampling each sign, we create cohesive, natural sequences that mimic the prosody found in the original data. We leverage a SignGAN model to map the output to a photo-realistic signer and present a complete Text-to-Sign (T2S) SLP pipeline. Our evaluation demonstrates the effectiveness of the approach, showing state-of-the-art performance across all datasets. Finally, a user evaluation shows that our approach outperforms the baseline model and is capable of producing realistic sign language sequences.
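To make the stitching idea concrete, the following is a minimal sketch of four of the operations named above (normalization, resampling, stitching, and frequency-domain filtering) on skeleton pose sequences. It is not the 7-step method itself: the array layout `(frames, joints, dims)`, the joint indices, the linear transition frames, the hard low-pass cutoff, and every function and parameter name here are assumptions made for illustration. Cropping, the facial expression codebook, and the SignGAN rendering stage are omitted.

```python
import numpy as np


def normalize_pose(seq, root=0, left=1, right=2):
    """Centre each frame on a root joint and scale so the mean
    left-right shoulder distance is 1. Joint indices are hypothetical."""
    centred = seq - seq[:, root:root + 1, :]
    width = np.linalg.norm(seq[:, left] - seq[:, right], axis=-1).mean()
    return centred / max(width, 1e-8)


def resample(seq, n_frames):
    """Linearly resample a (T, J, D) sequence to n_frames frames."""
    t_in = seq.shape[0]
    src = np.linspace(0.0, 1.0, t_in)
    dst = np.linspace(0.0, 1.0, n_frames)
    flat = seq.reshape(t_in, -1)
    cols = [np.interp(dst, src, flat[:, k]) for k in range(flat.shape[1])]
    return np.stack(cols, axis=1).reshape((n_frames,) + seq.shape[1:])


def stitch(signs, gap=4):
    """Concatenate signs, inserting `gap` linearly interpolated frames
    between the last pose of one sign and the first pose of the next."""
    pieces = [signs[0]]
    for nxt in signs[1:]:
        a, b = pieces[-1][-1], nxt[0]
        w = np.linspace(0.0, 1.0, gap + 2)[1:-1].reshape(-1, 1, 1)
        pieces.append((1.0 - w) * a + w * b)
        pieces.append(nxt)
    return np.concatenate(pieces, axis=0)


def lowpass(seq, keep_ratio=0.25):
    """Zero all but the lowest `keep_ratio` fraction of temporal
    frequency bins (a hard cutoff, assumed here for simplicity)."""
    t_in = seq.shape[0]
    spec = np.fft.rfft(seq, axis=0)
    cutoff = max(1, int(np.ceil(keep_ratio * spec.shape[0])))
    spec[cutoff:] = 0
    return np.fft.irfft(spec, n=t_in, axis=0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Two made-up "dictionary signs": 10 and 12 frames, 3 joints, 2D.
    signs = [rng.normal(size=(10, 3, 2)), rng.normal(size=(12, 3, 2))]
    signs = [resample(normalize_pose(s), 16) for s in signs]
    sequence = lowpass(stitch(signs, gap=4))
    print(sequence.shape)
```

In this sketch the resampling length stands in for per-sign duration; in a full system the target durations would presumably be chosen to match the prosody of the original data, as the abstract describes.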