Sign Language Production (SLP) is the task of converting input text into realistic sign language video. Most prior work has focused on the Text2Gloss, Gloss2Pose, and Pose2Vid stages, with some concentrating on the Prompt2Gloss and Text2Avatar stages. Progress in this field has been slow, however, because inaccuracies in text conversion, pose generation, and the rendering of poses into realistic human video accumulate across these stages. In this paper, we therefore streamline the traditionally redundant structure, simplify and optimize the task objective, and design a new sign language generative model called Stable Signer. It redefines SLP as a hierarchical end-to-end generation task comprising only text understanding (Prompt2Gloss, Text2Gloss) and Pose2Vid: text understanding is performed by our proposed Sign Language Understanding Linker (SLUL), and hand gestures are generated by a hand-gesture rendering expert block named SLP-MoE, yielding high-quality, multi-style sign language videos end to end. SLUL is trained with a newly developed Semantic-Aware Gloss Masking Loss (SAGM Loss). Performance improves by 48.6% over current state-of-the-art generation methods.
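The two-stage structure described above can be sketched as a minimal pipeline. All class names, method signatures, and the stub logic below are illustrative assumptions for exposition only — the paper's actual SLUL and SLP-MoE are learned neural components, not rule-based stubs.

```python
# Hypothetical sketch of the Stable Signer pipeline from the abstract:
# stage 1 (SLUL) maps text/prompts to a gloss sequence, stage 2 (SLP-MoE)
# renders glosses into video frames. Names and outputs are stand-ins.

from dataclasses import dataclass
from typing import List


@dataclass
class GlossSequence:
    glosses: List[str]


class SLUL:
    """Stand-in for the Sign Language Understanding Linker (text -> gloss)."""

    def understand(self, text: str) -> GlossSequence:
        # Placeholder: the real SLUL is a model trained with SAGM Loss.
        return GlossSequence(glosses=text.upper().split())


class SLPMoE:
    """Stand-in for the SLP-MoE hand-gesture rendering expert block."""

    def render(self, seq: GlossSequence, style: str = "default") -> List[str]:
        # Placeholder: emits one labeled "frame" per gloss instead of video.
        return [f"{style}:{g}" for g in seq.glosses]


class StableSigner:
    """End-to-end pipeline: text understanding, then gesture rendering."""

    def __init__(self) -> None:
        self.linker = SLUL()
        self.renderer = SLPMoE()

    def generate(self, text: str, style: str = "default") -> List[str]:
        return self.renderer.render(self.linker.understand(text), style)


frames = StableSigner().generate("hello world")
# frames -> ["default:HELLO", "default:WORLD"]
```

The point of the sketch is the control flow: errors can only enter at two stages (understanding and rendering) rather than accumulating across the longer Text2Gloss → Gloss2Pose → Pose2Vid chain of prior pipelines.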