Sign language generation (SLG) aims to translate written text into expressive sign motions, bridging communication barriers for the Deaf and Hard-of-Hearing communities. Recent studies formulate SLG within the language modeling framework using autoregressive language models, which suffer from unidirectional context modeling and slow token-by-token inference. To address these limitations, we present MaDiS, a masked-diffusion-based language model for SLG that captures bidirectional dependencies and supports efficient parallel multi-token generation. We further introduce a tri-level cross-modal pretraining scheme that jointly learns from token-, latent-, and 3D-physical-space objectives, leading to richer and more grounded sign representations. To accelerate model convergence in the fine-tuning stage, we design a novel unmasking strategy with temporal checkpoints, reducing the combinatorial complexity of unmasking orders by a factor of over $10^{41}$. In addition, a mixture-of-parts embedding layer is developed to effectively fuse the information carried by different part-wise sign tokens through learnable gates and well-optimized codebooks. Extensive experiments on CSL-Daily, Phoenix-2014T, and How2Sign demonstrate that MaDiS achieves superior performance across multiple metrics, including DTW error and two newly introduced metrics, SiBLEU and SiCLIP, while reducing inference latency by nearly 30%. Code and models will be released on our project page.
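The parallel multi-token generation described above follows the general masked-diffusion decoding recipe: start from a fully masked sequence and, at each step, commit the model's most confident predictions for several masked positions at once, rather than decoding one token left to right. The sketch below illustrates only this generic decoding loop; the `toy_predict` function, the fixed target sequence, and all names are illustrative placeholders, not MaDiS's actual model, tokenizer, or unmasking schedule (which additionally uses temporal checkpoints).

```python
MASK = -1  # placeholder id for a masked position

def toy_predict(tokens):
    """Stand-in for a bidirectional model: for every position, return a
    (predicted_token, confidence) pair. A fixed target sequence plays the
    role of the model's prediction; confidence decays with position index
    purely for illustration."""
    target = [7, 3, 9, 1, 4, 8, 2, 6]
    return [(target[i], 1.0 - 0.05 * i) for i in range(len(tokens))]

def masked_diffusion_decode(length, tokens_per_step):
    """Iterative parallel unmasking: each step fills in the
    `tokens_per_step` masked positions with the highest model
    confidence, instead of generating one token at a time."""
    tokens = [MASK] * length
    steps = 0
    while MASK in tokens:
        preds = toy_predict(tokens)
        # Rank the still-masked positions by predicted confidence.
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        # Commit the top-k predictions in parallel.
        for i in masked[:tokens_per_step]:
            tokens[i] = preds[i][0]
        steps += 1
    return tokens, steps

seq, steps = masked_diffusion_decode(length=8, tokens_per_step=4)
print(seq, steps)  # 8 tokens emitted in 2 steps vs. 8 autoregressive steps
```

With `tokens_per_step=4`, an 8-token sequence is produced in 2 decoding steps instead of 8, which is the source of the latency reduction the abstract reports; the confidence-based ordering is one common unmasking heuristic among several.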