Sign language generation (SLG) aims to translate written text into expressive sign motions, lowering communication barriers for the Deaf and Hard-of-Hearing communities. Recent studies formulate SLG within the language modeling framework using autoregressive language models, which suffer from unidirectional context modeling and slow token-by-token inference. To address these limitations, we present MaDiS, a masked-diffusion-based language model for SLG that captures bidirectional dependencies and supports efficient parallel multi-token generation. We further introduce a tri-level cross-modal pretraining scheme that jointly learns from token-, latent-, and 3D physical-space objectives to leverage complementary, multi-level sign representations. To accelerate model convergence in the fine-tuning stage, we design a novel unmasking strategy with temporal checkpoints, which restructures generation in a coarse-to-fine manner and reduces the combinatorial complexity of unmasking orders by over $10^{41}$ times. In addition, a mixture-of-parts embedding layer is developed to effectively fuse the information stored in different part-wise sign tokens through a learnable gate and well-optimized codebooks. Extensive experiments on CSL-Daily, Phoenix-2014T, and How2Sign demonstrate that MaDiS achieves superior performance across multiple metrics, including DTW error and two newly introduced metrics, SiBLEU and SiCLIP, while delivering 40\% higher throughput. Code and models will be publicly released.
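The parallel multi-token generation described above can be illustrated with a minimal, self-contained sketch of confidence-based iterative unmasking: start from an all-mask sequence, score every masked position with a bidirectional model, and commit the most confident predictions at each step. The model here is a random stand-in (the abstract does not specify MaDiS's architecture), and all names are hypothetical.

```python
import numpy as np

MASK = -1  # sentinel id for a still-masked position

def toy_logits(tokens, vocab_size, rng):
    # Stand-in for a bidirectional transformer: random per-position scores.
    return rng.random((len(tokens), vocab_size))

def masked_diffusion_decode(length, vocab_size, steps, seed=0):
    rng = np.random.default_rng(seed)
    tokens = np.full(length, MASK)
    per_step = int(np.ceil(length / steps))  # tokens revealed per iteration
    while (tokens == MASK).any():
        logits = toy_logits(tokens, vocab_size, rng)
        probs = logits / logits.sum(axis=1, keepdims=True)
        conf = probs.max(axis=1)
        conf[tokens != MASK] = -np.inf      # keep committed positions fixed
        k = min(per_step, int((tokens == MASK).sum()))
        pick = np.argsort(conf)[-k:]        # unmask the k most confident slots
        tokens[pick] = probs[pick].argmax(axis=1)
    return tokens
```

Because each iteration commits up to `length / steps` tokens at once, the number of model calls scales with `steps` rather than with sequence length, which is the source of the throughput advantage over token-by-token autoregressive decoding.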
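The claimed reduction in the combinatorial complexity of unmasking orders can be made concrete with a small counting sketch. Without constraints, $L$ masked tokens admit $L!$ possible unmasking orders; if temporal checkpoints force each segment to be fully unmasked before the next begins, only the within-segment orders remain. The segment sizes below are hypothetical, chosen only to show how quickly the reduction factor becomes astronomical.

```python
from math import factorial

def order_reduction(total_len, segments):
    # Unconstrained: any of total_len! unmasking orders is possible.
    # With checkpoints: each segment finishes before the next starts,
    # leaving only the product of per-segment factorials.
    assert sum(segments) == total_len
    constrained = 1
    for s in segments:
        constrained *= factorial(s)
    return factorial(total_len) // constrained

# Hypothetical setup: 100 tokens split into 5 checkpointed segments of 20.
reduction = order_reduction(100, [20] * 5)
```

With these illustrative numbers the reduction factor already exceeds $10^{41}$ (roughly $10^{66}$), so a figure of that magnitude is plausible for realistic sequence lengths; the exact factor depends on the checkpoint layout the paper uses.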