Masked diffusion language models (MDLMs) have emerged as a promising alternative to dominant autoregressive approaches. Although they achieve competitive performance on several tasks, a substantial gap remains in open-ended text generation. We hypothesize that one cause of this gap is that strict positional prediction makes MDLM decoding highly sensitive to token misalignment, and we show through controlled interventions that a one-position shift can severely disrupt semantic coherence. This observation suggests that enforcing strict positional supervision during training is misaligned with the irreversible denoising dynamics of MDLM decoding. Motivated by this mismatch, we adopt an alignment-flexible supervision strategy during fine-tuning: we introduce a special token <slack> via the connectionist temporal classification (CTC) objective. We apply this approach to the widely used MDLM model and conduct experiments on five open-ended text generation benchmarks. Our method consistently outperforms the original model and improves robustness to positional shifts, indicating that relaxing strict positional supervision is an important factor in improving generation quality in MDLMs.
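The alignment flexibility described above comes from treating <slack> like the blank symbol in the CTC objective: the loss sums over all alignments that collapse to the target after removing <slack>, so a one-position shift is not penalized the way strictly positional cross-entropy penalizes it. The following is a minimal, self-contained sketch of the CTC forward algorithm on a toy vocabulary; it is illustrative only (the token indices, distributions, and helper names are assumptions, not the paper's training code):

```python
import math

# Toy sketch: <slack> plays the role of the CTC blank (index 0 here).
SLACK = 0

def _logsumexp(xs):
    m = max(xs)
    if m == float("-inf"):
        return m
    return m + math.log(sum(math.exp(x - m) for x in xs))

def ctc_log_likelihood(log_probs, target):
    """Forward-algorithm log P(target | log_probs), summing over every
    alignment that collapses to `target` after removing <slack>."""
    # Interleave <slack>: [a, b] -> [<slack>, a, <slack>, b, <slack>]
    ext = [SLACK]
    for tok in target:
        ext += [tok, SLACK]
    S, T = len(ext), len(log_probs)

    NEG = float("-inf")
    alpha = [NEG] * S
    alpha[0] = log_probs[0][ext[0]]
    if S > 1:
        alpha[1] = log_probs[0][ext[1]]

    for t in range(1, T):
        new = [NEG] * S
        for s in range(S):
            cands = [alpha[s]]                 # stay on the same slot
            if s >= 1:
                cands.append(alpha[s - 1])     # advance one slot
            if s >= 2 and ext[s] != SLACK and ext[s] != ext[s - 2]:
                cands.append(alpha[s - 2])     # skip over a <slack>
            new[s] = _logsumexp(cands) + log_probs[t][ext[s]]
        alpha = new

    tails = [alpha[-1]] + ([alpha[-2]] if S > 1 else [])
    return _logsumexp(tails)

# Toy vocab: 0=<slack>, 1="a", 2="b"; three positions of per-token probs.
def to_log(rows):
    return [[math.log(p) for p in row] for row in rows]

aligned = to_log([[.05, .9, .05], [.05, .05, .9], [.9, .05, .05]])  # a b <slack>
shifted = to_log([[.9, .05, .05], [.05, .9, .05], [.05, .05, .9]])  # <slack> a b

ll_aligned = ctc_log_likelihood(aligned, [1, 2])
ll_shifted = ctc_log_likelihood(shifted, [1, 2])
# CTC credits both placements of "a b" equally, whereas strictly positional
# cross-entropy would heavily penalize the one-position shift.
```

The symmetric toy example makes the point of the abstract concrete: the shifted prediction table receives the same CTC likelihood as the aligned one, so supervision through this objective is invariant to the kind of positional shift that the controlled interventions show to be damaging under strict positional prediction.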