Masked diffusion models (MDMs) have emerged as a promising approach for language modeling, yet they face a performance gap compared to autoregressive models (ARMs) and require more training iterations. In this work, we present the Auto-Regressive Masked Diffusion (ARMD) model, an architecture designed to close this gap by unifying the training efficiency of autoregressive models with the parallel generation capabilities of diffusion-based models. Our key insight is to reframe the masked diffusion process as a block-wise causal model. This perspective allows us to design a strictly causal, permutation-equivariant architecture that computes all conditional probabilities across multiple denoising steps in a single, parallel forward pass. The resulting architecture supports efficient, autoregressive-style decoding and a progressive permutation training scheme, allowing the model to learn both canonical left-to-right and random token orderings. Leveraging this flexibility, we introduce a novel strided parallel generation strategy that accelerates inference by generating tokens in parallel streams while maintaining global coherence. Empirical results demonstrate that ARMD achieves state-of-the-art performance on standard language modeling benchmarks, outperforming established diffusion baselines while requiring significantly fewer training steps. Furthermore, it establishes a new benchmark for parallel text generation, effectively bridging the performance gap between parallel and sequential decoding.