Diffusion Language Models (DLMs) are often advertised as enabling parallel token generation, yet practical fast DLMs frequently converge to left-to-right, autoregressive (AR)-like decoding dynamics. In contrast, genuinely non-AR generation is promising because it removes AR's sequential bottleneck, better exploiting parallel hardware to reduce synchronization/communication overhead and to improve latency scaling with output length. We argue that a primary driver of AR-like decoding is a mismatch between DLM objectives and the highly sequential structure of widely used training data, including standard pretraining corpora and long chain-of-thought (CoT) supervision. Motivated by this diagnosis, we propose NAP (Non-Autoregressive Parallel DLMs), a proof-of-concept, data-centric approach that better aligns supervision with non-AR parallel decoding. NAP curates training examples composed of multiple independent reasoning trajectories and couples them with a parallel-forced decoding strategy that encourages multi-token parallel updates. Across math reasoning benchmarks, NAP yields stronger performance under parallel decoding than DLMs trained on standard long CoT data, with gains growing as parallelism increases. Our results suggest that revisiting data and supervision is a principled direction for mitigating AR-like behavior and moving toward genuinely non-autoregressive parallel generation in DLMs. Our code is available at https://github.com/pixeli99/NAP.
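The abstract describes the parallel-forced decoding strategy only at a high level. As a hedged illustration of what committing several tokens per step might look like for a masked-diffusion LM, the sketch below unmasks at least `force_k` of the most confident masked positions at every step instead of one. The function name, the `force_k` parameter, the HuggingFace-style `model(x).logits` call, and the confidence-based selection rule are illustrative assumptions, not the paper's exact method.

```python
import torch

def parallel_forced_decode(model, prompt_ids, mask_id, gen_len=64, force_k=4):
    """Hypothetical parallel-forced decoding sketch for a masked-diffusion LM.

    Each step commits (at least) the `force_k` most confident masked positions,
    rather than the one-token-per-step schedule that mimics AR decoding.
    Assumes `model(x)` returns an object with per-position `.logits`
    of shape (batch, seq_len, vocab), and `mask_id` is the tokenizer's
    mask-token id.
    """
    device = prompt_ids.device
    # Append a fully masked generation region after the prompt.
    x = torch.cat(
        [prompt_ids, torch.full((1, gen_len), mask_id, device=device)], dim=1
    )
    max_steps = (gen_len + force_k - 1) // force_k
    for _ in range(max_steps):
        masked = (x == mask_id)
        if not masked.any():
            break
        logits = model(x).logits                 # (1, seq_len, vocab)
        probs = logits.softmax(dim=-1)
        conf, pred = probs.max(dim=-1)           # per-position confidence / argmax
        conf = conf.masked_fill(~masked, -1.0)   # only consider still-masked slots
        k = min(force_k, int(masked.sum()))
        idx = conf.topk(k, dim=-1).indices       # force k parallel commitments
        x[0, idx[0]] = pred[0, idx[0]]
    return x[:, prompt_ids.shape[1]:]
```

Under this sketch, setting `force_k=1` recovers the sequential, AR-like behavior the abstract criticizes, while larger values force genuinely parallel multi-token updates; the data curation described above is what is claimed to make such updates accurate.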