We present DMax, a new paradigm for efficient diffusion language models (dLLMs). It mitigates error accumulation in parallel decoding, enabling aggressive decoding parallelism while preserving generation quality. Unlike conventional masked dLLMs, which decode through a binary mask-to-token transition, DMax reformulates decoding as a progressive self-refinement from mask embeddings to token embeddings. At the core of our approach is On-Policy Uniform Training, a novel training strategy that efficiently unifies masked and uniform dLLMs, equipping the model to recover clean tokens from both masked inputs and its own erroneous predictions. Building on this foundation, we further propose Soft Parallel Decoding, which represents each intermediate decoding state as an interpolation between the predicted token embedding and the mask embedding, enabling iterative self-revision in embedding space. Extensive experiments across a variety of benchmarks demonstrate the effectiveness of DMax. Compared with the original LLaDA-2.0-mini, our method improves tokens per forward pass (TPF) on GSM8K from 2.04 to 5.47 while preserving accuracy. On MBPP, it increases TPF from 2.71 to 5.86 while maintaining comparable performance. On two H200 GPUs, our model achieves an average of 1,338 tokens per second (TPS) at batch size 1. Code is available at: https://github.com/czg1225/DMax
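To make the interpolation idea concrete, the following is a minimal illustrative sketch in plain Python, not the released DMax implementation. It assumes that each position carries a scalar weight in [0, 1] (for example, a prediction confidence) that controls how far the state has moved from the mask embedding toward the predicted token embedding. The names `soft_state` and `soft_states`, the weighting scheme, and the toy three-dimensional embeddings are all hypothetical and chosen only for illustration.

```python
from typing import List, Sequence

Vector = List[float]


def soft_state(token_emb: Sequence[float],
               mask_emb: Sequence[float],
               weight: float) -> Vector:
    """Linearly interpolate between the mask embedding and a token embedding.

    weight = 0.0 returns the mask embedding (position still fully masked);
    weight = 1.0 returns the predicted token embedding (position committed).
    The weight is an assumption of this sketch, e.g. a prediction confidence.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    if len(token_emb) != len(mask_emb):
        raise ValueError("embeddings must have the same dimension")
    return [weight * t + (1.0 - weight) * m
            for t, m in zip(token_emb, mask_emb)]


def soft_states(token_embs: Sequence[Sequence[float]],
                mask_emb: Sequence[float],
                weights: Sequence[float]) -> List[Vector]:
    """Apply the interpolation independently to every sequence position."""
    if len(token_embs) != len(weights):
        raise ValueError("need exactly one weight per position")
    return [soft_state(t, mask_emb, w) for t, w in zip(token_embs, weights)]


# Toy example: two positions with hypothetical 3-dimensional embeddings.
mask_emb = [0.0, 0.0, 1.0]
predicted = [[1.0, 0.0, 0.0],   # position 0: confident prediction
             [0.0, 1.0, 0.0]]   # position 1: uncertain prediction
weights = [1.0, 0.25]

states = soft_states(predicted, mask_emb, weights)
# Position 0 equals its token embedding; position 1 stays close to the mask.
```

In an iterative decoder, states like these would be fed back to the model as the next step's input, so a low-weight position remains revisable instead of being committed through a hard mask-to-token switch. How DMax actually sets the interpolation weight is not specified in the abstract.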