Diffusion Language Models (DLMs) offer a promising alternative for language modeling by enabling parallel decoding through iterative refinement. However, most DLMs rely on hard binary masking and discrete token assignments, which hinder the revision of early decisions and underutilize intermediate probabilistic representations. In this paper, we propose EvoToken-DLM, a diffusion-based language modeling approach that replaces hard binary masks with evolving soft token distributions. EvoToken-DLM enables a progressive transition from masked states to discrete outputs, supporting revisable decoding. To support this evolution effectively, we introduce continuous trajectory supervision, which aligns the training objective with iterative probabilistic updates. Extensive experiments across multiple benchmarks show that EvoToken-DLM consistently outperforms strong diffusion-based and masked DLM baselines. Project webpage: https://aim-uofa.github.io/EvoTokenDLM.
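The abstract does not specify the update rule for the evolving soft token distributions; the following is a minimal toy sketch of the general idea only, assuming a simple convex-blend schedule from an all-mask state toward model predictions (the vocabulary, the `toy_model` stand-in, and the schedule are illustrative assumptions, not the paper's actual method):

```python
import numpy as np

# Hypothetical toy setup -- none of these constants come from the paper.
VOCAB = 6        # toy vocabulary size
SEQ_LEN = 4      # sequence length
STEPS = 8        # number of refinement steps
MASK_ID = 0      # reserve id 0 as the [MASK] token

rng = np.random.default_rng(0)

def toy_model(dists):
    """Stand-in for the denoising network: returns one predicted
    probability distribution per position (softmax of random logits)."""
    logits = rng.standard_normal((SEQ_LEN, VOCAB))
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Start from a hard mask: all probability mass on [MASK].
x = np.zeros((SEQ_LEN, VOCAB))
x[:, MASK_ID] = 1.0

target = toy_model(x)  # fixed prediction for this sketch

# Evolve soft distributions: each step is a convex blend toward the
# prediction, so intermediate states stay valid distributions and
# early choices remain revisable until the final step.
for t in range(1, STEPS + 1):
    alpha = t / STEPS               # schedule: 0 -> 1 over the trajectory
    x = (1 - alpha) * x + alpha * target

# Only at the end are distributions collapsed to discrete tokens.
tokens = x.argmax(axis=-1)
```

Because every intermediate state is a full distribution rather than a committed token, probability mass can shift between candidates across steps; discretization happens once, at the end of the trajectory.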