Semantic DLM+: Improving Diffusion Language Models through Bias-variance Trade-off in Transition Kernel Design

Diffusion Language Models (DLMs) have demonstrated strong scaling capacity as alternatives to autoregressive language models. However, their performance is highly sensitive to the choice of transition kernel, and a poorly designed kernel can cause training instability, slow convergence, and biased sampling. In this paper, we study this sensitivity through a principled analysis of generalization error and identify three critical factors: asymptotic bias (the difficulty of approximating the posterior distribution), exposure bias (error propagation during sampling), and optimization variance induced by kernel dispersion. We then compare transition kernels along these factors: masking diffusion yields sparse, easier posterior-approximation targets, while uniform diffusion provides stronger sampling-side repair but induces a harder approximation problem. Motivated by this trade-off, we revisit a previously overlooked variant, semantic DLM (SemDLM), whose transition kernel corrupts each token to semantically similar neighbors. Our theory suggests that SemDLM can serve as a middle ground: it reduces the posterior-approximation difficulty of uniform diffusion while retaining its repair ability. However, we find that SemDLM suffers from a semantic basin problem, in which sampling remains trapped within one semantic region and produces low-diversity text. To address this, we propose SemDLM+, which adds a global transition and a semantic-frequency penalty during sampling. Experiments on LM1B and OpenWebText show that SemDLM+ improves training dynamics and achieves competitive language modeling and generation quality with satisfactory diversity.
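To make the three kernel families concrete, the following is a minimal sketch of single-step transition matrices over a toy vocabulary. Everything in it is an illustrative assumption rather than the paper's construction: the function names, the corruption rate `beta`, the random embeddings, the top-`k` softmax neighborhood used for the semantic kernel, and the use of mean row entropy as a rough proxy for kernel dispersion.

```python
import numpy as np


def masking_kernel(vocab_size, beta):
    """Each token moves to an extra [MASK] state with probability beta.

    The [MASK] state is the last index and is absorbing.
    """
    K = np.zeros((vocab_size + 1, vocab_size + 1))
    K[:vocab_size, :vocab_size] = (1.0 - beta) * np.eye(vocab_size)
    K[:vocab_size, vocab_size] = beta
    K[vocab_size, vocab_size] = 1.0
    return K


def uniform_kernel(vocab_size, beta):
    """With probability beta, resample the token uniformly over the vocabulary."""
    uniform = np.full((vocab_size, vocab_size), 1.0 / vocab_size)
    return (1.0 - beta) * np.eye(vocab_size) + beta * uniform


def semantic_kernel(embeddings, beta, k, tau=1.0):
    """With probability beta, move to one of the k nearest neighbors.

    Neighbors are chosen by cosine similarity and weighted by a softmax
    with temperature tau (an assumed, hypothetical parameterization).
    Requires k <= vocab_size - 1.
    """
    vocab_size = embeddings.shape[0]
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)  # a token is not its own neighbor
    neighbors = np.argsort(-sim, axis=1)[:, :k]
    rows = np.arange(vocab_size)[:, None]
    logits = sim[rows, neighbors] / tau
    weights = np.exp(logits - logits.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    S = np.zeros((vocab_size, vocab_size))
    S[rows, neighbors] = weights
    return (1.0 - beta) * np.eye(vocab_size) + beta * S


def mean_row_entropy(K):
    """Average Shannon entropy (in nats) of the rows of a stochastic matrix."""
    safe = np.clip(K, 1e-12, 1.0)
    return float(-(K * np.log(safe)).sum(axis=1).mean())


rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(50, 16))  # hypothetical token embeddings

kernels = {
    "masking": masking_kernel(50, beta=0.5),
    "uniform": uniform_kernel(50, beta=0.5),
    "semantic": semantic_kernel(toy_embeddings, beta=0.5, k=5),
}
for name, K in kernels.items():
    print(name, "mean row entropy:", round(mean_row_entropy(K), 3))
```

In this toy setting the semantic kernel spreads probability mass over only `k` neighbors per token, so its rows are more dispersed than those of the masking kernel and less dispersed than those of the uniform kernel, which mirrors the middle-ground role the abstract attributes to SemDLM.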
