Adaptation to Intrinsic Dependence in Diffusion Language Models

Diffusion language models (DLMs) have recently emerged as a promising alternative to autoregressive (AR) approaches, enabling parallel token generation beyond a rigid left-to-right order. Despite growing empirical success, there is limited theoretical understanding of how unmasking schedules, which specify the order and number of tokens unmasked at each sampling step, affect generation quality. In this work, we introduce a distribution-agnostic unmasking schedule for DLMs that adapts to the (unknown) dependence structure of the target data distribution, without requiring any prior knowledge or hyperparameter tuning. In contrast to prior deterministic procedures that fix unmasking sizes, our method randomizes the number of tokens revealed at each iteration. We show that, for two specific parameter choices, the sampling convergence guarantees, measured by Kullback-Leibler (KL) divergence, scale as $\widetilde O(\mathsf{TC}/K)$ and $\widetilde O(\mathsf{DTC}/K)$, respectively. Here, $K$ is the number of iterations, and $\mathsf{TC}$ and $\mathsf{DTC}$ are the total correlation and dual total correlation of the target distribution, which capture the intrinsic dependence structure underlying the data. Importantly, our guarantees hold in the practically relevant parallel-sampling regime $K<L$, where $L$ is the token sequence length. These results significantly improve upon prior convergence theories and yield substantial sampling acceleration for low-complexity distributions. Overall, our findings unveil the adaptivity of DLMs to intrinsic data structures and shed light on the benefit of randomized unmasking sizes in inference schedule design.
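
To make the two dependence measures concrete, the following is a minimal sketch that computes $\mathsf{TC}$ and $\mathsf{DTC}$ for a small joint distribution over token sequences. It assumes the standard information-theoretic definitions, $\mathsf{TC} = \sum_i H(X_i) - H(X_1,\dots,X_L)$ and $\mathsf{DTC} = H(X_1,\dots,X_L) - \sum_i H(X_i \mid X_{-i})$, which the abstract does not spell out. The function names (`entropy`, `marginal`, `total_correlation`, `dual_total_correlation`) and the two toy distributions are hypothetical and chosen only for illustration; this is not the paper's sampling procedure.

```python
import itertools
import math
from collections import defaultdict


def entropy(pmf):
    """Shannon entropy in nats of a dict {outcome: probability}."""
    return -sum(p * math.log(p) for p in pmf.values() if p > 0)


def marginal(joint, keep):
    """Marginalize a joint pmf over token tuples onto the positions in `keep`."""
    out = defaultdict(float)
    for x, p in joint.items():
        out[tuple(x[i] for i in keep)] += p
    return out


def total_correlation(joint, L):
    """TC = sum_i H(X_i) - H(X_1, ..., X_L) (standard definition, assumed)."""
    return sum(entropy(marginal(joint, [i])) for i in range(L)) - entropy(joint)


def dual_total_correlation(joint, L):
    """DTC = H(X) - sum_i H(X_i | X_{-i}), using H(X_i | X_{-i}) = H(X) - H(X_{-i})."""
    h = entropy(joint)
    others = lambda i: [j for j in range(L) if j != i]
    return h - sum(h - entropy(marginal(joint, others(i))) for i in range(L))


# Toy example 1 (hypothetical): three independent fair bits.
independent = {x: 1 / 8 for x in itertools.product((0, 1), repeat=3)}
# Toy example 2 (hypothetical): one fair bit copied to all three positions.
copied = {(0, 0, 0): 0.5, (1, 1, 1): 0.5}

for name, joint in [("independent", independent), ("copied", copied)]:
    print(name, total_correlation(joint, 3), dual_total_correlation(joint, 3))
```

For independent tokens both quantities vanish. For the fully copied sequence, $\mathsf{TC} = (L-1)\log 2$ grows with the length while $\mathsf{DTC} = \log 2$ stays constant, which illustrates why the two bounds can differ substantially on the same distribution.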
