Discrete diffusion models have recently shown great promise for modeling complex discrete data, with masked diffusion models (MDMs) offering a compelling trade-off between sample quality and generation speed. MDMs denoise by progressively unmasking multiple dimensions at a time, starting from an all-masked input, but their performance can degrade when few denoising steps are used because dimensions unmasked in the same step are modeled with limited inter-dimensional dependencies. In this paper, we propose Variational Autoencoding Discrete Diffusion (VADD), a novel framework that enhances discrete diffusion with latent variable modeling to implicitly capture correlations among dimensions. By introducing an auxiliary recognition model, VADD enables stable training by maximizing a variational lower bound, with amortized inference over the training set. Our approach retains the efficiency of traditional MDMs while significantly improving sample quality, especially when the number of denoising steps is small. Empirical results on 2D toy data, pixel-level image generation, and text generation demonstrate that VADD consistently outperforms MDM baselines in sample quality with few denoising steps.
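The dependency problem described above can be illustrated with a minimal toy sketch. This is not the VADD model or any trained network: the two-dimensional binary data (both coordinates always equal, each value equally likely), the three sampler functions, and the helper `agreement_rate` are all hypothetical names and assumptions chosen only to show why unmasking several dimensions in one step can lose correlations, and how conditioning on a shared latent variable can restore them.

```python
import random

# Toy data assumption: x = (x1, x2) with x1 == x2 always, and x1 uniform on {0, 1}.
# The marginals of x1 and x2 are therefore both uniform, but the pair is fully correlated.

def one_step_factorized(rng):
    """Unmask both dimensions in a single step.

    Each dimension is drawn independently from its own marginal,
    so the correlation between x1 and x2 is lost.
    """
    return (rng.randint(0, 1), rng.randint(0, 1))

def two_step_sequential(rng):
    """Unmask one dimension per step.

    The second dimension is drawn from p(x2 | x1), which is
    deterministic for this toy distribution.
    """
    x1 = rng.randint(0, 1)
    x2 = x1
    return (x1, x2)

def one_step_with_latent(rng):
    """Unmask both dimensions in a single step, conditioned on a latent z.

    Given z, the dimensions are conditionally independent (here each is
    simply equal to z), yet the marginal over z keeps them correlated.
    """
    z = rng.randint(0, 1)
    return (z, z)

def agreement_rate(sampler, n=10000, seed=0):
    """Fraction of samples with x1 == x2 (equals 1.0 under the true data)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        x1, x2 = sampler(rng)
        hits += int(x1 == x2)
    return hits / n

if __name__ == "__main__":
    for sampler in (one_step_factorized, two_step_sequential, one_step_with_latent):
        print(sampler.__name__, agreement_rate(sampler))
```

In this toy setting the single-step factorized sampler matches the data only about half the time, while the sequential sampler and the latent-conditioned single-step sampler always produce valid pairs. The latent-conditioned case needs only one unmasking step, which is the kind of trade-off the abstract attributes to latent variable modeling.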