One of the most compelling features of discrete diffusion language models is their bidirectional modeling of global context. However, existing block-based diffusion studies introduce autoregressive priors that, while beneficial, can cause models to lose this global coherence at the macro level. To regain global contextual understanding while preserving the advantages of the semi-autoregressive paradigm, we propose Diffusion in Diffusion, a 'draft-then-refine' framework designed to overcome the irreversibility and myopia inherent in block diffusion models. Our approach first employs block diffusion with small blocks to generate a rapid draft, then refines that draft through global bidirectional diffusion with a larger bidirectional receptive field. We use snapshot confidence remasking to identify the tokens most in need of revision, and apply mix-scale training to extend the block diffusion model's global capabilities. Empirical results show that our approach sets a new state of the art for discrete diffusion models on the OpenWebText dataset: using only 26% of the fine-tuning budget of baseline models, we reduce generative perplexity from 25.7 to 21.9, significantly narrowing the performance gap with autoregressive models.
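To make the two-stage pipeline concrete, the following is a minimal sketch of the draft-then-refine loop, assuming a masked discrete diffusion model that returns per-position vocabulary logits. All names here (`block_diffusion_sample`, `snapshot_confidence_remask`, `global_refine`, `MASK_ID`, `remask_ratio`) are illustrative assumptions, not the paper's actual API, and the one-token-per-step unmasking schedule is the simplest possible choice rather than the method's real sampler.

```python
# Sketch of 'draft-then-refine': block diffusion draft -> snapshot
# confidence remasking -> global bidirectional refinement.
# `model(seq)` is assumed to return logits of shape (1, length, vocab).
import torch

MASK_ID = 0        # hypothetical [MASK] token id
DRAFT_BLOCK = 32   # small block size for the fast draft stage

@torch.no_grad()
def block_diffusion_sample(model, length, block_size=DRAFT_BLOCK):
    """Stage 1: semi-autoregressive draft. Blocks are denoised left to
    right; within a block, tokens are unmasked one per step by confidence."""
    seq = torch.full((1, length), MASK_ID, dtype=torch.long)
    for start in range(0, length, block_size):
        end = min(start + block_size, length)
        for _ in range(end - start):
            logits = model(seq)
            probs = logits[:, start:end].softmax(-1)
            conf, pred = probs.max(-1)
            # Only consider positions that are still masked.
            conf = conf.masked_fill(seq[:, start:end] != MASK_ID, -1.0)
            pos = conf.argmax(-1)
            seq[0, start + pos] = pred[0, pos]
    return seq

@torch.no_grad()
def snapshot_confidence_remask(model, seq, remask_ratio=0.1):
    """Take a confidence snapshot under full bidirectional context and
    re-mask the least confident tokens so the refiner can revise them."""
    logits = model(seq)
    conf = logits.softmax(-1).gather(-1, seq.unsqueeze(-1)).squeeze(-1)
    k = int(remask_ratio * seq.size(1))
    _, low_idx = conf.topk(k, largest=False)  # least confident positions
    refined = seq.clone()
    refined[0, low_idx[0]] = MASK_ID
    return refined

@torch.no_grad()
def global_refine(model, seq):
    """Stage 2: global bidirectional diffusion fills the re-masked
    positions using the entire sequence as context."""
    while (masked := seq == MASK_ID).any():
        probs = model(seq).softmax(-1)
        conf, pred = probs.max(-1)
        conf = conf.masked_fill(~masked, -1.0)
        pos = conf.argmax(-1)  # unmask the highest-confidence position
        seq[0, pos] = pred[0, pos]
    return seq
```

A usage pass would chain the three stages, e.g. `global_refine(model, snapshot_confidence_remask(model, block_diffusion_sample(model, 1024)))`; the remask ratio controls how much of the draft the irreversibility-prone block stage is allowed to keep.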