Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to purely autoregressive language models because they can decode multiple tokens in parallel. However, state-of-the-art block-wise dLLMs rely on a "remasking" mechanism that decodes only the most confident tokens at each step and discards the rest, effectively wasting the computation spent on them. We show that recycling the computation behind the discarded tokens is beneficial, because these tokens retain contextual information that is useful for subsequent decoding iterations. In light of this, we propose Residual Context Diffusion (RCD), a module that converts the discarded token representations into contextual residuals and injects them back into the model for the next denoising step. RCD uses a decoupled two-stage training pipeline to bypass the memory bottlenecks associated with backpropagation. We validate our method on both long CoT reasoning (SDAR) and short CoT instruction-following (LLaDA) models, and find that a standard dLLM can be efficiently converted to the RCD paradigm with only ~300 million tokens. Across a wide range of benchmarks, RCD consistently improves frontier dLLMs by 4-11 percentage points in accuracy with minimal additional computational overhead. Notably, on the most challenging AIME tasks, RCD nearly doubles baseline accuracy and needs up to 4-5x fewer denoising steps to reach the baseline's peak accuracy.
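To make the remasking step and the idea of recycling discarded predictions concrete, the following is a minimal sketch in plain Python. It is not the RCD implementation: the abstract does not specify how residuals are computed or injected, so the probability-weighted embedding mix in `residual_inputs` is purely an assumption chosen for illustration, and the names `MASK`, `remask_step`, `residual_inputs`, and the toy logits and embeddings are all hypothetical.

```python
import math

MASK = -1  # hypothetical id marking a position that is still masked


def softmax(logits):
    """Numerically stable softmax over one position's vocabulary logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def remask_step(block_logits, tokens, keep):
    """Commit the `keep` most confident masked positions and remask the rest.

    Returns the updated token list and, for every position that stays
    masked, the predicted distribution that plain remasking would discard.
    """
    probs = {i: softmax(block_logits[i])
             for i, t in enumerate(tokens) if t == MASK}
    # Confidence of a position = probability of its most likely token.
    ranked = sorted(probs, key=lambda i: max(probs[i]), reverse=True)
    committed = set(ranked[:keep])
    new_tokens = list(tokens)
    discarded = {}
    for i, p in probs.items():
        if i in committed:
            new_tokens[i] = p.index(max(p))  # decode the argmax token
        else:
            discarded[i] = p  # position stays masked; keep its prediction
    return new_tokens, discarded


def residual_inputs(discarded, embeddings, mask_embedding):
    """ASSUMPTION (not from the source): form each residual as the
    probability-weighted mix of token embeddings and add it to the mask
    embedding used as input at the next denoising step."""
    dim = len(mask_embedding)
    out = {}
    for i, p in discarded.items():
        mix = [sum(p[v] * embeddings[v][d] for v in range(len(p)))
               for d in range(dim)]
        out[i] = [m + r for m, r in zip(mask_embedding, mix)]
    return out


# Toy block: 3 masked positions, vocabulary of 3 tokens, 2-d embeddings.
logits = [[4.0, 0.0, 0.0], [0.2, 0.1, 0.0], [0.0, 3.0, 0.0]]
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mask_embedding = [0.0, 0.0]

tokens, discarded = remask_step(logits, [MASK, MASK, MASK], keep=2)
next_inputs = residual_inputs(discarded, embeddings, mask_embedding)
```

In this toy block, positions 0 and 2 are committed because their top-token probabilities are highest, while position 1 stays masked. Under standard remasking its distribution would simply be dropped; in the sketch it is carried forward as an extra input for the next step, which is the general kind of reuse the abstract describes.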