Diffusion Language Models (DLMs) generate text by iteratively denoising masked token sequences, offering a tradeoff between parallelism and quality relative to autoregressive models. In current practice, the number of tokens decoded per step is controlled by a confidence threshold, and quality degrades monotonically as more tokens are denoised per step. We introduce Multi-token Residual Prediction (MRP), a lightweight module that enables dependency-aware multi-token denoising within a single backbone forward pass. MRP exploits a key property of the denoising process: the logit distributions at adjacent denoising steps are remarkably similar. Rather than running the backbone a second time to obtain the next-step logits, MRP predicts the residual between steps from the backbone's hidden states, effectively denoising more tokens per backbone forward pass at a fraction of the cost. We apply MRP across the two operating regimes of DLM decoding. In the high-quality, low-throughput static denoising regime, MRP serves as a drafter for speculative decoding: its proposals are verified against the backbone, yielding lossless acceleration of up to 1.4x in SGLang. In the low-quality, high-throughput dynamic denoising regime, MRP instead drives a remasking scheme that revokes over-eager reveals, recovering most of the accuracy lost to aggressive low-threshold decoding and improving accuracy by up to 22.6 points on the code generation task HumanEval and 17.7 points on the reasoning task GSM8K.
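To make the residual idea concrete, the following is a minimal sketch in pure Python, not the authors' implementation. It assumes that next-step logits can be approximated as the current logits plus a predicted residual, and that tokens are revealed when their top probability clears a confidence threshold. The names `confident_positions`, `decode_two_steps`, and `toy_residual` are hypothetical, and `toy_residual` is a hand-written stand-in for the learned module, which in the paper predicts the residual from the backbone's hidden states.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def confident_positions(logits_per_pos, masked, threshold):
    """Map each masked position whose top probability >= threshold to its argmax token."""
    picked = {}
    for pos in sorted(masked):
        probs = softmax(logits_per_pos[pos])
        tok = max(range(len(probs)), key=probs.__getitem__)
        if probs[tok] >= threshold:
            picked[pos] = tok
    return picked


def decode_two_steps(logits_per_pos, residual_fn, threshold):
    """Reveal tokens for two denoising steps using one set of backbone logits.

    Step 1 uses the backbone logits directly. Step 2 uses the approximation
    next_logits = logits + residual, so no second backbone call is made.
    """
    masked = set(range(len(logits_per_pos)))
    first = confident_positions(logits_per_pos, masked, threshold)
    masked -= set(first)

    # Assumption: residual_fn stands in for the learned residual predictor.
    residual = residual_fn(logits_per_pos, first)
    next_logits = [
        [l + r for l, r in zip(row, res)]
        for row, res in zip(logits_per_pos, residual)
    ]
    second = confident_positions(next_logits, masked, threshold)
    return first, second


def toy_residual(logits_per_pos, revealed):
    """Hypothetical residual: revealing position 0 sharpens token 1 at position 1."""
    residual = [[0.0] * len(row) for row in logits_per_pos]
    if 0 in revealed:
        residual[1][1] = 5.0
    return residual


# Three masked positions over a three-token vocabulary (made-up numbers).
logits = [[4.0, 0.0, 0.0], [1.0, 0.8, 0.0], [0.0, 0.0, 0.1]]
first, second = decode_two_steps(logits, toy_residual, threshold=0.9)
print(first, second)
```

In this toy run, only position 0 is confident enough to be revealed from the backbone logits. The residual then raises the confidence at position 1 so it can be revealed without a second backbone call, while position 2 stays masked. In the static regime described above, such second-step proposals would still be verified against the backbone rather than accepted outright.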