Fast-dLLM++: Fréchet Profile Decoding for Faster Diffusion LLM Inference

From arXiv. Initial version accepted at the Workshop on Structured Probabilistic Inference & Generative Modeling, ICML 2026. Project page: https://ringo-star.github.io/projectpage_frechet/

Diffusion large language models promise parallel token generation, yet inference remains bottlenecked by deciding which masked tokens can be safely committed together. Fast-dLLM addressed this with KV caching and confidence-guided parallel decoding, but its decoding theory relies on a homogeneous high-confidence assumption that effectively reduces each candidate set to its weakest selected token. We argue that this leaves speed on the table because real decoding steps exhibit heterogeneous confidence profiles. We propose \textbf{Fast-dLLM++}, a training-free extension that introduces \emph{Fréchet profile decoding}: selecting parallel commit sets from the full sorted confidence profile rather than from a single worst-case confidence. The resulting rule is a heterogeneous-confidence generalization of Fast-dLLM's factor selector: it recovers the previous rule exactly in the equal-confidence case and adds a provable \emph{heterogeneity bonus} when the selected tokens have uneven confidences. Fast-dLLM++ leaves the model, diffusion process, and cache implementation entirely unchanged, making it a drop-in replacement for existing Fast-dLLM decoding. Experiments on GSM8K, MATH, HumanEval, and MBPP with the LLaDA-8B model show that the theoretical improvement translates directly into empirical gains: profile-aware selection improves the accuracy--throughput frontier by exploiting safe parallelism that weakest-token rules miss, achieving up to 37\% higher throughput at comparable accuracy. Our code is available at https://github.com/Ringo-Star/FastdLLM_plusplus.
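To make the contrast between a weakest-token rule and a profile-aware rule concrete, the following is a minimal sketch and not the authors' implementation. It assumes that "Fréchet profile" refers to the Fréchet lower bound on a joint probability, P(all n tokens correct) >= 1 - sum_i (1 - c_i), and that a commit set is accepted while its bounded error stays under a tolerance. The names `weakest_token_count`, `profile_commit_count`, and `budget` are hypothetical, and the exact selectors in Fast-dLLM and Fast-dLLM++ may differ in constants or in the strictness of the inequality.

```python
from typing import Sequence


def weakest_token_count(confidences: Sequence[float], budget: float) -> int:
    """Homogeneous rule (assumed form): charge every selected token
    the error of the weakest selected token, i.e. n * (1 - c_n)."""
    profile = sorted(confidences, reverse=True)  # most confident first
    best = min(1, len(profile))  # commit at least one token if any exist
    for n, c in enumerate(profile, start=1):
        if n * (1.0 - c) > budget:
            break  # the charge is non-decreasing in n, so stop here
        best = n
    return best


def profile_commit_count(confidences: Sequence[float], budget: float) -> int:
    """Heterogeneous rule (assumed form): charge each selected token
    its own error, i.e. sum_i (1 - c_i), as in the Fréchet bound."""
    profile = sorted(confidences, reverse=True)
    best = min(1, len(profile))
    total_error = 0.0
    for n, c in enumerate(profile, start=1):
        total_error += 1.0 - c
        if total_error > budget:
            break  # the running sum is non-decreasing in n
        best = n
    return best


if __name__ == "__main__":
    uneven = [0.99, 0.98, 0.95, 0.80, 0.60]  # illustrative values only
    print(weakest_token_count(uneven, budget=0.3))
    print(profile_commit_count(uneven, budget=0.3))
```

With the illustrative profile above and a tolerance of 0.3, the weakest-token rule commits 3 tokens (the fourth would be charged 4 x 0.20 = 0.80), while the summed rule commits 4 (0.01 + 0.02 + 0.05 + 0.20 = 0.28). When all confidences are equal the two charges coincide, which mirrors the abstract's statement that the previous rule is recovered in the equal-confidence case. Because the summed error never exceeds n times the largest error, the profile-aware count is never smaller in this sketch.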
