Diffusion Large Language Models (DLLMs) offer a compelling alternative to autoregressive models, but their deployment is constrained by high decoding cost. In this work, we identify a key inefficiency in DLLM decoding: although computation is parallelized over token blocks, only a small subset of tokens is decodable at each diffusion step, so most compute is wasted on non-decodable tokens. We further observe a strong correlation between attention-derived token importance and token-wise decoding probability. Based on this insight, we propose FOCUS, an inference system designed for DLLMs. By dynamically focusing computation on decodable tokens and evicting non-decodable ones on the fly, FOCUS increases the effective batch size, alleviating compute limitations and enabling scalable throughput. Empirical evaluations demonstrate that FOCUS achieves up to 3.52$\times$ higher throughput than the production-grade engine LMDeploy in large-batch settings, while preserving or improving generation quality across multiple benchmarks.
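The idea of focusing computation on likely-decodable tokens can be illustrated with a minimal sketch. It is not the FOCUS implementation: the function names, the `keep_ratio` parameter, and the example scores are all hypothetical, and the sketch simply assumes that a per-token importance score (for instance, attention mass received by each position) is already available and that higher scores indicate tokens more likely to be decoded at the current step.

```python
import math
from typing import List, Tuple


def select_focus_tokens(
    importance: List[float], keep_ratio: float
) -> Tuple[List[int], List[int]]:
    """Split block positions into kept and evicted sets by importance.

    importance: hypothetical per-token scores for one block
                (assumed to be derived from attention).
    keep_ratio: assumed fraction of the block to keep computing on.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    n = len(importance)
    k = math.ceil(n * keep_ratio)
    # Rank positions by descending score; ties go to the lower position.
    ranked = sorted(range(n), key=lambda i: (-importance[i], i))
    kept = sorted(ranked[:k])
    evicted = sorted(ranked[k:])
    return kept, evicted


def wasted_compute_fraction(block_size: int, decoded_per_step: int) -> float:
    """Fraction of a block's per-step compute spent on tokens not decoded."""
    if block_size <= 0 or not 0 <= decoded_per_step <= block_size:
        raise ValueError("need 0 <= decoded_per_step <= block_size, block_size > 0")
    return 1.0 - decoded_per_step / block_size


if __name__ == "__main__":
    # Made-up scores for an 8-token block, for illustration only.
    scores = [0.05, 0.40, 0.02, 0.30, 0.03, 0.10, 0.06, 0.04]
    kept, evicted = select_focus_tokens(scores, keep_ratio=0.25)
    print("kept:", kept, "evicted:", evicted)
    print("wasted fraction:", wasted_compute_fraction(32, 4))
```

Under these assumptions, evicting low-importance positions shrinks the per-request token count at each step, which is what would let more requests share the same compute budget, in the sense of the larger effective batch size described above.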