Diffusion Large Language Models (dLLMs) deliver strong long-context processing in a non-autoregressive decoding paradigm. However, the considerable computational cost of bidirectional full attention limits inference efficiency. Although sparse attention is promising, existing methods remain ineffective for dLLMs: they must estimate attention importance for tokens that have not yet been decoded, while the positions of unmasked tokens are unknown during diffusion. In this paper, we present Focus-dLLM, a novel training-free attention sparsification framework tailored for accurate and efficient long-context dLLM inference. Based on the finding that token confidence correlates strongly across adjacent steps, we first design a past confidence-guided indicator that predicts unmasked regions. Built upon this, we propose a sink-aware pruning strategy that accurately estimates and removes redundant attention computation while preserving highly influential attention sinks. To further reduce overhead, this strategy reuses identified sink locations across layers, leveraging the observed cross-layer consistency. Experimental results show that our method achieves more than a $29\times$ lossless speedup at a $32K$ context length. The code is publicly available at: https://github.com/Longxmas/Focus-dLLM
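The two key steps described above — predicting soon-to-be-unmasked regions from the previous step's confidence, and building a sparse key mask that keeps only those regions plus attention sinks — can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the helper names (`predict_unmasked_regions`, `build_sparse_mask`), the top-k selection rule, and the fixed sink positions are all assumptions for exposition.

```python
import numpy as np

def predict_unmasked_regions(prev_confidence, masked, k):
    """Hypothetical indicator: reuse the previous diffusion step's per-token
    confidence (which the paper observes correlates strongly across adjacent
    steps) to guess which k masked positions will be decoded next."""
    # Only still-masked positions are candidates; others get -inf.
    scores = np.where(masked, prev_confidence, -np.inf)
    # Return the indices of the k most confident masked positions.
    return np.argsort(scores)[-k:]

def build_sparse_mask(seq_len, predicted, sink_positions):
    """Keep attention only to predicted-unmasked regions and sink tokens;
    all remaining key positions are pruned (illustrative only)."""
    keep = np.zeros(seq_len, dtype=bool)
    keep[np.asarray(predicted)] = True
    keep[np.asarray(sink_positions)] = True
    return keep
```

In a real kernel the boolean mask would be applied per attention head before the softmax, and (per the cross-layer consistency observation) the sink positions identified at one layer would be reused by subsequent layers instead of being re-estimated.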