Diffusion Language Models (DLMs) promise highly parallel text generation, yet their practical inference speed is often bottlenecked by suboptimal decoding schedulers. Standard approaches rely on "scattered acceptance": committing high-confidence tokens at disjoint positions throughout the sequence. This approach inadvertently fractures the Key-Value (KV) cache, destroys memory locality, and forces the model into costly, repeated repairs across unstable token boundaries. To resolve this, we present the Longest Stable Prefix (LSP) scheduler, a training-free and model-agnostic inference paradigm based on monolithic prefix absorption. In each denoising step, LSP evaluates token stability via a single forward pass, dynamically identifies a contiguous, left-aligned block of stable predictions, and snaps its boundary to natural linguistic or structural delimiters before an atomic commitment. This prefix-first topology yields dual benefits: systemically, it converts fragmented KV cache updates into efficient, contiguous appends; algorithmically, it preserves bidirectional lookahead over a geometrically shrinking active suffix, drastically reducing token flip rates and denoiser calls. Extensive evaluations on LLaDA-8B and Dream-7B demonstrate that LSP accelerates inference by up to 3.4x across rigorous benchmarks, including mathematical reasoning, code generation, multilingual (CJK) tasks, and creative writing, while matching or slightly improving output quality. By fundamentally restructuring the commitment topology, LSP bridges the gap between the theoretical parallelism of DLMs and practical hardware efficiency.
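The commitment rule sketched in the abstract (find the longest left-aligned run of stable predictions, then snap the commit boundary back to a delimiter) can be illustrated with a minimal sketch. The threshold value, delimiter set, and function name below are illustrative assumptions, not the paper's actual API or hyperparameters.

```python
# Hypothetical sketch of the LSP commitment rule: the delimiter set and
# the 0.9 stability threshold are assumptions for illustration only.
DELIMITERS = {".", ",", ";", ":", "\n", " "}  # assumed "natural delimiters"

def longest_stable_prefix(tokens, confidences, threshold=0.9):
    """Return the number of leading tokens to commit atomically.

    tokens      -- decoded token strings in the active (uncommitted) window
    confidences -- per-token model confidence in [0, 1]
    threshold   -- stability cutoff (assumed hyperparameter)
    """
    # 1. Find the longest contiguous, left-aligned run of stable predictions.
    k = 0
    while k < len(tokens) and confidences[k] >= threshold:
        k += 1
    if k == 0:
        return 0
    # 2. Snap the boundary back to the last delimiter inside the run,
    #    so the atomic commit never splits a word or structural unit.
    for i in range(k - 1, -1, -1):
        if tokens[i] in DELIMITERS or tokens[i].endswith(tuple(DELIMITERS)):
            return i + 1
    # No delimiter inside the stable run: commit nothing rather than
    # split a unit; the suffix stays active for the next denoising step.
    return 0
```

Everything up to the returned index would be appended to the KV cache as one contiguous block, while the remaining suffix stays active under bidirectional attention for the next denoising step.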