Large language model (LLM) serving is fundamentally limited by inefficient hardware utilization. Autoregressive (AR) decoding underutilizes GPUs because it executes strictly sequentially, whereas diffusion LLMs (DLLMs) improve throughput by decoding multiple tokens per iteration. However, fixed-block-size diffusion decoding is highly sensitive to load: large blocks exploit idle GPU resources under low load, but saturate early and incur substantial redundant computation under high load. As a result, throughput gains vanish beyond saturation, and no single decoding granularity performs well across dynamic serving workloads. We present Optimus, a serving system that enables elastic decoding for diffusion LLMs by dynamically adapting decoding granularity to runtime load. The key idea is to treat decoding granularity as a runtime control variable that balances GPU utilization against token efficiency. Optimus combines chunked decoding, which enables fine-grained execution without retraining, with saturation-aware scheduling, a closed-loop mechanism that selects chunk sizes based on runtime conditions. Together with system-level optimizations and customized attention kernels, these techniques improve throughput while preserving model accuracy. Experiments show that Optimus delivers up to 6.1x throughput improvement over AR decoding and 4.3x improvement over fixed-block diffusion LLM decoding, while maintaining stable performance across diverse load regimes and improving end-to-end serving capacity under latency constraints. The source code is available at https://github.com/dubcyfor3/Optimus.
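To make the idea of treating decoding granularity as a runtime control variable concrete, the following is a minimal sketch of a closed-loop chunk-size selector. It is not the Optimus implementation: the class `ChunkSizeController`, the candidate chunk sizes, the per-step token budget used as a stand-in for the saturation point, and the halve-or-grow update rule are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ChunkSizeController:
    """Hypothetical closed-loop chunk-size selector (illustrative only)."""

    candidates: tuple = (1, 2, 4, 8, 16, 32)  # assumed set of chunk sizes
    token_budget: int = 256   # assumed estimate of tokens per step before saturation
    min_budget: int = 32
    max_budget: int = 4096
    slack: float = 1.10       # tolerated latency growth over the unsaturated baseline

    def select(self, num_active_requests: int) -> int:
        """Return the largest chunk size whose batch-wide token count fits the budget."""
        n = max(1, num_active_requests)
        fitting = [c for c in self.candidates if c * n <= self.token_budget]
        # Fall back to the smallest chunk when even that exceeds the budget.
        return max(fitting) if fitting else min(self.candidates)

    def observe(self, step_latency_ms: float, baseline_latency_ms: float) -> None:
        """Feedback step: halve the budget after a slow step, otherwise grow it slightly."""
        if step_latency_ms > baseline_latency_ms * self.slack:
            self.token_budget = max(self.min_budget, self.token_budget // 2)
        else:
            self.token_budget = min(self.max_budget, self.token_budget + 32)


controller = ChunkSizeController()
low_load_chunk = controller.select(num_active_requests=4)     # 4 * 32 = 128 tokens fit
high_load_chunk = controller.select(num_active_requests=64)   # only 64 * 4 = 256 tokens fit
controller.observe(step_latency_ms=20.0, baseline_latency_ms=10.0)  # slow step shrinks budget
after_feedback_chunk = controller.select(num_active_requests=64)
print(low_load_chunk, high_load_chunk, after_feedback_chunk)  # prints: 32 4 2
```

Under these assumptions, the selector picks large chunks when few requests are active and falls back toward smaller chunks as the batch grows or as measured step latency indicates saturation, which mirrors the load-dependent trade-off between GPU utilization and token efficiency described above.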