Modern LLM services demand high throughput and stringent SLO guarantees across two distinct inference phases, prefill and decode, as well as across complex multi-turn workflows. However, current systems face a fundamental tradeoff: out-of-place compute partitioning enables per-phase SLO attainment, while in-place memory sharing maximizes throughput via KV cache reuse. Moreover, existing in-place compute partitioning suffers from low utilization and high overhead due to its phase-coupled design. We present Drift, a new LLM serving framework that resolves this tension via PD multiplexing, enabling in-place yet phase-decoupled compute partitioning. Drift leverages low-level GPU partitioning techniques to multiplex the prefill and decode phases spatially and adaptively on shared GPUs, while preserving in-place memory sharing. To fully exploit this multiplexing capability, Drift introduces an adaptive gang scheduling mechanism, a contention-free modeling method, and an SLO-aware dispatching policy. Evaluation shows that Drift achieves an average $5.1\times$ throughput improvement (up to $17.5\times$) over state-of-the-art baselines, while consistently meeting SLO targets under complex LLM workloads.