Modern LLM services demand high throughput and stringent SLO guarantees across two distinct inference phases, prefill and decode, as well as across complex multi-turn workflows. However, current systems face a fundamental tradeoff: out-of-place compute partitioning enables per-phase SLO attainment, while in-place memory sharing maximizes throughput via KV cache reuse. Moreover, existing in-place compute partitioning suffers from low utilization and high overhead due to its phase-coupled design. We present Drift, a new LLM serving framework that resolves this tension via PD multiplexing, enabling in-place yet phase-decoupled compute partitioning. Drift leverages low-level GPU partitioning techniques to multiplex the prefill and decode phases spatially and adaptively on shared GPUs, while preserving in-place memory sharing. To fully exploit this multiplexing capability, Drift introduces an adaptive gang scheduling mechanism, a contention-free modeling method, and an SLO-aware dispatching policy. Evaluation shows that Drift achieves an average $5.1\times$ throughput improvement (up to $17.5\times$) over state-of-the-art baselines, while consistently meeting SLO targets under complex LLM workloads.