Modern LLM services demand high throughput and stringent SLO guarantees across two distinct inference phases, prefill and decode, as well as across complex multi-turn workflows. However, current systems face a fundamental tradeoff: out-of-place compute partitioning enables per-phase SLO attainment, whereas in-place memory sharing maximizes throughput via KV cache reuse. Moreover, existing in-place compute partitioning suffers from low utilization and high overhead due to its phase-coupled design. We present Yoda, a new LLM serving framework that resolves this tension via PD multiplexing, enabling in-place, phase-decoupled compute partitioning. Yoda leverages low-level GPU partitioning techniques to multiplex the prefill and decode phases spatially and adaptively on shared GPUs, while preserving in-place memory sharing. To fully exploit this multiplexing capability, Yoda introduces an adaptive gang-scheduling mechanism, a contention-free modeling method, and an SLO-aware dispatching policy. Evaluation shows that Yoda achieves an average $5.1\times$ throughput improvement (up to $17.5\times$) over state-of-the-art baselines, while consistently meeting SLO targets under complex LLM workloads.