Not All Prefills Are Equal: PPD Disaggregation for Multi-turn LLM Serving

Prefill-Decode (PD) disaggregation has become the standard architecture for modern LLM inference engines because it reduces interference between two distinct workloads. With growing demand for multi-turn interactions in chatbots and agentic systems, we re-examine PD in this setting and find two fundamental inefficiencies: (1) every turn requires prefilling both the new prompt and the response from the previous turn, and (2) repeated KV transfers between prefill and decode nodes saturate bandwidth, leading to high latency and even service degradation. Our key insight is that not all prefill operations are equally disruptive: append-prefill, which processes only the new input tokens while reusing cached KV states, incurs an order-of-magnitude smaller decoding slowdown than full prefill. This motivates running append-prefill locally on decode nodes. However, through comprehensive analysis, we show that no single fixed routing strategy satisfies all Service Level Objectives (SLOs) simultaneously. Based on this finding, we propose Prefill Prefill-capable Decode (PPD) disaggregation, a dynamic routing system that decides when to process Turn 2+ requests locally on decode nodes using cached KV states. PPD adapts to varying SLOs via configurable weights and seamlessly integrates with traditional PD deployments. In extensive evaluations, PPD reduces Turn 2+ time-to-first-token (TTFT) by $\sim$68\% while maintaining competitive time-per-output-token (TPOT), and it alleviates KV transfer congestion under high load. PPD thus provides a flexible and efficient paradigm for multi-turn LLM serving.
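
As a rough illustration of weight-based dynamic routing, the sketch below scores the two placement options for a request and picks the cheaper one. It is not the PPD implementation: the names `Weights`, `Estimates`, and `route`, the linear cost form, and the latency estimates are all assumptions made for illustration; the abstract only states that routing is dynamic and that SLO trade-offs are set via configurable weights.

```python
from dataclasses import dataclass


@dataclass
class Weights:
    """Hypothetical SLO weights; larger means that metric matters more."""
    ttft: float = 1.0  # weight on time-to-first-token
    tpot: float = 1.0  # weight on time-per-output-token


@dataclass
class Estimates:
    """Hypothetical per-request latency estimates, in milliseconds."""
    ttft_local_ms: float          # TTFT if append-prefill runs on the decode node
    ttft_remote_ms: float         # TTFT via a prefill node, including KV transfer
    tpot_penalty_local_ms: float  # TPOT slowdown imposed on co-located decodes
    tpot_penalty_remote_ms: float # TPOT slowdown when prefill runs remotely


def route(turn: int, kv_cached: bool, est: Estimates, w: Weights) -> str:
    """Return "decode_node" or "prefill_node" for one request.

    Assumed policy: first turns, or turns whose KV cache is not resident on
    the decode node, always go to a prefill node. Otherwise compare a
    weighted cost of the two options and pick the smaller one.
    """
    if turn < 2 or not kv_cached:
        return "prefill_node"
    local_cost = w.ttft * est.ttft_local_ms + w.tpot * est.tpot_penalty_local_ms
    remote_cost = w.ttft * est.ttft_remote_ms + w.tpot * est.tpot_penalty_remote_ms
    return "decode_node" if local_cost <= remote_cost else "prefill_node"


if __name__ == "__main__":
    est = Estimates(ttft_local_ms=50.0, ttft_remote_ms=200.0,
                    tpot_penalty_local_ms=5.0, tpot_penalty_remote_ms=0.0)
    print(route(2, True, est, Weights()))                    # balanced weights
    print(route(2, True, est, Weights(ttft=0.0, tpot=1.0)))  # TPOT-only SLO
```

With these made-up numbers, balanced weights favor local append-prefill because the TTFT saving dominates, while a TPOT-only weighting sends the same request to a prefill node. This mirrors the claim that no single fixed routing strategy fits every SLO.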
