DistServe improves the performance of large language model (LLM) serving by disaggregating the prefill and decoding computation. Existing LLM serving systems colocate the two phases and batch the prefill and decoding computation across all users and requests. We find that this strategy not only leads to strong prefill-decoding interference but also couples the resource allocation and parallelism plans for both phases. LLM applications often emphasize individual latency for each phase: time to first token (TTFT) for the prefill phase and time per output token (TPOT) for the decoding phase. In the presence of stringent latency requirements, existing systems have to either prioritize one latency over the other or over-provision compute resources to meet both. DistServe assigns prefill and decoding computation to different GPUs, hence eliminating prefill-decoding interference. Given the application's TTFT and TPOT requirements, DistServe co-optimizes the resource allocation and parallelism strategy tailored for each phase. DistServe also places the two phases according to the serving cluster's bandwidth to minimize the communication caused by disaggregation. As a result, DistServe significantly improves LLM serving performance in terms of the maximum rate that can be served within both TTFT and TPOT constraints on each GPU. Our evaluations show that on various popular LLMs, applications, and latency requirements, DistServe can serve 7.4x more requests or meet 12.6x tighter SLOs compared to state-of-the-art systems, while staying within latency constraints for more than 90% of requests.
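The two latency metrics central to the abstract, TTFT and TPOT, can be made concrete with a minimal sketch of how they are typically computed from per-token timestamps of a single streamed request. The function and variable names here (`ttft`, `tpot`, `request_start`, `token_times`) are illustrative assumptions, not part of DistServe's API:

```python
# Illustrative sketch (not from the paper): computing the two per-request
# latency metrics from the arrival time and the emission timestamps of tokens.

def ttft(request_start: float, token_times: list[float]) -> float:
    """Time to first token: the prefill-phase latency seen by the user."""
    return token_times[0] - request_start

def tpot(token_times: list[float]) -> float:
    """Time per output token: the average gap between successive decode steps."""
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    return sum(gaps) / len(gaps)

# Example: request arrives at t=0.0 s, first token at 0.4 s (prefill),
# then one token roughly every 50 ms (decoding).
times = [0.4, 0.45, 0.50, 0.55]
print(ttft(0.0, times))          # prefill latency
print(round(tpot(times), 3))     # average decode-step latency
```

Colocated systems let long prefills delay in-flight decode steps (inflating TPOT) and queued decodes delay prefills (inflating TTFT); serving each phase on dedicated GPUs lets each metric be optimized independently.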