Prefill-Decode (P/D) disaggregation has emerged as a widely adopted optimization strategy for Large Language Model (LLM) inference. However, there is currently no well-established methodology for determining the optimal number of P/D hardware resources subject to constraints on total throughput, service level objectives (SLOs), and request characteristics (specifically, input and output lengths). To address this gap, we propose a hybrid approach that combines theoretical modeling with empirical benchmarking. First, we present a theoretical model for calculating P/D resource counts based on the total throughput requirement, request input and output lengths, and prefill and decode throughput. Then, to obtain the actual prefill and decode throughput under SLO constraints, we model the prefill process with M/M/1 queueing theory, deriving the achieved prefill throughput from the benchmarked maximum prefill throughput and the Time-To-First-Token (TTFT) target. For the decode phase, we determine the decode batch sizes that meet Time-Per-Output-Token (TPOT) requirements and obtain the corresponding decode throughput through empirical measurement. Our experimental results demonstrate that the proposed method accurately predicts the optimal P/D resource allocation in real-world LLM inference scenarios.
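The two calculations outlined above can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: it assumes the M/M/1 mean-sojourn-time bound (TTFT ≤ 1/(μ − λ), so λ ≤ μ − 1/TTFT) for the prefill stage, and that resource counts are obtained by dividing the aggregate token demand by per-instance throughput. All function and parameter names are hypothetical.

```python
import math

def achieved_prefill_throughput(max_prefill_tp, ttft_slo, input_len):
    """Hedged M/M/1 sketch: treat one prefill instance as an M/M/1 queue.
    Service rate mu = max_prefill_tp / input_len (requests/s).
    Mean sojourn time 1/(mu - lam) <= ttft_slo  =>  lam <= mu - 1/ttft_slo.
    Returns the achievable throughput converted back to tokens/s."""
    mu = max_prefill_tp / input_len          # requests/s per instance
    lam = max(mu - 1.0 / ttft_slo, 0.0)      # admissible arrival rate
    return lam * input_len                   # tokens/s under the TTFT SLO

def pd_instance_counts(req_rate, input_len, output_len,
                       prefill_tp, decode_tp):
    """Instances needed so aggregate throughput covers token demand.
    prefill_tp / decode_tp are per-instance tokens/s already reduced
    to their SLO-constrained values (TTFT for prefill, TPOT for decode)."""
    n_prefill = math.ceil(req_rate * input_len / prefill_tp)
    n_decode = math.ceil(req_rate * output_len / decode_tp)
    return n_prefill, n_decode

# Example: 10k tok/s max prefill, 1 s TTFT SLO, 1000-token inputs
# -> mu = 10 req/s, lam = 9 req/s, i.e. 9000 tok/s achievable.
slo_prefill_tp = achieved_prefill_throughput(10_000, 1.0, 1000)
counts = pd_instance_counts(20, 1000, 200, slo_prefill_tp, 2000)
```

In this sketch the decode-side throughput is taken as a given (measured at the largest batch size that still satisfies the TPOT target), mirroring the empirical step described in the abstract.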