To guarantee service quality in Transformer-based large language model (LLM) serving, it is essential to meet the latency constraints of both the prefill phase (measured by Time-to-First-Token, TTFT) and the decode phase (measured by Time-per-Output-Token, TPOT). Non-disaggregated serving places prefill and decode on the same worker, while disaggregated serving places them on isolated workers. However, neither architecture excels in both TTFT and TPOT. Our root cause analysis shows that in disaggregated LLM serving, prefill execution interferes minimally with decode execution but incurs high queuing times. In contrast, non-disaggregated LLM serving effectively reduces queuing times but introduces significant interference between prefills and decodes. To combine the strengths of both non-disaggregated and disaggregated LLM serving, we have designed and implemented Tropical. Tropical introduces a service-level objective (SLO)-aware multiplexing strategy that balances queuing time against interference, enabling LLM serving to achieve high TTFT and TPOT SLO attainment simultaneously. Our evaluation on real-world datasets shows that Tropical outperforms state-of-the-art non-disaggregated and disaggregated LLM serving systems, serving up to 2.09× more requests at 90% SLO attainment. Specifically, compared to the disaggregated LLM serving system, Tropical improves P90 TTFT by 9× with only a 15% degradation in P90 TPOT. Against the non-disaggregated LLM serving systems, Tropical delivers a 2.8× improvement in P90 TPOT while maintaining the same P90 TTFT.
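The abstract reports results as SLO attainment and P90 latencies. As a rough illustration of how such metrics can be computed from per-request measurements, here is a minimal sketch. It assumes that a request attains its SLO only when both its TTFT and its TPOT are within their thresholds, and it uses the nearest-rank percentile definition; the threshold values and sample latencies are hypothetical and not taken from the paper.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile of `values` for q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def slo_attainment(requests, ttft_slo, tpot_slo):
    """Fraction of requests whose TTFT and TPOT both meet their SLOs.

    `requests` is a list of (ttft_seconds, tpot_seconds) pairs.
    Requiring both SLOs per request is an assumption of this sketch.
    """
    if not requests:
        return 0.0
    met = sum(
        1 for ttft, tpot in requests
        if ttft <= ttft_slo and tpot <= tpot_slo
    )
    return met / len(requests)


# Hypothetical per-request measurements: (TTFT, TPOT) in seconds.
sample = [(0.2, 0.04), (0.5, 0.05), (1.5, 0.03), (0.3, 0.12), (0.4, 0.06)]

attainment = slo_attainment(sample, ttft_slo=1.0, tpot_slo=0.1)
p90_ttft = percentile([ttft for ttft, _ in sample], 90)
p90_tpot = percentile([tpot for _, tpot in sample], 90)
print(f"SLO attainment: {attainment:.0%}, "
      f"P90 TTFT: {p90_ttft}s, P90 TPOT: {p90_tpot}s")
```

Under this definition, a claim such as "more requests at 90% SLO attainment" corresponds to the highest request rate at which `slo_attainment` stays at or above 0.9.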