Large language models (LLMs) have surged in popularity and are extensively used in commercial applications, where the efficiency of model serving is crucial for the user experience. Most current research focuses on optimizing individual sub-procedures, e.g., local inference and communication; however, there is no comprehensive framework that provides a holistic system view for optimizing LLM serving in an end-to-end manner. In this work, we conduct a detailed analysis to identify the major bottlenecks that affect end-to-end latency in LLM serving systems. Our analysis reveals that a comprehensive LLM serving endpoint must address a series of efficiency bottlenecks that extend beyond LLM inference itself. We then propose ScaleLLM, an optimized system for resource-efficient LLM serving. Our extensive experiments show that, with 64 concurrent requests, ScaleLLM achieves a 4.3x speedup over vLLM and outperforms state-of-the-art systems with 1.5x higher throughput.