Large language models (LLMs) have surged in popularity and are extensively used in commercial applications, where the efficiency of model serving is crucial to the user experience. Most current research focuses on optimizing individual sub-procedures, such as local inference and communication; however, no comprehensive framework exists that provides a holistic system view for optimizing LLM serving end to end. In this work, we conduct a detailed analysis to identify the major bottlenecks that affect end-to-end latency in LLM serving systems. Our analysis reveals that a comprehensive LLM serving endpoint must address a series of efficiency bottlenecks that extend beyond LLM inference. We then propose ScaleLLM, an optimized system for resource-efficient LLM serving. Our extensive experiments show that, with 64 concurrent requests, ScaleLLM achieves a 4.3x speedup over vLLM and outperforms state-of-the-art systems with 1.5x higher throughput.