Large Language Models (LLMs) represent a revolutionary advancement toward artificial general intelligence (AGI). As exemplified by ChatGPT, LLM-based applications demand low response latency and high throughput from inference serving. However, because the execution time of an LLM request is not known in advance, the first-come-first-serve (FCFS) scheduling policy employed by current LLM serving systems suffers from head-of-line (HoL) blocking and long job response times. In this paper, we propose a new efficient LLM inference serving framework, named ALISE. The key design paradigm of ALISE is a novel speculative scheduler that estimates the execution time of each job and exploits this prior knowledge to assign appropriate job priorities, thus minimizing potential queuing delays for heterogeneous workloads. Furthermore, to mitigate the memory overhead of the intermediate key-value (KV) cache, we employ a priority-based adaptive memory management protocol and quantization-based compression. Evaluations demonstrate that, compared with the state-of-the-art solution vLLM, ALISE improves serving throughput by up to 1.8x and 2.1x under the same latency constraint on the Alpaca and ShareGPT datasets, respectively.
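To illustrate the scheduling idea at a high level: a speculative scheduler can order pending jobs by their predicted execution times rather than arrival order, so short requests are not stuck behind long ones. The sketch below is a minimal, hypothetical illustration of such shortest-predicted-job-first ordering using a heap; it is not ALISE's actual implementation, and all names (`Job`, `schedule`, `predicted_time`) are invented for this example.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Job:
    predicted_time: float               # speculative estimate of execution time
    job_id: str = field(compare=False)  # identifier, not used for ordering

def schedule(jobs):
    """Serve jobs in order of predicted execution time (shortest first),
    approximating SJF to reduce head-of-line blocking relative to FCFS."""
    heap = list(jobs)
    heapq.heapify(heap)         # min-heap keyed on predicted_time
    order = []
    while heap:
        order.append(heapq.heappop(heap).job_id)
    return order

# Under FCFS, the long job "a" would block "b" and "c"; priority
# ordering by predicted time serves the short jobs first.
print(schedule([Job(9.0, "a"), Job(1.5, "b"), Job(3.0, "c")]))  # ['b', 'c', 'a']
```

In practice the predictions are imperfect, which is why the paper pairs this ordering with adaptive memory management for the KV cache of preempted or deprioritized jobs.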