Large language models (LLMs) power a new generation of interactive AI applications exemplified by ChatGPT. The interactive nature of these applications demands low latency for LLM inference. Existing LLM serving systems use run-to-completion processing for inference jobs, which suffers from head-of-line blocking and long latency. We present FastServe, a distributed inference serving system for LLMs. FastServe exploits the autoregressive pattern of LLM inference to enable preemption at the granularity of each output token. FastServe uses preemptive scheduling to minimize latency with a novel skip-join Multi-Level Feedback Queue (MLFQ) scheduler. Based on the new semi-information-agnostic setting of LLM inference, the scheduler leverages the input length information to assign an appropriate initial queue for each arriving job to join. Queues with higher priority than the joined queue are skipped, reducing the number of demotions. We design an efficient GPU memory management mechanism that proactively offloads and uploads intermediate state between GPU memory and host memory for LLM inference. We build a system prototype of FastServe, and experimental results show that, compared to the state-of-the-art solution vLLM, FastServe improves the throughput by up to 31.4x and 17.9x under the same average and tail latency requirements, respectively.
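The skip-join idea can be illustrated with a minimal sketch. The sketch below assumes hypothetical queue quanta and a hypothetical linear cost model for predicting a job's first-iteration time from its input length; the real scheduler's parameters and cost model are not specified in the abstract. A job joins the highest-priority queue whose quantum can cover its predicted first output-token time, skipping queues it would immediately be demoted from.

```python
# Minimal sketch (hypothetical parameters) of skip-join MLFQ initial-queue
# assignment: jobs with longer inputs are predicted to need more time for
# their first output token, so they join a lower-priority queue directly
# instead of being repeatedly demoted from higher ones.

QUANTA = [1, 2, 4, 8, 16]  # per-queue time quantum (arbitrary units); queue 0 is highest priority


def predict_first_token_time(input_len: int) -> float:
    # Hypothetical linear cost model: first-iteration latency grows with
    # the number of input tokens processed in the prefill phase.
    return 0.01 * input_len


def initial_queue(input_len: int) -> int:
    t = predict_first_token_time(input_len)
    for q, quantum in enumerate(QUANTA):
        if t <= quantum:
            return q  # first queue whose quantum covers the predicted first iteration
    return len(QUANTA) - 1  # very long jobs start in the lowest-priority queue


# Short prompts join the top queue; long prompts skip ahead to lower queues.
print(initial_queue(50))    # predicted 0.5 -> queue 0
print(initial_queue(500))   # predicted 5.0 -> queue 3
print(initial_queue(5000))  # predicted 50.0 -> queue 4
```

After joining, a job that exhausts its quantum would be demoted one level, as in a standard MLFQ; skipping only affects the initial placement.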