Large language models (LLMs) power a new generation of interactive AI applications, exemplified by ChatGPT. The interactive nature of these applications demands low job completion time (JCT) for model inference. Existing LLM serving systems use run-to-completion processing for inference jobs, which suffers from head-of-line blocking and long JCT. We present FastServe, a distributed inference serving system for LLMs. FastServe exploits the autoregressive pattern of LLM inference to enable preemption at the granularity of each output token. FastServe uses preemptive scheduling to minimize JCT with a novel skip-join Multi-Level Feedback Queue scheduler. Based on the new semi information-agnostic setting of LLM inference, the scheduler uses the input length to assign each arriving job an appropriate initial queue. Queues with higher priority than the joined queue are skipped to reduce demotions. We design an efficient GPU memory management mechanism that proactively offloads and uploads intermediate states between GPU memory and host memory for LLM inference. We build a system prototype of FastServe based on NVIDIA FasterTransformer. Experimental results show that, compared to the state-of-the-art solution Orca, FastServe improves the average and tail JCT by up to 5.1$\times$ and 6.4$\times$, respectively.
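The skip-join idea described above can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not FastServe's implementation: the class and method names (`Job`, `SkipJoinMLFQ`, `initial_queue`), the quantum values, and the cost model (the first iteration costs time proportional to the input length, every later token costs one unit) are all hypothetical and chosen only for illustration. The output length is stored on the job to drive the simulation, but the scheduler never reads it when choosing a queue, which mirrors the semi information-agnostic setting.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    input_len: int        # prompt length, known when the job arrives
    output_len: int       # tokens to generate; unknown to the scheduler
    generated: int = 0
    queue: int = 0
    finish_time: float = 0.0


class SkipJoinMLFQ:
    """Toy skip-join multi-level feedback queue (illustrative only)."""

    def __init__(self, quanta, prefill_cost_per_token=1.0, decode_cost=1.0):
        self.quanta = list(quanta)            # quantum per level, increasing
        self.queues = [deque() for _ in self.quanta]
        self.prefill_cost_per_token = prefill_cost_per_token
        self.decode_cost = decode_cost

    def iteration_cost(self, job):
        # Assumed cost model: the first iteration processes the whole prompt,
        # each later iteration produces one token at a fixed cost.
        if job.generated == 0:
            return job.input_len * self.prefill_cost_per_token
        return self.decode_cost

    def initial_queue(self, job):
        # Skip-join: enter the highest-priority queue whose quantum covers the
        # first iteration, skipping every queue above it.
        first = job.input_len * self.prefill_cost_per_token
        for level, quantum in enumerate(self.quanta):
            if first <= quantum:
                return level
        return len(self.quanta) - 1

    def submit(self, job):
        job.queue = self.initial_queue(job)
        self.queues[job.queue].append(job)

    def run(self):
        clock, finished = 0.0, []
        while any(self.queues):
            level = next(i for i, q in enumerate(self.queues) if q)
            job = self.queues[level].popleft()
            used = 0.0
            # Token-granularity execution: stop between tokens once the
            # quantum is exhausted (always run at least one iteration).
            while job.generated < job.output_len:
                cost = self.iteration_cost(job)
                if used > 0 and used + cost > self.quanta[level]:
                    break
                used += cost
                job.generated += 1
            clock += used
            if job.generated >= job.output_len:
                job.finish_time = clock
                finished.append(job)
            else:
                # Quantum used up: demote to the next lower-priority queue.
                job.queue = min(level + 1, len(self.queues) - 1)
                self.queues[job.queue].append(job)
        return finished


sched = SkipJoinMLFQ(quanta=[2, 4, 8])
sched.submit(Job("long-prompt", input_len=6, output_len=1))
sched.submit(Job("short-prompt", input_len=1, output_len=2))
for job in sched.run():
    print(job.name, job.finish_time)
```

In this toy run the long-prompt job arrives first but joins the lowest-priority queue directly, so the short job is not stuck behind it. A plain MLFQ would instead place the long job in the top queue and demote it repeatedly before its first iteration could complete within a quantum.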