Large language models (LLMs) have been driving a new wave of interactive AI applications across numerous domains. However, efficiently serving LLM inference requests is challenging because their execution times are unpredictable, a consequence of the autoregressive nature of generative models. Existing LLM serving systems use first-come-first-served (FCFS) scheduling and therefore suffer from head-of-line blocking. To address the non-deterministic nature of LLMs and enable efficient interactive LLM serving, we present a speculative shortest-job-first (SSJF) scheduler that uses a lightweight proxy model to predict LLM output sequence lengths. Our open-source SSJF implementation does not require changes to memory management or batching strategies. Evaluations on real-world datasets and production workload traces show that SSJF reduces average job completion times by 30.5–39.6% and increases throughput by 2.2–3.6x compared to FCFS schedulers, across no-batching, dynamic-batching, and continuous-batching settings.
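To make the scheduling idea concrete, the following is a minimal sketch of shortest-job-first ordering driven by predicted output lengths, compared against FCFS. It is not the authors' implementation. The `Request` fields, the `simulate` helper, and the hard-coded `predicted_len` values (standing in for the proxy model's output) are all hypothetical, and the simulation assumes a single non-preemptive server, no batching, and execution time proportional to the true output length.

```python
import heapq
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Request:
    req_id: int
    arrival: float      # arrival time, in arbitrary time units
    true_len: int       # actual output length in tokens (unknown to the scheduler)
    predicted_len: int  # hypothetical proxy-model estimate of the output length


def simulate(requests: List[Request],
             priority: Callable[[Request], float],
             time_per_token: float = 1.0) -> float:
    """Run requests one at a time (no batching, no preemption).

    The waiting request with the smallest priority value runs next.
    Returns the average job completion time (finish time minus arrival time).
    """
    pending = sorted(requests, key=lambda r: r.arrival)
    queue = []  # min-heap of (priority, req_id, request); req_id breaks ties
    clock = 0.0
    total_jct = 0.0
    next_idx = 0
    finished = 0
    while finished < len(pending):
        # Admit every request that has arrived by the current time.
        while next_idx < len(pending) and pending[next_idx].arrival <= clock:
            r = pending[next_idx]
            heapq.heappush(queue, (priority(r), r.req_id, r))
            next_idx += 1
        if not queue:
            # Server is idle: jump ahead to the next arrival.
            clock = pending[next_idx].arrival
            continue
        _, _, r = heapq.heappop(queue)
        clock += r.true_len * time_per_token  # assumed cost model
        total_jct += clock - r.arrival
        finished += 1
    return total_jct / len(pending)


def fcfs(r: Request) -> float:
    """First-come-first-served: order by arrival time."""
    return r.arrival


def ssjf(r: Request) -> float:
    """Speculative SJF: order by the predicted output length."""
    return float(r.predicted_len)


# Toy workload: one long request arrives alongside two short ones.
requests = [
    Request(req_id=0, arrival=0.0, true_len=100, predicted_len=90),
    Request(req_id=1, arrival=0.0, true_len=10, predicted_len=12),
    Request(req_id=2, arrival=0.0, true_len=20, predicted_len=25),
]

fcfs_avg = simulate(requests, fcfs)
ssjf_avg = simulate(requests, ssjf)
print(f"FCFS average JCT: {fcfs_avg:.1f}")
print(f"SSJF average JCT: {ssjf_avg:.1f}")
```

In this toy workload the long request blocks the two short ones under FCFS, giving an average completion time of about 113.3 time units, while ordering by predicted length gives about 56.7. The predictions only need to rank requests correctly, not match the true lengths exactly, for the ordering to help.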