Efficient LLM inference scheduling is crucial for user experience. However, LLM inference exhibits remarkable demand uncertainty (the output length is unknown beforehand) and hybridity (it is both compute- and memory-intensive). Existing LLM schedulers rely on simple heuristics or focus purely on compute resources, and thus suffer suboptimal performance. In this work, we propose SageSched, an efficient LLM scheduler that properly handles the demand uncertainty and hybridity of inference workloads. SageSched combines prompt contents with past inference results to predict the output-length distribution in a lightweight yet accurate manner. Meanwhile, it models the true service cost of an inference request with both compute and memory aspects considered. Finally, SageSched employs an uncertainty-aware scheduling policy that yields the best overall efficiency given the request cost distributions. Testbed experiments over diverse setups confirm that SageSched attains an efficiency improvement of over 28.7%.