Serving Large Language Models (LLMs) can benefit immensely from parallelizing both the model and input requests across multiple devices, but incoming workloads exhibit substantial spatial and temporal heterogeneity. Spatially, workloads comprise heterogeneous requests with varying compute and memory demands. Temporally, workload composition varies over time. Nevertheless, existing systems typically assume spatially uniform and temporally stable workloads, employing a homogeneous, static model deployment. This mismatch between the assumption and real-world spatial-temporal heterogeneity results in suboptimal performance. We present OServe, an LLM serving system with heterogeneous and flexible model deployment that addresses both spatial and temporal heterogeneity. First, OServe introduces a novel workload-aware scheduling algorithm that optimizes heterogeneous model deployments according to real-time workload characteristics. Second, OServe proposes an efficient workload-adaptive switching method that migrates model deployments in response to predicted workload changes. Experiments on real-world traces show that OServe improves performance by up to 2$\times$ (average: 1.5$\times$) compared to state-of-the-art serving systems.