Serving Large Language Models (LLMs) can benefit immensely from parallelizing both the model and input requests across multiple devices, but incoming workloads exhibit substantial spatial and temporal heterogeneity. Spatially, workloads comprise heterogeneous requests with varying compute and memory demands. Temporally, workload composition varies over time. Nevertheless, existing systems typically assume spatially uniform and temporally stable workloads, employing a homogeneous, static model deployment. This mismatch between those assumptions and real-world spatial-temporal heterogeneity results in suboptimal performance. We present OServe, an LLM serving system with heterogeneous and flexible model deployment that addresses both spatial and temporal heterogeneity. First, OServe introduces a novel workload-aware scheduling algorithm that optimizes heterogeneous model deployments according to real-time workload characteristics. Second, OServe proposes an efficient workload-adaptive switching method that migrates model deployments in response to predicted workload changes. Experiments on real-world traces show that OServe improves performance by up to 2$\times$ (average: 1.5$\times$) compared to state-of-the-art serving systems.