Nowadays, service providers often deploy multiple types of LLM services within shared clusters. While such colocation improves resource utilization, it introduces significant interference risks for latency-sensitive (LS) services, which have strict SLO requirements on inference latency, and severely constrains the serving capacity of best-effort (BE) services due to limited available memory. To mitigate interference, existing systems typically reserve resource headroom to constrain BE resource usage. However, the coarse granularity of this approach compromises the SLO compliance of LS services and unnecessarily restricts the generation potential of BE services. In this paper, we propose OmniServe, a novel LLM serving system that efficiently harnesses both CPU and GPU resources to mitigate interference and improve throughput. Central to OmniServe is the Attention Piggybacking mechanism, which offloads the Attention computation of BE services to CPUs on the fly. This mechanism also enables asynchronous communication between CPU and GPU streams, preventing GPUs from blocking while aggregating Attention results. Additionally, OmniServe incorporates a dynamic batching control policy that adapts to fluctuating request arrivals, performing Dense-module computation with layer-wise batching. Experimental results show that OmniServe improves the SLO attainment rate of LS services by up to $1.48\times$ while enhancing BE serving throughput by up to $9.85\times$ compared to state-of-the-art systems.
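The asynchronous offload idea behind Attention Piggybacking can be illustrated with a minimal sketch. This is not the paper's implementation: the names (`cpu_attention`, `AttentionOffloader`, `submit`, `collect_ready`) and the use of a thread pool as a stand-in for CPU workers are assumptions for illustration only. The point it shows is that BE Attention work is dispatched to CPUs while the main (GPU) stream keeps running, and finished results are aggregated with a non-blocking poll.

```python
# Hedged sketch of asynchronous CPU attention offload (illustrative names,
# not the OmniServe implementation): BE attention runs on CPU workers while
# the caller's loop polls for completed results without blocking.
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def cpu_attention(q, k, v):
    """Scaled dot-product attention computed on CPU with NumPy."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

class AttentionOffloader:
    """Submits BE attention work to a CPU pool; results are gathered
    asynchronously so the main stream never waits on an unfinished task."""
    def __init__(self, workers=4):
        self.pool = ThreadPoolExecutor(max_workers=workers)
        self.pending = {}

    def submit(self, req_id, q, k, v):
        self.pending[req_id] = self.pool.submit(cpu_attention, q, k, v)

    def collect_ready(self):
        # Non-blocking poll: take only futures that have already finished.
        done = {rid: f.result() for rid, f in self.pending.items() if f.done()}
        for rid in done:
            del self.pending[rid]
        return done

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((2, 8)) for _ in range(3))
off = AttentionOffloader()
off.submit("be-req-1", q, k, v)
off.pool.shutdown(wait=True)  # in a real loop the GPU stream keeps running
out = off.collect_ready()["be-req-1"]
assert np.allclose(out, cpu_attention(q, k, v))
```

In a real system the GPU stream would interleave Dense-module batches between polls; the sketch forces completion with `shutdown` only so the example terminates deterministically.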