Nowadays, service providers often deploy multiple types of LLM services within shared clusters. While such colocation improves resource utilization, it introduces significant interference risks for latency-sensitive (LS) services, which have strict SLO requirements on inference latency, and severely constrains the service capacity of best-effort (BE) services due to limited available memory. To address interference, existing systems typically rely on reserved headroom to constrain BE resource usage. However, the coarse granularity of this approach both compromises the SLO compliance of LS services and unnecessarily restricts the generation potential of BE services. In this paper, we propose OmniServe, a novel LLM serving system that efficiently harnesses both CPU and GPU resources to mitigate interference and improve throughput. Central to OmniServe is the Attention Piggybacking mechanism, which offloads the Attention computation of BE services to CPUs on the fly. This mechanism also enables asynchronous communication between CPU and GPU streams, preventing GPUs from being blocked while aggregating Attention results. Additionally, OmniServe incorporates a dynamic batching control policy that adapts to fluctuating request arrivals and performs Dense-module computation with layer-wise batching. Experimental results show that OmniServe improves the SLO attainment rate of LS services by up to $1.48\times$ while increasing BE serving throughput by up to $9.85\times$ compared to state-of-the-art systems.
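The core idea of offloading BE Attention to the CPU while the GPU stream continues uninterrupted can be sketched as follows. This is a minimal illustration, not the OmniServe implementation: a thread-pool worker stands in for the CPU stream, the main thread stands in for the GPU stream, and all function names, shapes, and the toy `attention`/`dense` kernels are hypothetical.

```python
# Illustrative sketch of the "piggybacking" pattern (all names hypothetical):
# BE attention runs on a CPU worker while the main stream keeps computing
# Dense-module work, and the attention result is aggregated asynchronously.
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def attention(q, k, v):
    # Standard scaled dot-product attention with a numerically stable softmax.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def dense(x, w):
    # Stand-in for a Dense (projection + activation) module.
    return np.maximum(x @ w, 0.0)

rng = np.random.default_rng(0)
d = 64
q, k, v = (rng.standard_normal((8, d)) for _ in range(3))
w = rng.standard_normal((d, d))

with ThreadPoolExecutor(max_workers=1) as cpu_pool:
    # "Piggyback" the BE request's attention onto a CPU worker...
    be_future = cpu_pool.submit(attention, q, k, v)
    # ...while the main stream proceeds with Dense computation unblocked.
    ls_out = dense(rng.standard_normal((8, d)), w)
    # Aggregate the CPU attention result only when it is actually needed.
    be_out = be_future.result()
```

In the real system the two sides are a CUDA stream and CPU threads communicating asynchronously; the sketch only conveys the overlap-then-aggregate structure.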