Large language models (LLMs) are increasingly deployed across AI infrastructure, driving the need for high-throughput, resource-efficient serving systems. Disaggregated LLM serving, which separates prompt prefill from auto-regressive decode, has emerged as a promising architecture because it isolates the heterogeneous compute and memory demands of the two stages. However, current disaggregated systems face three key limitations: (i) static resource allocation cannot adapt to highly dynamic workloads, causing over-provisioning that wastes resources or under-provisioning that violates service-level objectives (SLOs); (ii) inherent load imbalance between the prefill and decode stages, where prefill is compute-bound and decode is memory-bound, leaves one tier under-utilized while the other becomes a bottleneck; and (iii) prefix-cache-aware routing skews load distribution, as prefill nodes with high cache hit rates attract disproportionately many requests, further degrading balance and efficiency. To address these issues, we present BanaServe, a dynamic orchestration framework that continuously rebalances compute and memory resources across prefill and decode instances while eliminating cache-induced hotspots. BanaServe introduces layer-level weight migration, attention-level Key-Value Cache (KV Cache) migration, and Global KV Cache Store sharing with layer-wise overlapped transmission, enabling both coarse-grained (layer-level) and fine-grained (attention-level) load redistribution with minimal latency overhead. These mechanisms allow routers to perform purely load-aware scheduling, unconstrained by cache placement. Compared to vLLM, BanaServe achieves 1.2x-3.9x higher throughput with 3.9%-78.4% lower total processing time, and it outperforms DistServe with 1.1x-2.8x higher throughput and 1.4%-70.1% lower latency.
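The layer-wise overlapped transmission idea is easiest to see in code. The sketch below is a minimal illustration under stated assumptions, not BanaServe's implementation: it simulates a prefill pass in which layer i's finished KV cache is shipped to the decode instance in the background while layer i+1 is still being computed, hiding most of the transfer latency behind compute. All names and timings (compute_layer_kv, send_layer_kv, NUM_LAYERS, the per-layer constants) are hypothetical placeholders.

```python
# Minimal sketch (assumed, not BanaServe's code) of layer-wise overlapped
# KV cache transmission: transfer of layer i overlaps with compute of layer i+1.
import asyncio
import time

NUM_LAYERS = 4        # hypothetical model depth
COMPUTE_MS = 0.010    # simulated per-layer prefill compute time (s)
TRANSFER_MS = 0.008   # simulated per-layer KV transfer time (s)

async def compute_layer_kv(layer: int) -> bytes:
    """Simulate prefill producing one layer's KV cache."""
    await asyncio.sleep(COMPUTE_MS)
    return f"kv-layer-{layer}".encode()

async def send_layer_kv(layer: int, kv: bytes) -> None:
    """Simulate pushing one layer's KV cache to the decode instance."""
    await asyncio.sleep(TRANSFER_MS)
    print(f"layer {layer}: {len(kv)} bytes transferred")

async def prefill_and_stream() -> None:
    """Compute layers in order; ship each layer's KV in the background."""
    inflight: list[asyncio.Task] = []
    for layer in range(NUM_LAYERS):
        kv = await compute_layer_kv(layer)                    # compute layer i
        inflight.append(asyncio.create_task(send_layer_kv(layer, kv)))  # overlap its transfer
    await asyncio.gather(*inflight)                           # drain remaining transfers

if __name__ == "__main__":
    start = time.perf_counter()
    asyncio.run(prefill_and_stream())
    overlapped = (time.perf_counter() - start) * 1e3
    serialized = (COMPUTE_MS + TRANSFER_MS) * NUM_LAYERS * 1e3
    print(f"total: {overlapped:.1f} ms (vs ~{serialized:.1f} ms fully serialized)")
```

With overlap, the end-to-end cost approaches NUM_LAYERS x compute plus a single trailing transfer, rather than the serialized sum of both per layer.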
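The "purely load-aware scheduling" claim can likewise be sketched. Assuming a shared Global KV Cache Store makes any request's KV blocks reachable from any instance, the router can ignore prefix cache placement entirely and pick the least-loaded instance. The sketch below is a hypothetical illustration, not BanaServe's router: Instance, queued_tokens, and route are invented names, and queued tokens stand in for whatever load signal a real scheduler would use.

```python
# Minimal sketch (assumed) of purely load-aware routing: cache placement is
# irrelevant because KV blocks are reachable through the shared global store.
from dataclasses import dataclass

@dataclass
class Instance:
    name: str
    queued_tokens: int = 0  # proxy for the instance's current load

def route(request_tokens: int, instances: list[Instance]) -> Instance:
    """Pick the instance with the fewest queued tokens and charge it the request."""
    target = min(instances, key=lambda inst: inst.queued_tokens)
    target.queued_tokens += request_tokens
    return target

if __name__ == "__main__":
    pool = [Instance("prefill-0"), Instance("prefill-1"), Instance("prefill-2")]
    for tokens in (512, 2048, 256, 1024):
        chosen = route(tokens, pool)
        print(f"{tokens:>5} tokens -> {chosen.name} (queue={chosen.queued_tokens})")
```

Contrast this with prefix-cache-aware routing, where requests sharing a hot prefix all chase the same node; here the min-load rule spreads them evenly, which is exactly the imbalance limitation (iii) that the global store removes.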