In LLM serving, reusing the KV cache of prompts across requests is critical for reducing time-to-first-token (TTFT) and serving costs. Cache-affinity scheduling, which co-locates requests sharing a prompt prefix to maximize KV cache reuse, often conflicts with load-balancing scheduling, which distributes requests evenly across compute instances. Existing schedulers fail to reconcile this trade-off because they operate within a single mapping space, typically applying cache-affinity routing to a subset of requests and load-balanced routing to the rest, with no unified mechanism that achieves both goals. To address this limitation, we propose DualMap, a dual-mapping scheduling strategy for distributed LLM serving that achieves both cache affinity and load balancing. Its key idea is to map each request to two candidate instances via two independent hash functions over the request prompt, and then select the better candidate based on the current system state. This design increases the likelihood that requests with shared prefixes are co-located, while evenly dispersing distinct prefixes across the cluster via ``the power of two choices''. To make DualMap robust under dynamic and skewed real-world workloads, we incorporate three techniques: 1) SLO-aware request routing, which prioritizes cache affinity but switches to load-aware scheduling when TTFT exceeds the service-level objective (SLO), improving load balance without sacrificing cache reuse; 2) hotspot-aware rebalancing, which dynamically migrates requests from overloaded to underloaded instances, mitigating hotspots; and 3) lightweight dual-hash-ring scaling, which leverages a dual-hash-ring mapping to support fast, low-overhead instance scaling without costly global remapping. Experiments on real-world workloads show that DualMap improves effective request capacity by up to 2.25$\times$ under the same TTFT SLO constraints compared with state-of-the-art schedulers.