The key-value (KV) cache has become the dominant contributor to memory consumption in large language model (LLM) inference. Although offloading KVCache from GPU high-bandwidth memory (HBM) to CPU DRAM alleviates device memory pressure, DRAM remains capacity-limited and costly for large, persistent workloads. Solid-state drives (SSDs) provide a cost-effective alternative, but naive SSD-based paging is fundamentally bandwidth-bound due to limited PCIe throughput and per-device bandwidth constraints. In this paper, we observe that KVCache activations in real-world workloads exhibit strong and stable correlations. We term this phenomenon KVCache Co-Activation, where accessing a KV entry is often accompanied by a stable and recurring set of other KV entries. Leveraging this property, we present Swarm, an SSD-based KVCache offloading system that converts bandwidth-bound single-device access into parallel I/O across multiple SSDs. Specifically, Swarm clusters co-activated KV entries offline and distributes the resulting clusters across SSDs using graph-based placement with selective replication to maximize parallel I/O bandwidth. At runtime, Swarm performs load-balanced cluster retrieval and dynamically adapts clustering and caching decisions to sustain high bandwidth utilization under evolving access patterns. Evaluations show that Swarm reduces I/O time by 2.41x and improves effective bandwidth utilization by 2.72x.
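To make the co-activation idea concrete, the following is a minimal sketch, not Swarm's actual algorithm. It assumes access traces are available as lists of KV entry IDs per decoding step, groups entries whose pairwise co-access count reaches a threshold, and stripes each group across SSDs so that reading one cluster fans out over several devices. All function names, the threshold rule, and the round-robin striping are illustrative assumptions; the graph-based placement and selective replication described above are not modeled.

```python
from collections import Counter
from itertools import combinations


def co_activation_counts(traces):
    """Count how often each pair of KV entries is accessed in the same step."""
    counts = Counter()
    for step in traces:
        for a, b in combinations(sorted(set(step)), 2):
            counts[(a, b)] += 1
    return counts


def cluster_entries(traces, threshold):
    """Group entries linked by a pair count >= threshold (hypothetical rule).

    Uses union-find, so clusters are the connected components of the
    thresholded co-activation graph.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for step in traces:
        for entry in step:
            find(entry)  # register every entry, even if never merged
    for (a, b), count in co_activation_counts(traces).items():
        if count >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for entry in list(parent):
        clusters.setdefault(find(entry), []).append(entry)
    return sorted(sorted(members) for members in clusters.values())


def stripe_across_ssds(clusters, num_ssds):
    """Assign each cluster's members round-robin to SSD indices.

    Simple stand-in for placement: consecutive members of a cluster land on
    different devices, so one cluster read can be issued in parallel.
    """
    placement = {}
    offset = 0
    for cluster in clusters:
        for i, entry in enumerate(cluster):
            placement[entry] = (offset + i) % num_ssds
        offset += len(cluster)
    return placement


# Toy trace: entries 1-3 recur together, as do 4-5; entry 4 joins 1-3 once.
traces = [[1, 2, 3], [1, 2, 3], [4, 5], [4, 5], [1, 2, 3, 4]]
clusters = cluster_entries(traces, threshold=2)
placement = stripe_across_ssds(clusters, num_ssds=3)
```

In this toy trace, the single co-access of entry 4 with entries 1 to 3 falls below the threshold, so the two groups stay separate, and each group's members are spread over distinct SSD indices.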