LLM serving relies on prefix caching to improve inference performance. As growing contexts push the key-value (KV) cache footprint far beyond GPU HBM and CPU DRAM capacity, the KV cache is increasingly offloaded to NVMe SSDs. Unfortunately, restoring the KV cache from SSDs suffers from poor I/O performance and incurs significant GPU stalls. This is primarily because the fragmented GPU memory layout produces a massive number of tiny random I/Os, which makes the low-parallelism CPU a severe bottleneck. Even GPU Direct Storage (GDS) still relies on CPU intervention to initiate each I/O and thus remains CPU-centric. This paper presents Tutti, an efficient SSD-backed KV caching solution that eliminates CPU intervention from the critical data and I/O control paths between HBM and SSDs. At the core of Tutti is a GPU-centric KV cache object store, in which the CPU is only responsible for asynchronously loading I/O kernels onto the GPU once per layer. Tutti saturates NVMe SSD bandwidth and reduces GPU stalls to near zero through the following designs: (i) we provide a GPU-native object abstraction that enables bulk KV cache transfers and management; (ii) we re-architect the GPU storage stack by introducing GPU io_uring to support asynchronous GPU direct object I/O; and (iii) we propose slack-aware I/O scheduling to avoid GPU resource contention. We have implemented Tutti and integrated it into vLLM. Extensive evaluation shows that, compared to the state-of-the-art GDS-enabled, SSD-backed LMCache, Tutti reduces TTFT by 78.3% under strict SLO constraints, improves the achievable request rate by 2x, and reduces serving cost by 27%. Tutti achieves nearly the same inference performance as DRAM-backed LMCache while providing practically unlimited capacity.
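To make the "massive number of tiny random I/Os" concrete, the following back-of-the-envelope sketch counts the I/O requests needed to restore a cached prefix. It is not taken from Tutti. It assumes a paged KV layout in which each (layer, K-or-V, block) fragment is a separate contiguous region in GPU memory, and it compares that against a hypothetical bulk layout that coalesces each layer's K and V data for the prefix into a single object. The `KVShape` type, the function names, and all model parameters are illustrative assumptions.

```python
from dataclasses import dataclass
from math import ceil


@dataclass(frozen=True)
class KVShape:
    """Illustrative model and cache parameters (assumed, not from the paper)."""
    num_layers: int
    num_kv_heads: int
    head_dim: int
    dtype_bytes: int   # e.g. 2 for fp16
    block_size: int    # tokens per paged KV block


def fragment_bytes(shape: KVShape) -> int:
    """Size of one contiguous fragment: K (or V) of one block in one layer."""
    return shape.block_size * shape.num_kv_heads * shape.head_dim * shape.dtype_bytes


def fragmented_io_count(shape: KVShape, num_tokens: int) -> int:
    """One I/O per (block, layer, K-or-V) fragment under the assumed paged layout."""
    num_blocks = ceil(num_tokens / shape.block_size)
    return num_blocks * shape.num_layers * 2


def bulk_io_count(shape: KVShape) -> int:
    """Hypothetical bulk layout: one object (K and V together) per layer."""
    return shape.num_layers


def total_bytes(shape: KVShape, num_tokens: int) -> int:
    """Total bytes moved, counting whole blocks, independent of the layout."""
    return fragmented_io_count(shape, num_tokens) * fragment_bytes(shape)


if __name__ == "__main__":
    # Hypothetical 32-layer model with 8 KV heads of dimension 128 in fp16.
    shape = KVShape(num_layers=32, num_kv_heads=8, head_dim=128,
                    dtype_bytes=2, block_size=16)
    tokens = 4096
    print("fragment size (KiB):", fragment_bytes(shape) // 1024)
    print("fragmented I/Os:", fragmented_io_count(shape, tokens))
    print("bulk I/Os:", bulk_io_count(shape))
    print("total size (MiB):", total_bytes(shape, tokens) // (1024 * 1024))
```

Under these assumed parameters, the same volume of data is moved either as many small requests or as a few large ones. If the CPU must initiate each request, the request count rather than the byte count determines how much CPU work sits on the critical path.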