ContiguousKV: Accelerating LLM Prefill with Granularity-Aligned KV Cache Management

Efficiently serving Large Language Models (LLMs) with a persistent prefix Key-Value (KV) cache is critical for applications such as conversational search and multi-turn dialogue. Serving a request requires loading the pre-computed prefix KV cache and generating the first token, which we define as the Re-Prefill phase. Offloading this shared prefix cache to secondary storage is essential for memory scalability. However, Re-Prefill with offloading suffers from severe I/O bottlenecks in two respects. First, semantic-aware KV cache pruning algorithms select important tokens at fine granularity, while systems manage I/O in coarse, fixed-size blocks, causing severe read amplification. Second, the sequential dependency between identifying important tokens and loading their KV cache creates idle I/O and compute bubbles, under-utilizing system resources. This paper proposes \textit{ContiguousKV}, a high-performance prefix KV cache offloading system that bridges algorithmic semantics with I/O efficiency to accelerate the Re-Prefill phase. We first introduce \textit{ContiguousChunk}, a unified data management granularity that aligns KV cache pruning with I/O operations. All mechanisms critical to I/O performance operate at ContiguousChunk granularity, thereby eliminating read amplification. Next, by exploiting the high similarity of important ContiguousChunk indices across layers, we propose intra- and inter-period asynchronous prefetching to break the sequential dependency between I/O and compute, effectively eliminating idle bubbles. Finally, we propose attention-guided cache management to retain semantically critical prefix data in memory. Evaluations on Qwen2.5 series models show that ContiguousKV achieves a 3.85x speedup in the Re-Prefill phase over the state-of-the-art offloading system IMPRESS, while maintaining high output quality.
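
To make the granularity mismatch concrete, the following minimal sketch compares read amplification when tokens are selected individually but fetched in coarse blocks against selection and I/O that share one chunk size. It is illustrative only and is not the ContiguousKV implementation: the function names, the 64-token prefix, the 16-token block, the 4-token chunk, and the summed-score chunk ranking are all assumptions chosen for the example.

```python
def read_amplification(selected_tokens, io_unit_tokens):
    """Tokens fetched from storage divided by tokens actually needed.

    Storage is read in whole I/O units of `io_unit_tokens` tokens each
    (hypothetical model; real systems read bytes, not tokens).
    """
    units_touched = {t // io_unit_tokens for t in selected_tokens}
    return len(units_touched) * io_unit_tokens / len(selected_tokens)


def select_chunks(scores, chunk_tokens, budget_chunks):
    """Pick the `budget_chunks` chunks with the highest summed token score.

    Summed-score ranking is an assumption made for this sketch.
    """
    n_chunks = len(scores) // chunk_tokens
    chunk_scores = [
        sum(scores[c * chunk_tokens:(c + 1) * chunk_tokens])
        for c in range(n_chunks)
    ]
    ranked = sorted(range(n_chunks), key=lambda c: chunk_scores[c], reverse=True)
    return sorted(ranked[:budget_chunks])


# Toy importance scores over a 64-token prefix: four scattered important tokens.
scores = [0.0] * 64
for token, score in [(3, 1.0), (20, 0.9), (37, 0.8), (54, 0.7)]:
    scores[token] = score

# Token-level pruning with 16-token I/O blocks: every block is touched.
token_level = read_amplification([3, 20, 37, 54], io_unit_tokens=16)

# Chunk-level pruning where the pruning unit equals the I/O unit (4 tokens).
chunks = select_chunks(scores, chunk_tokens=4, budget_chunks=4)
chunk_tokens = [t for c in chunks for t in range(c * 4, (c + 1) * 4)]
chunk_level = read_amplification(chunk_tokens, io_unit_tokens=4)

print(token_level, chunk_level)  # 16.0 1.0
```

In this toy setting, the token-level selection reads 64 tokens to use 4, whereas the aligned selection reads exactly the 16 tokens it keeps. The trade-off is that chunk-level selection retains some tokens that a token-level pruner would have discarded.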
