Retrieval-Augmented Generation (RAG) systems enhance the performance of large language models (LLMs) by incorporating supplementary retrieved documents, enabling more accurate and context-aware responses. However, integrating these external documents often results in very long input sequences, which significantly increases computation cost during the prefill stage, where key-value (KV) representations are generated for all input tokens. This latency bottleneck becomes especially pronounced in high-throughput serving scenarios. KV-cache reuse offers a promising solution: it stores previously computed KV states for shared input prefixes, thereby avoiding redundant computation across requests that contain overlapping context. Yet the effectiveness of cache reuse is often limited by three practical challenges: low cache hit rates due to naive eviction policies, high CPU-GPU data transfer overhead, and slow SSD I/O when caches spill to storage. To address these issues, we propose PCR, a system designed to maximize KV-cache reuse efficiency through intelligent prefetching and pipelined data movement. Specifically, PCR introduces three key techniques: (1) a prefix-tree caching structure with a look-ahead least-recently-used (LRU) replacement policy that uses pending requests in the scheduler queue to improve cache hit ratios; (2) layer-wise overlapping that pipelines KV-cache loading and GPU computation across CUDA streams to hide communication latency; and (3) queue-based prefetching that proactively loads relevant KV caches from SSD into DRAM before they are needed. Extensive experiments show that PCR outperforms existing KV-cache reuse methods, achieving up to a 2.47x speedup in average time to first token (TTFT).
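To illustrate the idea behind the look-ahead LRU policy in technique (1), the following is a minimal sketch, not PCR's actual implementation. It assumes a flat key-to-payload cache rather than a prefix tree, and the names `LookAheadLRU`, `pending_keys`, and the `doc_*` keys are hypothetical. On eviction, the cache skips entries that requests still waiting in the scheduler queue are known to need, and falls back to plain LRU only when every entry is protected.

```python
from collections import OrderedDict


class LookAheadLRU:
    """LRU cache whose eviction skips entries that queued requests will need.

    Illustrative only: keys stand in for cached prefix chunks, values for
    their KV payloads. A real system would index entries with a prefix tree.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # ordered from least to most recently used

    def get(self, key):
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)  # mark as most recently used
        return self.entries[key]

    def put(self, key, value, pending_keys=frozenset()):
        """Insert an entry; pending_keys are keys needed by queued requests."""
        if key in self.entries:
            self.entries[key] = value
            self.entries.move_to_end(key)
            return
        if len(self.entries) >= self.capacity:
            self._evict(pending_keys)
        self.entries[key] = value

    def _evict(self, pending_keys):
        # Prefer the least recently used entry that no queued request needs.
        victim = next((k for k in self.entries if k not in pending_keys), None)
        if victim is None:
            # Every entry is needed soon: fall back to plain LRU.
            victim = next(iter(self.entries))
        del self.entries[victim]
        return victim


cache = LookAheadLRU(capacity=2)
cache.put("doc_a", "kv_a")
cache.put("doc_b", "kv_b")
# A queued request will reuse doc_a, so doc_b is evicted instead,
# even though doc_a is the least recently used entry.
cache.put("doc_c", "kv_c", pending_keys={"doc_a"})
print(list(cache.entries))
```

Plain LRU would have evicted `doc_a` here and forced a recomputation (or a reload from slower storage) when the queued request arrives; consulting the queue avoids that miss at the cost of one scan over the cached keys per eviction.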