Retrieval-Augmented Generation (RAG) has shown significant improvements in various natural language processing tasks by integrating the strengths of large language models (LLMs) and external knowledge databases. However, RAG introduces long sequence generation, which leads to high computation and memory costs. We propose RAGCache, a novel multilevel dynamic caching system tailored for RAG. Our analysis benchmarks current RAG systems, pinpointing the performance bottleneck (i.e., long sequences due to knowledge injection) and the optimization opportunity (i.e., caching the intermediate states of retrieved knowledge). Based on these insights, we design RAGCache, which organizes the intermediate states of retrieved knowledge in a knowledge tree and caches them in the GPU and host memory hierarchy. RAGCache uses a replacement policy that is aware of LLM inference characteristics and RAG retrieval patterns. It also dynamically overlaps the retrieval and inference steps to minimize the end-to-end latency. We implement RAGCache and evaluate it on vLLM, a state-of-the-art LLM inference system, and Faiss, a state-of-the-art vector database. The experimental results show that RAGCache reduces the time to first token (TTFT) by up to 4x and improves the throughput by up to 2.1x compared to vLLM integrated with Faiss.
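The knowledge-tree idea can be illustrated with a small sketch. It is not RAGCache's implementation: the names `Node` and `KnowledgeTree` are hypothetical, capacities count documents rather than bytes, no real key-value tensors are stored, and plain least-recently-used eviction stands in for the inference-aware replacement policy, which the abstract does not specify. The sketch assumes that a document's intermediate state depends on the documents before it, so cached entries are keyed by ordered document prefixes and only entries without dependants are demoted or dropped.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(eq=False)
class Node:
    """One cached document state, valid only after its ancestors' documents."""
    doc_id: Optional[str]  # None for the root
    parent: Optional["Node"] = field(default=None, repr=False)
    children: Dict[str, "Node"] = field(default_factory=dict)
    tier: str = "gpu"      # "gpu" or "host" (hypothetical tier labels)
    last_used: int = 0


class KnowledgeTree:
    """Prefix tree over ordered document IDs with two cache tiers (sketch)."""

    def __init__(self, gpu_capacity: int, host_capacity: int) -> None:
        self.root = Node(doc_id=None)
        self.gpu_capacity = gpu_capacity
        self.host_capacity = host_capacity
        self.clock = 0  # logical time used for LRU ordering

    def _nodes(self, tier: str) -> List[Node]:
        """Collect all non-root nodes currently placed in the given tier."""
        found, stack = [], list(self.root.children.values())
        while stack:
            node = stack.pop()
            stack.extend(node.children.values())
            if node.tier == tier:
                found.append(node)
        return found

    def lookup(self, doc_ids: List[str]) -> List[Node]:
        """Return nodes for the longest cached prefix of doc_ids."""
        self.clock += 1
        node, hits = self.root, []
        for doc_id in doc_ids:
            child = node.children.get(doc_id)
            if child is None:
                break
            child.last_used = self.clock
            hits.append(child)
            node = child
        return hits

    def insert(self, doc_ids: List[str]) -> None:
        """Cache the states for doc_ids, placing the whole path in the GPU tier."""
        self.clock += 1
        node = self.root
        for doc_id in doc_ids:
            child = node.children.get(doc_id)
            if child is None:
                child = Node(doc_id=doc_id, parent=node)
                node.children[doc_id] = child
            child.tier = "gpu"
            child.last_used = self.clock
            node = child
        self._evict()

    def _evict(self) -> None:
        # Demote the least recently used GPU node that has no GPU children.
        while True:
            gpu = self._nodes("gpu")
            if len(gpu) <= self.gpu_capacity:
                break
            candidates = [n for n in gpu
                          if all(c.tier != "gpu" for c in n.children.values())]
            min(candidates, key=lambda n: n.last_used).tier = "host"
        # Drop the least recently used host node that is a leaf of the tree.
        while True:
            host = self._nodes("host")
            if len(host) <= self.host_capacity:
                break
            candidates = [n for n in host if not n.children]
            victim = min(candidates, key=lambda n: n.last_used)
            del victim.parent.children[victim.doc_id]
```

In this sketch, a request that retrieves documents `["d1", "d2", "d3"]` after `["d1", "d2"]` has been inserted gets a two-node hit from `lookup`, so only the state for `d3` and the user query would still need to be computed.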