Retrieval-Augmented Generation (RAG) has delivered significant improvements across natural language processing tasks by combining the strengths of large language models (LLMs) with external knowledge databases. However, RAG introduces long-sequence generation, which incurs high computation and memory costs. We propose Thoth, a novel multilevel dynamic caching system tailored for RAG. We benchmark current RAG systems and pinpoint the performance bottleneck (long sequences caused by knowledge injection) and the optimization opportunity (caching the intermediate states of retrieved knowledge). Based on these insights, we design Thoth to organize the intermediate states of retrieved knowledge in a knowledge tree and cache them across the GPU and host memory hierarchy. Thoth uses a replacement policy that is aware of LLM inference characteristics and RAG retrieval patterns, and it dynamically overlaps the retrieval and inference steps to minimize end-to-end latency. We implement and evaluate Thoth on vLLM, a state-of-the-art LLM inference system, and Faiss, a state-of-the-art vector database. Experimental results show that Thoth reduces time to first token (TTFT) by up to 4x and improves throughput by up to 2.1x compared to vLLM integrated with Faiss.
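To make the knowledge-tree idea concrete, the following is a minimal sketch of a two-tier prefix cache, not Thoth's actual implementation. It assumes that the cached state of a document depends on the documents that precede it, so each tree path is an ordered sequence of retrieved document IDs. All names (`Node`, `KnowledgeTreeCache`, `lookup`, `insert`) are hypothetical, sizes are abstract units, and plain LRU stands in for the replacement policy, whose details the abstract does not give.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Dict, List, Optional, Set

_clock = count()  # logical clock used for LRU ordering


@dataclass(eq=False)  # identity-based hashing so nodes can be stored in sets
class Node:
    doc_id: Optional[str]
    size: int = 0                  # size of this document's cached state
    tier: Optional[str] = None     # "gpu", "host", or None when not cached
    last_used: int = 0
    parent: Optional["Node"] = None
    children: Dict[str, "Node"] = field(default_factory=dict)


class KnowledgeTreeCache:
    """Hypothetical two-tier cache; LRU is an assumption, not Thoth's policy."""

    def __init__(self, gpu_capacity: int, host_capacity: int):
        self.root = Node(doc_id=None)
        self.capacity = {"gpu": gpu_capacity, "host": host_capacity}
        self.used = {"gpu": 0, "host": 0}

    def _nodes(self):
        stack = list(self.root.children.values())
        while stack:
            node = stack.pop()
            yield node
            stack.extend(node.children.values())

    def _victim(self, tier: str, pinned: Set[Node]) -> Optional[Node]:
        # A GPU node is evictable only if it has no GPU children; a host node
        # only if it has no cached children. This keeps every cached prefix intact.
        blocking = ("gpu",) if tier == "gpu" else ("gpu", "host")
        candidates = [
            n for n in self._nodes()
            if n.tier == tier and n not in pinned
            and not any(c.tier in blocking for c in n.children.values())
        ]
        return min(candidates, key=lambda n: n.last_used, default=None)

    def _set_tier(self, node: Node, tier: Optional[str]) -> None:
        if node.tier is not None:
            self.used[node.tier] -= node.size
        node.tier = tier
        if tier is not None:
            self.used[tier] += node.size

    def _make_room(self, tier: str, size: int, pinned: Set[Node]) -> bool:
        while self.used[tier] + size > self.capacity[tier]:
            victim = self._victim(tier, pinned)
            if victim is None:
                return False
            if tier == "gpu":
                # Demote GPU -> host, evicting from host first if needed.
                if not self._make_room("host", victim.size, pinned):
                    return False
                self._set_tier(victim, "host")
            else:
                self._set_tier(victim, None)  # drop from the cache entirely
        return True

    def lookup(self, doc_ids: List[str]) -> List[str]:
        """Return the tier of each leading document whose state is cached."""
        node, tiers = self.root, []
        for doc_id in doc_ids:
            child = node.children.get(doc_id)
            if child is None or child.tier is None:
                break
            child.last_used = next(_clock)
            tiers.append(child.tier)
            node = child
        return tiers

    def insert(self, doc_ids: List[str], sizes: List[int]) -> None:
        """Cache the state of a retrieved document sequence in GPU memory."""
        node, pinned = self.root, set()
        for doc_id, size in zip(doc_ids, sizes):
            child = node.children.get(doc_id)
            if child is None:
                child = Node(doc_id=doc_id, size=size, parent=node)
                node.children[doc_id] = child
            pinned.add(child)  # never evict the path being inserted
            if child.tier != "gpu":
                if not self._make_room("gpu", child.size, pinned):
                    break
                self._set_tier(child, "gpu")
            child.last_used = next(_clock)
            node = child
```

In this sketch, a request that retrieves `["d1", "d2"]` after `["d1", "d3"]` was cached can reuse the state of `d1` and only needs to compute `d2`. When the GPU tier is full, the least recently used evictable node is demoted to the host tier rather than discarded, which mirrors the GPU and host hierarchy described above in a simplified form.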