Retrieval-augmented generation (RAG) allows large language models (LLMs) to generate better responses with external knowledge, but using more external knowledge often improves generation quality at the expense of response delay. Prior work either reduces response delay (through better scheduling of RAG queries) or strives to maximize quality (by tuning the RAG workflow), but neither line of work optimizes the tradeoff between the delay and quality of RAG responses. This paper presents RAGServe, the first RAG system that jointly schedules queries and adapts the key RAG configurations of each query, such as the number of retrieved text chunks and the synthesis method, in order to balance quality optimization and response delay reduction. Using four popular RAG-QA datasets, we show that, compared with state-of-the-art RAG optimization schemes, RAGServe reduces generation latency by $1.64\times$ to $2.54\times$ without sacrificing generation quality.