Retrieval-augmented generation (RAG) extends large language models (LLMs) with external data sources to enhance factual correctness and domain coverage. Modern RAG pipelines rely on large datastores, leading to system challenges in latency-sensitive deployments, especially when limited GPU memory is available. To address these challenges, we propose TeleRAG, an efficient inference system that reduces RAG latency with minimal GPU memory requirements. The core innovation of TeleRAG is lookahead retrieval, a prefetching mechanism that anticipates required data and transfers it from CPU to GPU in parallel with LLM generation. By leveraging the modularity of RAG pipelines, the inverted file index (IVF) search algorithm, and similarities between queries, TeleRAG optimally overlaps data movement and computation. Experimental results show that TeleRAG reduces end-to-end RAG inference latency by up to 1.72x on average compared to state-of-the-art systems, enabling faster, more memory-efficient deployments of advanced RAG applications.
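The lookahead-retrieval idea described above can be sketched in a few lines: while the LLM is still generating, a background thread prefetches the IVF clusters that an earlier (draft) query suggests will be probed, so the CPU-to-GPU transfer overlaps with generation instead of following it. The sketch below is a minimal, hypothetical illustration: the datastore, the cluster predictor, and the simulated transfer/generation delays are all stand-in assumptions, not TeleRAG's actual implementation.

```python
import threading
import time

# Stand-in "CPU-resident" IVF datastore: 8 clusters of 4 vectors each.
CPU_DATASTORE = {cid: [f"vec-{cid}-{i}" for i in range(4)] for cid in range(8)}

def predict_clusters(draft_query_vec, nprobe=3):
    # Hypothetical predictor: in a real IVF index this would score the query
    # against cluster centroids; here we just derive nprobe cluster ids
    # deterministically from the draft query.
    base = sum(draft_query_vec) % len(CPU_DATASTORE)
    return [(base + k) % len(CPU_DATASTORE) for k in range(nprobe)]

def prefetch(cluster_ids, gpu_cache):
    # Simulated CPU->GPU copy of the predicted clusters, run concurrently
    # with generation (the sleep stands in for PCIe transfer latency).
    for cid in cluster_ids:
        time.sleep(0.01)
        gpu_cache[cid] = CPU_DATASTORE[cid]

def generate_and_retrieve(draft_query_vec):
    gpu_cache = {}
    predicted = predict_clusters(draft_query_vec)
    t = threading.Thread(target=prefetch, args=(predicted, gpu_cache))
    t.start()            # prefetch overlaps with "generation" below
    time.sleep(0.05)     # stand-in for the LLM generation phase
    t.join()             # generation done; predicted clusters are resident
    # The final retrieval step only touches clusters already in the GPU
    # cache; mispredicted clusters would be fetched on demand.
    return {cid: gpu_cache[cid] for cid in predicted if cid in gpu_cache}

hits = generate_and_retrieve([1, 2, 3])
print(sorted(hits))
```

In a real deployment the prefetch would issue asynchronous copies on a separate CUDA stream rather than a Python thread, and the predictor would exploit the similarity between the pre-retrieval draft query and the final rewritten query, but the overlap structure is the same.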