Retrieval-augmented generation (RAG) extends large language models (LLMs) with external data sources to improve factual correctness and domain coverage. Modern RAG pipelines rely on large datastores, which creates a significant systems challenge: achieving high throughput and low latency is difficult, especially when GPU memory is limited. To address this challenge, we propose TeleRAG, an efficient inference system that reduces latency and improves throughput with minimal GPU memory requirements. The core innovation of TeleRAG is lookahead retrieval, a prefetching mechanism that predicts the data a query will need and transfers it from CPU to GPU in parallel with LLM generation. In addition, TeleRAG adopts a prefetching scheduler and a cache-aware scheduler to support efficient multi-GPU inference with minimal overhead. Evaluations show that TeleRAG achieves up to a 1.53x reduction in average end-to-end latency (single-query) and 1.83x higher average throughput (batched), as well as good throughput scalability. These results confirm the practical utility of TeleRAG for faster and more memory-efficient deployments of RAG applications.
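The idea behind lookahead retrieval, overlapping a predicted data transfer with LLM generation, can be illustrated with a minimal sketch. It is not TeleRAG's implementation. The partitioning of the datastore into clusters, the word-overlap predictor, and all names (`CPU_STORE`, `predict_clusters`, `prefetch`, `generate`, `answer`) are hypothetical. A plain dictionary stands in for GPU memory and a background thread stands in for an asynchronous CPU-to-GPU copy.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical toy datastore held in "CPU memory": cluster id -> documents.
CPU_STORE = {
    0: ["doc about GPUs", "doc about CUDA"],
    1: ["doc about cooking"],
    2: ["doc about retrieval", "doc about indexing"],
}


def predict_clusters(query, budget):
    """Assumed predictor: rank clusters by word overlap with the query."""
    words = set(query.lower().split())

    def score(cid):
        return sum(len(words & set(d.lower().split())) for d in CPU_STORE[cid])

    # Highest score first; ties broken by cluster id for determinism.
    ranked = sorted(CPU_STORE, key=lambda cid: (-score(cid), cid))
    return ranked[:budget]


def prefetch(cluster_ids, gpu_cache):
    """Stand-in for an asynchronous CPU-to-GPU copy of the given clusters."""
    for cid in cluster_ids:
        gpu_cache[cid] = list(CPU_STORE[cid])


def generate(query):
    """Stand-in for the LLM step that produces the final retrieval query."""
    return query + " indexing"


def answer(query, budget=2):
    gpu_cache = {}  # stands in for limited GPU memory
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Start the transfer using the *input* query as the prediction signal.
        transfer = pool.submit(prefetch, predict_clusters(query, budget), gpu_cache)
        # Generation runs on the main thread while the transfer is in flight.
        rewritten = generate(query)
        transfer.result()  # wait for the prefetch to finish before retrieval
    # Retrieval proper: clusters already cached are hits, the rest are misses
    # that would have to be served from the CPU side.
    needed = predict_clusters(rewritten, budget)
    hits = [c for c in needed if c in gpu_cache]
    misses = [c for c in needed if c not in gpu_cache]
    return rewritten, hits, misses


if __name__ == "__main__":
    print(answer("retrieval on GPUs", budget=2))
    print(answer("retrieval on GPUs", budget=1))
```

With a budget of two clusters, the prefetch in this toy setup covers everything the final query needs. With a budget of one, the prediction made from the input query misses the cluster that the rewritten query ranks first. That illustrates the trade-off between GPU memory spent on prefetching and the fraction of retrieval served from the GPU.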