Retrieval-augmented models show promise in enhancing traditional language models by improving their contextual understanding, integrating private data, and reducing hallucination. However, the processing time that retrieval-augmented large language models require makes them difficult to apply to tasks that need real-time responses, such as composition assistance. To overcome this limitation, we propose the Hybrid Retrieval-Augmented Generation (HybridRAG) framework, which combines a client model with a cloud model. HybridRAG incorporates retrieval-augmented memory generated asynchronously by a Large Language Model (LLM) in the cloud. By integrating this retrieval-augmented memory, the client model can generate highly effective responses, benefiting from the LLM's capabilities. Furthermore, because the memory is integrated asynchronously, the client model can respond to user requests in real time without waiting for memory synchronization from the cloud. Our experiments on Wikitext and Pile subsets show that HybridRAG achieves lower latency than a cloud-based retrieval-augmented LLM while outperforming client-only models in utility.
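
The asynchronous memory integration described above can be illustrated with a minimal sketch. It is not the HybridRAG implementation: the names `HybridClient` and `cloud_memory_job`, the string-based memory format, and the simulated 0.2 s cloud latency are all hypothetical, and the "client model" is a stand-in that only reports which memory it conditioned on. The point is the control flow: the client answers from whatever memory it currently holds, while a background job refreshes that memory.

```python
import threading
import time


class HybridClient:
    """Hypothetical client that answers from local memory without blocking."""

    def __init__(self):
        self._memory = []              # latest memory received from the cloud
        self._lock = threading.Lock()  # protects _memory across threads

    def update_memory(self, new_memory):
        # Called by the background job once the cloud has produced new memory.
        with self._lock:
            self._memory = list(new_memory)

    def complete(self, prompt):
        # Read whatever memory is available right now; never wait for the cloud.
        with self._lock:
            memory = list(self._memory)
        # Stand-in for the small client model: report what it conditioned on.
        return {"prompt": prompt, "memory_used": memory}


def cloud_memory_job(client, context, delay_s):
    """Stand-in for cloud-side retrieval and LLM memory generation."""
    time.sleep(delay_s)  # simulated network and LLM latency (assumed value)
    client.update_memory([f"key point about: {context}"])


client = HybridClient()
worker = threading.Thread(
    target=cloud_memory_job, args=(client, "solar panels", 0.2)
)
worker.start()

# Served immediately; the cloud memory has most likely not arrived yet.
first = client.complete("Solar panels are")

# Joined here only so the example is deterministic; a real client would not wait.
worker.join()

# Served after the update, so it conditions on the cloud-generated memory.
second = client.complete("Solar panels are")
```

In this sketch a request never blocks on the cloud: `complete` holds the lock only long enough to copy the current memory, so the response time is bounded by the client model alone, and a stale or empty memory costs quality rather than latency.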