Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by integrating external knowledge, leading to improved accuracy and relevance. However, scaling RAG pipelines remains computationally expensive as retrieval sizes grow. To address this, we introduce OSCAR, a novel query-dependent online soft compression method that reduces computational overhead while preserving performance. Unlike traditional hard compression methods, which shorten retrieved texts, or soft compression approaches, which map documents to continuous embeddings offline, OSCAR dynamically compresses retrieved information at inference time, eliminating storage overhead and enabling higher compression rates. Additionally, we extend OSCAR to simultaneously perform reranking, further optimizing the efficiency of the RAG pipeline. Our experiments demonstrate state-of-the-art performance with a 2-5x speed-up in inference and minimal to no loss in accuracy for LLMs ranging from 1B to 24B parameters. The models are available at: https://huggingface.co/collections/naver/oscar-67d446a8e3a2551f57464295.
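To make the idea of query-dependent online soft compression concrete, here is a minimal toy sketch. It reduces a document's token embeddings to k continuous vectors at inference time, weighting tokens by similarity to the query. The weighting scheme, chunking, and all names (`soft_compress`, `doc_embs`, `query_emb`) are hypothetical illustrations, not OSCAR's actual trained compressor architecture.

```python
import numpy as np

def soft_compress(doc_embs: np.ndarray, query_emb: np.ndarray, k: int = 4) -> np.ndarray:
    """Toy query-dependent soft compression (hypothetical scheme):
    map a document's n token embeddings (n x d) to k continuous
    vectors, emphasizing tokens most similar to the query."""
    # Query-dependent weights over document tokens (softmax of dot products).
    scores = doc_embs @ query_emb
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Split tokens into k contiguous chunks; weighted-average each chunk.
    chunk_ids = np.array_split(np.arange(len(doc_embs)), k)
    out = []
    for idx in chunk_ids:
        w = weights[idx]
        out.append((w[:, None] * doc_embs[idx]).sum(axis=0) / w.sum())
    # The k vectors would replace the n document tokens in the LLM prompt.
    return np.stack(out)

rng = np.random.default_rng(0)
doc = rng.normal(size=(128, 64))   # 128 token embeddings of dimension 64
query = rng.normal(size=64)        # a query embedding
compressed = soft_compress(doc, query, k=4)
print(compressed.shape)            # (4, 64): 32x fewer context vectors
```

Because the compression happens per query at inference time, nothing is precomputed or stored per document, which is the contrast the abstract draws with offline soft-compression approaches; OSCAR itself learns the compression with a trained model rather than this fixed pooling rule.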