Retrieval-Augmented Generation (RAG) overcomes the limited knowledge of LLMs by extending the input with external information. As a consequence, the contextual input to the model becomes much longer, which slows down decoding and directly translates into longer wait times for users. We address this challenge by presenting COCOM, an effective context compression method that reduces long contexts to only a handful of Context Embeddings, speeding up generation by a large margin. Our method allows for different compression rates, trading off decoding time against answer quality. Compared to earlier methods, COCOM handles multiple contexts more effectively, significantly reducing decoding time for long inputs. Our method achieves a speed-up of up to 5.69$\times$ while attaining higher performance than existing efficient context compression methods.