Retrieval-Augmented Language Modeling (RALM), which integrates a large language model (LLM) with relevant documents from an external corpus, is a proven method for enabling the LLM to generate information beyond the scope of its pre-training corpus. Previous work utilizes retrieved content by simply prepending it to the input, which incurs a high runtime cost and degrades the inference efficiency of the LLM because the Key-Value (KV) cache cannot be used efficiently. In this paper, we propose \textsc{FlashBack}, a modular RALM designed to improve inference efficiency with an appending context pattern while maintaining decent performance after specific fine-tuning, without heavily disrupting the knowledge integrity of the LLM. Instead of prepending retrieved documents, \textsc{FlashBack} appends them at the end of the context so that the KV cache can be used efficiently. Our experiments show that the inference speed of \textsc{FlashBack} is up to $4\times$ faster than the prepending method on a 7B LLM (Llama 2). By avoiding unnecessary re-computation, \textsc{FlashBack} achieves significantly faster inference, and this efficiency gain will substantially reduce inference cost. Our code will be publicly available.
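To illustrate why the position of the retrieved document matters for the KV cache, the following is a minimal token-counting sketch, not the authors' implementation. It assumes a causal model in which a token's cached keys and values depend only on the tokens before it, and a hypothetical setup in which a fresh document is retrieved every `stride` generated tokens. The function name `encoded_tokens` and all parameter values are illustrative assumptions. The sketch counts tokens that must be run through the model, which is only a rough proxy for cost and is not a measurement of inference speed.

```python
def encoded_tokens(context_len, doc_len, stride, rounds, mode):
    """Count tokens that must be encoded (not served from the KV cache).

    Hypothetical model: each round retrieves a new document of `doc_len`
    tokens and then generates `stride` tokens. `mode` is "prepend" or
    "append" and says where the document is placed relative to the context.
    """
    total = 0
    cached = 0          # leading context tokens whose KV entries are still valid
    ctx = context_len   # current context length (prompt + generated so far)
    for _ in range(rounds):
        if mode == "prepend":
            # A new document at the front changes the prefix of every
            # context token, so no cached entry can be reused.
            total += doc_len + ctx
        elif mode == "append":
            # The context precedes the document, so its KV entries do not
            # depend on it. Only the uncached tail of the context and the
            # new document are encoded (an assumption of this sketch).
            total += (ctx - cached) + doc_len
            cached = ctx
        else:
            raise ValueError(f"unknown mode: {mode}")
        # Each generated token is encoded once while decoding.
        total += stride
        ctx += stride
    return total


if __name__ == "__main__":
    for mode in ("prepend", "append"):
        print(mode, encoded_tokens(512, 128, 16, 4, mode))
```

Under these assumptions, the prepending pattern re-encodes the whole context on every retrieval, while the appending pattern pays for the full context only once. The gap widens as the context grows or as retrieval becomes more frequent.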