Retrieval-Augmented Language Modeling (RALM), which integrates large language models (LLMs) with relevant documents retrieved from an external corpus, is a proven method for enabling LLMs to generate information beyond the scope of their pre-training corpora. Prior work utilizes retrieved content by simply prepending it to the input, which incurs substantial runtime overhead and degrades the inference efficiency of the LLM because the Key-Value (KV) cache cannot be used effectively. In this paper, we propose \textsc{FlashBack}, a modular RALM designed to improve the inference efficiency of RALM with an appending-context pattern, while maintaining decent performance after specific fine-tuning and without heavily disrupting the knowledge integrity of the LLM. \textsc{FlashBack} appends retrieved documents to the end of the context, rather than prepending them, so that the KV cache can be utilized efficiently. Our experiments show that the inference speed of \textsc{FlashBack} is up to $4\times$ faster than the prepending method on a 7B LLM (Llama 2). By bypassing unnecessary re-computation, \textsc{FlashBack} achieves significantly faster inference, and this heightened efficiency substantially reduces inference cost. Our code will be publicly available.
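To make the caching contrast concrete, the following is a minimal Python sketch (our illustration, not the authors' implementation): it models the KV cache as the longest token prefix that can be reused verbatim, and the token lists and the helper `reusable_prefix_len` are illustrative assumptions. It shows why swapping retrieved documents invalidates the entire cache under prepending, yet leaves the query's cached entries intact under appending.

```python
# Minimal sketch: the "KV cache" is modeled as the longest previously
# computed token prefix whose entries can be reused without re-computation.
# All names and token lists here are hypothetical, for illustration only.

def reusable_prefix_len(cached_tokens, new_tokens):
    """Number of leading tokens whose KV entries can be reused."""
    n = 0
    for a, b in zip(cached_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n

query = ["what", "is", "RALM", "?"]
doc_v1 = ["<doc1>", "retrieval", "background", "</doc1>"]
doc_v2 = ["<doc2>", "different", "evidence", "</doc2>"]

# Prepending: retrieved content comes first, so swapping documents
# changes the very start of the sequence and invalidates the cache.
prepend_v1 = doc_v1 + query
prepend_v2 = doc_v2 + query
print(reusable_prefix_len(prepend_v1, prepend_v2))  # 0 -> full re-computation

# Appending (FlashBack's pattern): the context prefix is stable, so its
# KV entries survive a document swap; only the appended suffix is recomputed.
append_v1 = query + doc_v1
append_v2 = query + doc_v2
print(reusable_prefix_len(append_v1, append_v2))  # 4 -> query KV cache reused
```

Under this simplified model, the re-computation saved per retrieval update grows with the length of the stable context, which is consistent with the speedup claimed above.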