Retrieval-Augmented Language Modeling (RALM), which integrates large language models (LLMs) with relevant documents from an external corpus, is a proven method for enabling an LLM to generate information beyond the scope of its pre-training corpus. Previous work uses retrieved content by simply prepending it to the input, which incurs a high runtime cost and degrades the inference efficiency of LLMs because the Key-Value (KV) cache cannot be used efficiently. In this paper, we propose FlashBack, a modular RALM designed to improve inference efficiency through an appending context pattern while maintaining decent performance after fine-tuning with Low-Rank Adaptation (LoRA). Instead of prepending retrieved documents, FlashBack appends them at the end of the context so that the KV cache can be used efficiently. We also introduce Marking Tokens, two special prompt tokens that mark the boundary of the appended context during fine-tuning. Our experiments on generation quality show that FlashBack maintains decent generation quality as measured by perplexity. In the runtime test, the inference speed of FlashBack is up to $4\times$ faster than that of its prepending counterpart on a 7B LLM (Llama 2). By bypassing unnecessary re-computation, FlashBack achieves significantly faster inference, and this efficiency will substantially reduce inference cost.
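To make the KV-cache argument concrete, the following is a minimal sketch of a toy cost model that counts how many tokens must be pushed through the model under each context pattern. It is not the FlashBack implementation or its runtime test: the function name `tokens_encoded`, the parameters `prompt_len`, `doc_len`, `stride`, and `new_tokens`, and the cost model itself are all assumptions made for illustration. The sketch assumes that a new document is retrieved every `stride` generated tokens, that a prepended document invalidates the cache of every token after it, and that an appended document leaves the cache of the preceding context intact.

```python
def tokens_encoded(prompt_len, doc_len, stride, new_tokens, mode):
    """Count tokens passed through the model under a toy cost model.

    Hypothetical illustration only. Assumptions:
      * a fresh document of `doc_len` tokens is retrieved every `stride`
        generated tokens;
      * "prepend": the document sits before the context, so swapping it
        invalidates the KV cache of the whole context;
      * "append": the document sits after the context, so the context's
        KV cache survives and only uncached tokens are encoded.
    """
    if mode not in ("prepend", "append"):
        raise ValueError("mode must be 'prepend' or 'append'")

    total = 0                  # tokens encoded so far
    context_len = prompt_len   # prompt plus everything generated so far
    cached_context = 0         # context tokens with a valid KV cache
    generated = 0

    while generated < new_tokens:
        if mode == "prepend":
            # New document in front: re-encode document and full context.
            total += doc_len + context_len
        else:
            # New document at the end: encode only the context tokens that
            # are not cached yet, then the document itself.
            total += (context_len - cached_context) + doc_len
            cached_context = context_len

        # Generate up to `stride` tokens, one cached forward step each.
        step = min(stride, new_tokens - generated)
        total += step
        generated += step
        context_len += step

    return total


if __name__ == "__main__":
    settings = dict(prompt_len=512, doc_len=128, stride=16, new_tokens=64)
    for mode in ("prepend", "append"):
        print(mode, tokens_encoded(mode=mode, **settings))
```

Under this model the prepending cost grows with the full context length at every retrieval step, while the appending cost per step depends only on `doc_len` and `stride`, so the gap widens as the context gets longer. The counts are not measurements and should not be read as a prediction of the reported speedup.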