We introduce Block-Attention, an attention mechanism designed to address the increased inference latency and cost in Retrieval-Augmented Generation (RAG) scenarios. Traditional approaches encode the entire context as a single sequence. Instead, Block-Attention divides the retrieved documents into discrete blocks, and each block independently computes its key-value (KV) states, except for the final block. In RAG scenarios, by defining each passage as a block, Block-Attention enables us to reuse the KV states of passages that have been seen before, thereby significantly reducing latency and computation overhead during inference. The implementation of Block-Attention involves block segmentation, position re-encoding, and fine-tuning the LLM to adapt to the Block-Attention mechanism. Experiments on four RAG benchmarks demonstrate that after block fine-tuning, the Block-Attention model achieves performance comparable to self-attention models (68.4\% vs 67.9\% on Llama3) or even superior performance (62.8\% vs 59.6\% on Mistral). Notably, Block-Attention reduces the time to first token (TTFT) and floating point operations (FLOPs) to a very low level: it takes only 45 ms to output the first token for an input sequence with a total length of 32K. Compared with self-attention models, the time consumption and the corresponding FLOPs are reduced by 98.7\% and 99.8\%, respectively.
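The caching scheme the abstract describes can be illustrated with a minimal sketch. This is not the authors' implementation: `encode_block` is a hypothetical stand-in for a transformer layer's KV computation, and the "KV state" here is just a `(token, position)` pair so that the position re-encoding step is visible. All but the final block are encoded independently (and cached); the final block is encoded fresh at its global offset so it can attend to everything before it.

```python
# Minimal sketch of block-wise KV reuse with position re-encoding.
# encode_block is a hypothetical stand-in for real KV computation.

def encode_block(tokens):
    # Each block is encoded independently, with positions starting at 0.
    return [(tok, pos) for pos, tok in enumerate(tokens)]

kv_cache = {}  # maps a block's token tuple -> its locally-encoded KV states

def block_attention_prefill(blocks):
    """Reuse cached KV states for every block except the last,
    shifting each block's local positions to its global offset."""
    states, offset = [], 0
    for block in blocks[:-1]:               # independent blocks (e.g., passages)
        key = tuple(block)
        if key not in kv_cache:             # cache miss: encode once, reuse later
            kv_cache[key] = encode_block(block)
        # Position re-encoding: local position -> global position
        states += [(tok, pos + offset) for tok, pos in kv_cache[key]]
        offset += len(block)
    # Final block (e.g., the user query) is always encoded fresh at its offset.
    states += [(tok, pos + offset) for tok, pos in encode_block(blocks[-1])]
    return states
```

On a repeated query, every passage block hits the cache, so only the final block's KV states must be computed — which is where the TTFT and FLOPs savings come from.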