Large language models (LLMs) are essential in natural language processing but often struggle with inference speed and computational efficiency, limiting real-time deployment. The key-value (KV) cache mechanism reduces computational overhead in transformer models, but challenges in maintaining contextual understanding remain. In this paper, we propose BUZZ, a novel KV caching algorithm that leverages structured contextual information to minimize cache memory usage while enhancing inference speed. BUZZ employs a beehive-structured sparse cache, incorporating a sliding window to capture recent information and dynamically segmenting historical tokens into chunks to prioritize important tokens in local neighborhoods. We evaluate BUZZ on four real-world datasets: CNN/Daily Mail, XSUM, Wikitext, and 10-QA. Our results demonstrate that BUZZ (1) reduces cache memory usage by $\textbf{2.5}\times$ in LLM inference while maintaining over 99\% accuracy in long-text summarization, and (2) surpasses state-of-the-art performance in multi-document question answering by $\textbf{7.69}\%$ under the same memory limit, where full-cache methods encounter out-of-memory issues. Additionally, BUZZ achieves significant inference speedup with $\log{n}$ time complexity. The code is available at https://github.com/JunqiZhao888/buzz-llm.
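The cache policy sketched in the abstract — a sliding window over recent tokens plus chunked historical tokens from which the locally most important token is retained — can be illustrated with a minimal sketch. The function name, chunk size, and the per-token importance scores are illustrative assumptions, not the paper's exact implementation.

```python
# Hypothetical sketch of a BUZZ-style KV cache selection policy:
# always keep a sliding window of the most recent tokens, and segment
# older (historical) tokens into fixed-size chunks, retaining only the
# highest-scoring token from each chunk's local neighborhood.
# Scores stand in for an importance signal such as accumulated attention.

def buzz_keep_indices(scores, window=4, chunk=3):
    """Return indices of tokens to keep in the KV cache.

    scores : per-token importance, ordered oldest first
    window : number of most recent tokens always kept
    chunk  : size of each historical segment; one token kept per segment
    """
    n = len(scores)
    history_end = max(0, n - window)
    recent = list(range(history_end, n))          # sliding window
    kept = []
    for start in range(0, history_end, chunk):    # segment history into chunks
        seg = range(start, min(start + chunk, history_end))
        # keep the locally most important token in this chunk
        kept.append(max(seg, key=lambda i: scores[i]))
    return kept + recent
```

Because each evicted chunk contributes exactly one surviving token, the cache size grows far more slowly than the sequence length, which is the source of the memory savings the abstract reports.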