Large Language Models (LLMs), epitomized by ChatGPT's release in late 2022, have revolutionized various industries with their advanced language comprehension. However, their efficiency is challenged by the Transformer architecture's difficulty in handling long texts. The KV Cache has emerged as a pivotal solution, reducing the per-token generation cost from quadratic to linear in sequence length, albeit at the price of GPU memory overhead that grows with conversation length. As the LLM community and academia have developed, a variety of KV Cache compression methods have been proposed. In this review, we dissect the properties of the KV Cache and elaborate on the methods currently used to optimize the KV Cache space usage of LLMs. These methods span the pre-training, deployment, and inference phases, and we summarize their commonalities and differences. Additionally, we list metrics for evaluating the long-text capabilities of large language models, from both efficiency and capability perspectives. Our review thus sheds light on the evolving landscape of LLM optimization, offering insights into future advancements in this dynamic field. Links to the papers mentioned in this review can be found in our GitHub repo: https://github.com/zcli-charlie/Awesome-KV-Cache.
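To make the complexity claim concrete, the following is a minimal NumPy sketch (not taken from any surveyed method) of single-head autoregressive decoding with a KV Cache: keys and values for past tokens are stored once, so each new token computes only one new key/value pair and one attention row, instead of recomputing attention over the whole prefix. The function names (`attention`, `decode_with_kv_cache`) and the toy recurrence that feeds each output back as the next input embedding are illustrative assumptions, not part of any real model.

```python
import numpy as np

def attention(q, K, V):
    # Scaled dot-product attention for a single query vector q
    # against all cached keys K and values V.
    scores = q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

def decode_with_kv_cache(x, Wq, Wk, Wv, steps):
    # x: (seq_len, d) prompt embeddings.
    # The cache holds K and V for every token seen so far, so each
    # decoding step does O(seq_len) work instead of O(seq_len^2);
    # the cost is memory that grows linearly with the sequence.
    K_cache = x @ Wk
    V_cache = x @ Wv
    outputs = []
    h = x[-1]  # toy stand-in for the current hidden state
    for _ in range(steps):
        q = h @ Wq
        out = attention(q, K_cache, V_cache)
        outputs.append(out)
        h = out  # toy recurrence: output becomes the next "token"
        # Append only the NEW token's key/value; old entries are reused.
        K_cache = np.vstack([K_cache, h @ Wk])
        V_cache = np.vstack([V_cache, h @ Wv])
    return outputs
```

A cache-free decoder would rebuild `K` and `V` from the full prefix at every step and produce identical outputs, which is exactly the redundancy the cache removes; the compression methods this review surveys then target the resulting memory growth of `K_cache` and `V_cache`.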