Large Language Models (LLMs) have shown remarkable comprehension abilities, but their GPU memory usage during inference hinders scalability for real-time applications such as chatbots. To accelerate inference, the computed keys and values (the KV cache) are stored in GPU memory. Existing methods compress the KV cache to reduce memory by pruning the pre-computed KV cache. However, they neglect the dependency between layers and the huge memory consumption of pre-computation. Investigating these deficiencies, we find that the number of crucial keys and values that influence future generations decreases layer by layer, and that these can be extracted through the consistency of attention weights. Based on these findings, we propose PyramidInfer, a method that compresses the KV cache by retaining crucial context layer by layer. PyramidInfer saves significant memory by computing fewer keys and values without sacrificing performance. Experimental results show that PyramidInfer improves throughput by 2.2x compared to Accelerate, with over 54% GPU memory reduction in the KV cache.
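
The two observations above (crucial keys and values can be identified by consistent attention weights, and fewer of them matter in deeper layers) can be illustrated with a minimal sketch. This is not the PyramidInfer implementation: the voting rule, the linear retention schedule, and the names `top_indices`, `consistent_positions`, `layer_budgets`, `k`, `min_votes`, `first_ratio`, and `last_ratio` are all assumptions made for illustration.

```python
def top_indices(weights, k):
    """Return the set of the k context positions with the largest attention weight."""
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return set(order[:k])


def consistent_positions(recent_attn, k, min_votes):
    """Keep context positions that rank in the top-k for at least `min_votes`
    of the recent queries. This voting rule is an assumed stand-in for
    'consistency in attention weights', not the paper's exact criterion."""
    votes = {}
    for row in recent_attn:
        for i in top_indices(row, k):
            votes[i] = votes.get(i, 0) + 1
    return sorted(i for i, v in votes.items() if v >= min_votes)


def layer_budgets(context_len, num_layers, first_ratio=1.0, last_ratio=0.4):
    """Hypothetical linearly decreasing schedule: deeper layers retain fewer
    keys and values. The ratios are illustrative, not taken from the paper."""
    budgets = []
    for layer in range(num_layers):
        t = layer / max(num_layers - 1, 1)
        ratio = first_ratio + t * (last_ratio - first_ratio)
        budgets.append(max(1, round(context_len * ratio)))
    return budgets


# Attention weights of three recent queries over five context positions (toy data).
recent_attn = [
    [0.50, 0.10, 0.30, 0.05, 0.05],
    [0.40, 0.05, 0.35, 0.10, 0.10],
    [0.45, 0.30, 0.05, 0.10, 0.10],
]
kept = consistent_positions(recent_attn, k=2, min_votes=2)
budgets = layer_budgets(context_len=10, num_layers=4)
print(kept, budgets)
```

In this toy setup, a position survives only if most recent queries attend to it strongly, and the per-layer budget shrinks with depth, which gives the retained context its pyramid shape.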