Large language models (LLMs) have sparked a new wave of exciting AI applications. Hosting these models at scale requires significant memory resources. One crucial memory bottleneck in deployment stems from the context window. It is commonly recognized that model weights are memory hungry; however, the key-value embeddings stored during generation (the KV cache) can easily surpass the model in size. The enormous size of the KV cache constrains the inference batch size, which is crucial for high-throughput inference workloads. Inspired by an interesting observation about attention scores, we hypothesize the persistence of importance: only pivotal tokens, which had a substantial influence at one step, will significantly influence future generations. Based on our empirical verification and theoretical analysis of this hypothesis, we propose Scissorhands, a system that maintains the memory usage of the KV cache at a fixed budget without finetuning the model. In essence, Scissorhands manages the KV cache by storing the pivotal tokens with a higher probability. We validate that Scissorhands reduces the inference memory usage of the KV cache by up to 5X without compromising model quality. We further demonstrate that Scissorhands can be combined with 4-bit quantization, traditionally used to compress model weights, to achieve up to 20X compression.
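To make the memory argument concrete, the following is a minimal Python sketch, not the Scissorhands implementation. The function names `kv_cache_bytes` and `evict_to_budget` are hypothetical, as is the model configuration (96 layers, hidden size 12288, a GPT-3-scale shape chosen only for illustration). The eviction rule is a deliberately simplified, deterministic stand-in: it keeps the most recent tokens plus the highest-scoring older ones, whereas the abstract describes retaining pivotal tokens with higher probability. The importance scores are assumed inputs, for example accumulated attention weights.

```python
def kv_cache_bytes(num_layers, hidden_size, seq_len, batch_size, bytes_per_value=2.0):
    """Bytes needed to cache keys and values for every token in a batch.

    bytes_per_value is 2.0 for fp16 and 0.5 for 4-bit quantization.
    """
    # Factor 2: one key vector and one value vector per token per layer.
    return 2 * num_layers * hidden_size * seq_len * batch_size * bytes_per_value


def evict_to_budget(scores, budget, recent_window):
    """Return the sorted indices of cached tokens to keep under a fixed budget.

    scores[i] is an assumed importance score for cached token i.
    This is a simplified top-k rule, not the paper's probabilistic policy.
    """
    n = len(scores)
    if n <= budget:
        return list(range(n))
    recent_window = min(recent_window, budget)
    # Always keep the most recent tokens.
    recent = list(range(n - recent_window, n))
    older = range(n - recent_window)
    # Fill the remaining slots with the highest-scoring older tokens.
    slots = budget - recent_window
    pivotal = sorted(older, key=lambda i: scores[i], reverse=True)[:slots]
    return sorted(pivotal + recent)


if __name__ == "__main__":
    # Assumed configuration: 96 layers, hidden size 12288, fp16 values.
    cache = kv_cache_bytes(96, 12288, seq_len=2048, batch_size=128)
    weights = 175e9 * 2  # 175B parameters stored in fp16
    print(f"KV cache: {cache / 1e9:.0f} GB, weights: {weights / 1e9:.0f} GB")

    # A 5X smaller token budget combined with 4-bit values (4X fewer bytes).
    full = kv_cache_bytes(1, 1, seq_len=100, batch_size=1, bytes_per_value=2.0)
    small = kv_cache_bytes(1, 1, seq_len=20, batch_size=1, bytes_per_value=0.5)
    print(f"combined compression: {full / small:.0f}X")

    print(evict_to_budget([0.9, 0.1, 0.8, 0.2, 0.3, 0.4], budget=4, recent_window=2))
```

Under these assumed numbers the fp16 KV cache at batch size 128 and sequence length 2048 is several times larger than the fp16 weights, which illustrates why the cache limits batch size. The second calculation shows how a 5X reduction in cached tokens and a 4X reduction in bytes per value multiply to 20X.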