Large language models (LLMs) have sparked a new wave of exciting AI applications. Hosting these models at scale requires significant memory resources. One crucial memory bottleneck in deployment stems from the context window. Model weights are commonly recognized as memory hungry; however, the key-value embeddings stored during generation (the KV cache) can easily surpass the model itself in size. The enormous size of the KV cache constrains the inference batch size, which is crucial for high-throughput inference workloads. Inspired by an interesting observation about attention scores, we hypothesize the persistence of importance: only pivotal tokens, which had a substantial influence at one step, will significantly influence future generations. Based on our empirical verification and theoretical analysis of this hypothesis, we propose Scissorhands, a system that keeps the memory usage of the KV cache within a fixed budget without finetuning the model. In essence, Scissorhands manages the KV cache by storing the pivotal tokens with a higher probability. We validate that Scissorhands reduces the inference memory usage of the KV cache by up to 5X without compromising model quality. We further demonstrate that Scissorhands can be combined with 4-bit quantization, traditionally used to compress model weights, to achieve up to 20X compression.
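To make the fixed-budget idea concrete, the following is a minimal sketch of a budgeted KV cache for a single attention head. It is an illustration under stated assumptions, not the Scissorhands algorithm itself: the class name `BudgetedKVCache`, the `recent` parameter, and the use of accumulated attention weight as the importance score are all hypothetical choices, and eviction here is deterministic (drop the lowest-scoring token), whereas the abstract describes retaining pivotal tokens with a higher probability.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


class BudgetedKVCache:
    """Toy single-head KV cache that never holds more than `budget` tokens.

    Assumption (not from the paper): a token's importance is the attention
    weight it has accumulated so far, and the least important token outside
    a small window of recent tokens is evicted when the cache overflows.
    """

    def __init__(self, budget, recent=1):
        if not 0 <= recent < budget:
            raise ValueError("need 0 <= recent < budget")
        self.budget = budget
        self.recent = recent    # newest tokens protected from eviction
        self.keys = []
        self.values = []
        self.scores = []        # accumulated attention weight per cached token
        self.positions = []     # original sequence position of each cached token

    def append(self, key, value, position):
        """Cache one token's key/value, evicting if the budget is exceeded."""
        self.keys.append(key)
        self.values.append(value)
        self.scores.append(0.0)
        self.positions.append(position)
        if len(self.keys) > self.budget:
            self._evict()

    def attend(self, query):
        """Scaled dot-product attention over cached tokens; updates scores."""
        scale = math.sqrt(len(query))
        logits = [sum(q * k for q, k in zip(query, key)) / scale
                  for key in self.keys]
        weights = softmax(logits)
        for i, w in enumerate(weights):
            self.scores[i] += w
        dim = len(self.values[0])
        return [sum(w * v[d] for w, v in zip(weights, self.values))
                for d in range(dim)]

    def _evict(self):
        """Drop the lowest-scoring token that is not in the recent window."""
        candidates = range(len(self.keys) - self.recent)
        victim = min(candidates, key=lambda i: self.scores[i])
        for buf in (self.keys, self.values, self.scores, self.positions):
            del buf[victim]


# Usage: token 0 has a key aligned with the query, so it attracts most of
# the attention; later tokens have orthogonal keys and attract little.
cache = BudgetedKVCache(budget=3, recent=1)
query = [1.0, 0.0]
for t in range(8):
    key = [4.0, 0.0] if t == 0 else [0.0, 1.0]
    cache.append(key, [float(t)], position=t)
    cache.attend(query)
print(cache.positions)
```

In this toy run the cache never grows past three entries, and the heavily attended token at position 0 stays cached while lightly attended tokens are dropped. A probabilistic variant would sample the token to evict with probability decreasing in its score rather than always taking the minimum.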