In this study, we introduce adaptive KV cache compression, a plug-and-play method that reduces the memory footprint of generative inference for Large Language Models (LLMs). Unlike the conventional KV cache, which retains key and value vectors for all context tokens, our method, FastGen, first conducts targeted profiling to discern the intrinsic structure of attention modules. Based on the recognized structure, we then construct the KV cache adaptively: evicting long-range contexts on attention heads that emphasize local contexts, discarding non-special tokens on attention heads centered on special tokens, and employing the standard KV cache only for attention heads that broadly attend to all tokens. Moreover, because only lightweight attention profiling is used to guide the construction of the adaptive KV cache, FastGen can be deployed without resource-intensive fine-tuning or re-training. In our experiments across various tasks, FastGen demonstrates a substantial reduction in GPU memory consumption with negligible loss in generation quality. We will release our code and the compatible CUDA kernel for reproducibility.
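The per-head decision described above can be illustrated with a minimal sketch. It is not the released FastGen code: the function names, the `window` and `threshold` parameters, and the rule "pick the first policy whose kept tokens recover enough attention mass" are assumptions made for illustration only, since the abstract does not specify the profiling procedure.

```python
from typing import List, Set


def kept_positions(policy: str, seq_len: int, special: Set[int], window: int) -> List[int]:
    """Return the token positions whose key/value vectors a head would keep."""
    if policy == "special":
        # Keep only special tokens (e.g. a beginning-of-sequence token).
        return sorted(p for p in special if p < seq_len)
    if policy == "local":
        # Keep only the most recent `window` tokens; evict long-range context.
        return list(range(max(0, seq_len - window), seq_len))
    # "full": the standard KV cache, nothing is evicted.
    return list(range(seq_len))


def recovered_mass(attn_row: List[float], keep: List[int]) -> float:
    """Fraction of the head's attention mass that falls on the kept positions."""
    total = sum(attn_row)
    return sum(attn_row[i] for i in keep) / total if total else 0.0


def choose_policy(attn_row: List[float], special: Set[int],
                  window: int = 4, threshold: float = 0.95) -> str:
    """Pick a compressed policy if it preserves enough attention, else fall back to "full".

    `attn_row` is one profiled attention distribution for a single head
    (hypothetical input; a real profiler would aggregate over many queries).
    """
    seq_len = len(attn_row)
    for policy in ("special", "local"):
        keep = kept_positions(policy, seq_len, special, window)
        if recovered_mass(attn_row, keep) >= threshold:
            return policy
    return "full"


if __name__ == "__main__":
    special = {0}  # assume position 0 holds a special token
    heads = {
        "local-looking head": [0.0, 0.0, 0.0, 0.0, 0.1, 0.2, 0.3, 0.4],
        "special-token head": [0.97, 0.0, 0.0, 0.0, 0.0, 0.01, 0.01, 0.01],
        "broad head": [0.125] * 8,
    }
    for name, row in heads.items():
        policy = choose_policy(row, special)
        print(name, policy, kept_positions(policy, len(row), special, window=4))
```

In this toy setting, a head whose attention concentrates on recent tokens keeps only a short window, a head that concentrates on the special token keeps a single position, and a head that spreads its attention evenly keeps the full cache, which is where the memory savings would come from under these assumptions.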