The high memory demands of the Key-Value (KV) Cache during the inference of Large Language Models (LLMs) severely restrict their deployment on resource-constrained platforms. Quantization can effectively alleviate the memory pressure caused by the KV Cache. However, existing methods either rely on static one-size-fits-all precision allocation or fail to dynamically prioritize critical KV pairs in long-context tasks, forcing unfavorable memory-accuracy-throughput tradeoffs. In this work, we propose a novel mixed-precision quantization method for the KV Cache named KVmix. KVmix leverages gradient-based importance analysis to evaluate how individual Key and Value projection matrices affect the model loss, enabling layer-specific bit-width allocation for mixed-precision quantization. It dynamically prioritizes higher precision for important layers while aggressively quantizing less influential ones, achieving a tunable balance between accuracy and efficiency. KVmix also introduces a dynamic long-context optimization strategy that adaptively keeps full-precision KV pairs for recent pivotal tokens and compresses older ones, enabling high-quality sequence generation with low memory usage. Additionally, KVmix provides efficient low-bit quantization and CUDA kernels to reduce computational overhead. On LLMs such as Llama and Mistral, KVmix achieves near-lossless inference performance with an extremely low quantization configuration (2.19-bit Keys, 2.38-bit Values), while delivering a remarkable 4.9x memory compression and a 5.3x speedup in inference throughput.
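To make the core idea concrete, the following is a minimal sketch (not the paper's actual implementation) of gradient-based layer importance scoring and per-layer bit-width allocation. The importance metric `|W ⊙ ∂L/∂W|` summed per projection matrix, the `high_fraction` budget knob, and the simple per-tensor uniform quantizer are all illustrative assumptions, not details taken from KVmix itself.

```python
import numpy as np

def layer_importance(weights, grads):
    # Hypothetical saliency: sum of |W * dL/dW| per projection matrix.
    # Larger values suggest the layer's K/V projection matters more to the loss.
    return [float(np.abs(w * g).sum()) for w, g in zip(weights, grads)]

def allocate_bits(importance, high_bits=4, low_bits=2, high_fraction=0.25):
    # Assign high precision to the most influential layers, low to the rest.
    # high_fraction is an assumed tunable budget, echoing the paper's
    # accuracy/efficiency tradeoff knob.
    n_high = max(1, int(len(importance) * high_fraction))
    order = np.argsort(importance)[::-1]  # indices sorted by descending importance
    bits = [low_bits] * len(importance)
    for i in order[:n_high]:
        bits[i] = high_bits
    return bits

def quantize(x, bits):
    # Per-tensor asymmetric uniform quantization, round-tripped back to float.
    levels = 2 ** bits - 1
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / levels if hi > lo else 1.0
    q = np.round((x - lo) / scale)
    return q * scale + lo
```

With a 25% high-precision budget over, say, 32 layers split between 4-bit and 2-bit, the average bit-width lands near the low-bit end, which is how mixed allocation can approach the ~2.2-2.4 average bits reported while protecting the most loss-sensitive layers.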