The growing adoption of long-context Large Language Models (LLMs) has introduced significant memory and computational challenges in autoregressive decoding due to the expanding Key-Value (KV) cache. KV cache quantization has emerged as a promising solution, with prior work showing that 4-bit or even 2-bit quantization can maintain model accuracy while reducing memory costs. However, despite these benefits, preliminary low-bit KV cache implementations struggle to deliver the expected speedup due to quantization and dequantization overheads and a lack of Tensor Core utilization. In this work, we propose BitDecoding, a GPU-optimized framework that unlocks Tensor Cores for efficient decoding with a low-bit KV cache. Leveraging Tensor Cores efficiently for a low-bit KV cache is challenging because the cache grows dynamically at each decoding step. BitDecoding addresses this challenge with a Tensor Cores-Centric BitFusion Scheme that ensures data-layout compatibility and thereby enables high Tensor Core utilization. In addition, BitDecoding incorporates a warp-efficient parallel decoding kernel and a fine-grained asynchronous pipeline that minimize dequantization overhead and improve computational efficiency. Experiments show that BitDecoding achieves up to 7.5x speedup on RTX 4090, 4.8x on A100, and 8.9x on H100, compared with FP16 FlashDecoding-v2. It also outperforms the state-of-the-art low-bit KV cache implementation, QServe, by up to 4.3x. On LLaMA-3.1-8B with a 128K sequence length, BitDecoding reduces single-batch decoding latency by 3x, demonstrating its effectiveness in long-context generation scenarios. The code is available at https://github.com/DD-DuDa/BitDecoding.
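To make the quantization/dequantization overhead discussed above concrete, the sketch below shows per-group 4-bit KV cache quantization and its inverse in PyTorch. This is a minimal illustration of the numerics only, not BitDecoding's method: the function names, group size, and byte-packing layout are illustrative assumptions, and the actual framework fuses these steps into Tensor Core-friendly kernels rather than running them as separate element-wise passes.

```python
import torch

def quantize_kv_4bit(kv: torch.Tensor, group_size: int = 64):
    """Per-group asymmetric 4-bit quantization along the last dim.

    Illustrative sketch only (hypothetical helper): it shows the
    numerics of low-bit KV caching, not BitDecoding's Tensor Core
    data layout.
    """
    orig_shape = kv.shape
    g = kv.reshape(-1, group_size)                 # [num_groups, group_size]
    mins = g.min(dim=-1, keepdim=True).values
    maxs = g.max(dim=-1, keepdim=True).values
    scales = (maxs - mins).clamp(min=1e-6) / 15.0  # 4 bits -> 16 levels
    q = ((g - mins) / scales).round().clamp(0, 15).to(torch.uint8)
    packed = q[:, 0::2] | (q[:, 1::2] << 4)       # two 4-bit values per byte
    return packed, scales.half(), mins.half(), orig_shape

def dequantize_kv_4bit(packed, scales, mins, orig_shape):
    """Inverse of quantize_kv_4bit; this unpack-and-rescale step is the
    per-decoding-step overhead that fused kernels aim to hide."""
    lo = packed & 0x0F
    hi = packed >> 4
    q = torch.stack((lo, hi), dim=-1).reshape(packed.shape[0], -1).float()
    return (q * scales.float() + mins.float()).reshape(orig_shape).half()

if __name__ == "__main__":
    k_cache = torch.randn(1, 8, 4096, 128, dtype=torch.half)  # [B, H, S, D]
    packed, s, m, shape = quantize_kv_4bit(k_cache)
    k_rec = dequantize_kv_4bit(packed, s, m, shape)
    print("packed bytes:", packed.numel(), "fp16 bytes:", k_cache.numel() * 2)
    print("max abs error:", (k_cache - k_rec).abs().max().item())
```

Running the round trip shows the roughly 4x memory reduction (plus small per-group scale/zero-point overhead) alongside a bounded reconstruction error, which is the trade-off the abstract refers to when noting that accuracy is maintained while memory costs drop.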