The growth of long-context Large Language Models (LLMs) significantly increases memory and bandwidth pressure during autoregressive decoding due to the expanding Key-Value (KV) cache. While accuracy-preserving KV-cache quantization (e.g., 4-bit or 2-bit) reduces memory footprint, existing systems decode inefficiently by relying solely on CUDA cores, underutilizing Tensor Cores, the dominant compute resource on GPUs. We present BitDecoding, the first inference system to efficiently decode low-bit KV caches by cooperatively leveraging CUDA cores and Tensor Cores. BitDecoding smartly induces Tensor-Core-friendly layouts, introduces warp-level dequantization parallelism, and provides unified system support through query transformation, high-performance tensor- and channel-wise quantization, and a software-pipelined dequantization kernel enabling mixed-precision execution. Architecture-aware optimizations further leverage Hopper's warpgroup tensor instructions and Blackwell's NVFP4 (MXFP4) tensor formats. Evaluated on Blackwell, Hopper, and Ampere GPUs, BitDecoding achieves an average 7.5x decoding speedup over FP16 FlashDecoding-v2, up to 8.6x on Blackwell with NVFP4, and up to 4.3x over state-of-the-art approaches. On LLaMA-3.1-8B with a 128K context, BitDecoding reduces single-batch decoding latency by 3x. BitDecoding is open-sourced at https://github.com/OpenBitSys/BitDecoding.
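To make the low-bit KV-cache idea concrete, the sketch below shows per-channel 4-bit quantization and dequantization of a key cache in PyTorch. It is an illustrative assumption only, not BitDecoding's fused CUDA/Tensor-Core kernel; the function names, group granularity, and unpacked uint8 storage are hypothetical simplifications.

```python
# Minimal sketch (assumption): per-channel 4-bit quantization of a KV cache.
# A real system would pack two 4-bit values per byte and fuse dequantization
# into the attention kernel; this only illustrates the numerics.
import torch

def quantize_k_per_channel(k: torch.Tensor, bits: int = 4):
    """Quantize keys per head-dim channel; k has shape [seq_len, head_dim]."""
    qmax = (1 << bits) - 1                         # 15 for 4-bit
    k_min = k.amin(dim=0, keepdim=True)            # per-channel min, [1, head_dim]
    k_max = k.amax(dim=0, keepdim=True)            # per-channel max, [1, head_dim]
    scale = (k_max - k_min).clamp(min=1e-6) / qmax
    zero = k_min
    q = ((k - zero) / scale).round().clamp(0, qmax).to(torch.uint8)
    return q, scale, zero                          # q is stored low-bit; scale/zero stay full precision

def dequantize_k(q: torch.Tensor, scale: torch.Tensor, zero: torch.Tensor):
    """Recover approximate keys before (or fused with) the attention GEMM."""
    return q.to(scale.dtype) * scale + zero

if __name__ == "__main__":
    torch.manual_seed(0)
    k = torch.randn(4096, 128)                     # toy key cache (production caches are FP16/BF16)
    q, scale, zero = quantize_k_per_channel(k)
    k_hat = dequantize_k(q, scale, zero)
    print("max abs error:", (k - k_hat).abs().max().item())
    print("bytes fp16:", k.numel() * 2, " bytes 4-bit packed ~", k.numel() // 2, "+ scales/zeros")
```

Per-channel scaling for keys is a common choice in low-bit KV-cache work because key activations tend to have outlier channels; whether BitDecoding uses exactly this scheme at a given granularity should be checked against the paper and repository.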