InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models

Reducing the hardware footprint of large language models (LLMs) during decoding is critical for efficient long-sequence generation. A key bottleneck is the key-value (KV) cache, whose size scales with sequence length and easily dominates the memory footprint of the model. Prior work has proposed quantization methods that compress the KV cache while preserving its information. We introduce InnerQ, a hardware-aware KV-cache quantization scheme that lowers decode latency without sacrificing accuracy. InnerQ applies group-wise quantization, grouping the cache matrices over their inner dimension. Unlike prior work, which groups over the outer dimension, InnerQ aligns dequantization with the vector-matrix multiplication and enables scale factor reuse across GPU compute units. This reduces memory accesses and accelerates dequantization, yielding up to $22\%$ speedup over previous work and up to $88\%$ over half-precision vector-matrix multiplication. To preserve fidelity under aggressive compression, InnerQ incorporates (i) hybrid quantization, selecting symmetric or asymmetric quantization per group based on local statistics; (ii) high-precision windows for both the most recent tokens and the attention sink tokens to mitigate outlier leakage; and (iii) per-channel normalization of the key cache, computed once during prefill and folded into the query to avoid runtime overhead. Our evaluation on Llama models shows that InnerQ maintains few-shot GSM8K performance comparable to non-quantized KV caches and surpasses prior KV cache quantization methods.
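To make point (iii) concrete, the following minimal Python sketch illustrates why per-channel key normalization can be folded into the query without changing attention scores: dividing key channel c by a scale s_c and multiplying query channel c by the same s_c leaves every dot product unchanged. The abstract does not specify which statistic defines the scale, so the per-channel maximum absolute value used here is an assumption, as are all function names. The quantization step itself is omitted.

```python
from typing import List

Matrix = List[List[float]]  # rows are tokens, columns are channels


def channel_scales(keys: Matrix) -> List[float]:
    """Per-channel scale, computed once (e.g. at prefill).

    ASSUMPTION: max absolute value over tokens; the abstract does not
    say which statistic InnerQ uses.
    """
    scales = []
    for c in range(len(keys[0])):
        peak = max(abs(row[c]) for row in keys)
        scales.append(peak if peak > 0.0 else 1.0)  # guard against zero
    return scales


def normalize_keys(keys: Matrix, scales: List[float]) -> Matrix:
    """Divide each key channel by its scale (this is what would be quantized)."""
    return [[k / s for k, s in zip(row, scales)] for row in keys]


def fold_into_query(query: List[float], scales: List[float]) -> List[float]:
    """Multiply each query channel by the matching key scale."""
    return [q * s for q, s in zip(query, scales)]


def scores(query: List[float], keys: Matrix) -> List[float]:
    """Vector-matrix product q @ K^T: one dot product per cached token."""
    return [sum(q * k for q, k in zip(query, row)) for row in keys]


# Toy data: channel 1 carries much larger magnitudes than the others.
keys = [[0.5, -8.0, 0.25], [1.0, 4.0, -0.5], [-0.25, 2.0, 1.0]]
query = [1.0, 0.5, -2.0]

scales = channel_scales(keys)
reference = scores(query, keys)
folded = scores(fold_into_query(query, scales), normalize_keys(keys, scales))
```

In this toy example `reference` and `folded` agree, so the normalization costs only one elementwise multiply on the query at decode time; the cached keys never need to be rescaled.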
