Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse natural language processing tasks. However, their extensive memory requirements, driven in particular by KV cache growth during long-text understanding and generation, pose significant challenges for deployment in resource-constrained environments. Quantization has emerged as a promising way to reduce memory consumption while preserving the historical context stored in the cache. We propose XQuant, a training-free, plug-and-play framework that achieves ultra-low equivalent bit-width KV cache quantization. XQuant introduces two key innovations: a data-free calibration method with negligible computational overhead and cross-layer KV cache compression, which together enable quantization below 1.4 bits. Extensive experiments on TruthfulQA and LongBench demonstrate that XQuant outperforms state-of-the-art methods such as KIVI-2bit and AsymKV-1.5bit, achieving a lower bit-width while delivering superior performance and thereby establishing a better trade-off between memory efficiency and model accuracy.
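To make the memory arithmetic concrete, the sketch below shows a generic asymmetric per-group quantizer and dequantizer for a KV cache tensor. It is only an illustration of low-bit KV cache quantization under assumed settings (2 bits, group size 64, NumPy tensors); it is not XQuant's calibration procedure or its cross-layer compression scheme.

```python
# Illustrative sketch (assumed settings, not the paper's exact algorithm):
# asymmetric per-group quantization of a KV cache tensor to a low bit-width.
import numpy as np

def quantize_kv(x: np.ndarray, bits: int = 2, group_size: int = 64):
    """Quantize x in contiguous groups of `group_size` values (asymmetric, uniform)."""
    orig_shape = x.shape
    g = x.reshape(-1, group_size)                       # (num_groups, group_size)
    x_min = g.min(axis=1, keepdims=True)                # per-group zero point
    x_max = g.max(axis=1, keepdims=True)
    scale = (x_max - x_min) / (2 ** bits - 1)           # per-group scale
    scale = np.where(scale == 0, 1.0, scale)            # guard against constant groups
    q = np.clip(np.round((g - x_min) / scale), 0, 2 ** bits - 1).astype(np.uint8)
    return q.reshape(orig_shape), scale, x_min

def dequantize_kv(q: np.ndarray, scale: np.ndarray, x_min: np.ndarray,
                  group_size: int = 64) -> np.ndarray:
    g = q.reshape(-1, group_size).astype(np.float32)
    return (g * scale + x_min).reshape(q.shape)

# Toy example: a key cache slice of shape (seq_len, head_dim)
kv = np.random.randn(128, 64).astype(np.float32)
q, s, z = quantize_kv(kv, bits=2)
kv_hat = dequantize_kv(q, s, z)
print("mean abs reconstruction error:", np.abs(kv - kv_hat).mean())
```

Under this kind of scheme, an "equivalent" bit-width below the per-element quantization level (e.g., sub-1.4 bits while individual layers use 1 or 2 bits) arises from averaging storage cost over layers and metadata; the exact accounting in XQuant follows the paper's cross-layer compression rather than this sketch.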