Large Language Models (LLMs) face significant deployment challenges due to their substantial memory requirements and the computational demands of the auto-regressive text generation process. This paper addresses these challenges by focusing on the quantization of LLMs, a technique that reduces memory consumption by converting model parameters and activations into low-bit integers. We critically analyze existing quantization approaches and identify their limitations in balancing the accuracy and efficiency of quantized LLMs. To advance beyond these limitations, we propose WKVQuant, a post-training quantization (PTQ) framework specifically designed for quantizing the weights and the key/value (KV) cache of LLMs. Specifically, we incorporate past-only quantization to improve the computation of attention. Additionally, we introduce a two-dimensional quantization strategy to handle the distribution of the KV cache, along with a cross-block reconstruction regularization for parameter optimization. Experiments show that WKVQuant achieves memory savings nearly comparable to those of weight-activation quantization, while approaching the performance of weight-only quantization.
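The abstract does not spell out how past-only quantization works, so the following is only a minimal sketch of one plausible reading: at each decoding step the current token's key and value are used in full precision for attention, and they are quantized to low-bit integers only when stored in the cache for later steps. The names `quantize`, `dequantize`, `PastOnlyKVCache`, and `attend` are hypothetical, and the per-vector asymmetric uniform quantizer is an assumption made for illustration; it is not the paper's two-dimensional strategy.

```python
import math


def quantize(row, bits=4):
    """Asymmetric uniform quantization of one vector to integers in [0, 2**bits - 1].

    Assumption: a plain min/max quantizer per vector, chosen for illustration only.
    """
    lo, hi = min(row), max(row)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((x - lo) / scale) for x in row]
    return codes, scale, lo


def dequantize(codes, scale, lo):
    """Map integer codes back to approximate floating-point values."""
    return [c * scale + lo for c in codes]


class PastOnlyKVCache:
    """Hypothetical cache: past entries are stored quantized, the current one is not."""

    def __init__(self, bits=4):
        self.bits = bits
        self.keys = []    # quantized past keys: (codes, scale, lo)
        self.values = []  # quantized past values: (codes, scale, lo)

    def step(self, k, v):
        """Return the keys/values to attend over now, then store (k, v) quantized."""
        # Past entries are dequantized; the current entry stays in full precision.
        keys = [dequantize(*q) for q in self.keys] + [list(k)]
        values = [dequantize(*q) for q in self.values] + [list(v)]
        # Only after being used does the current pair enter the cache as integers.
        self.keys.append(quantize(k, self.bits))
        self.values.append(quantize(v, self.bits))
        return keys, values


def attend(q, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in keys]
    m = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / z for j in range(dim)]
```

Under these assumptions, a decoding loop would call `cache.step(k, v)` once per generated token and pass the returned lists to `attend`. The quantization error then affects only past tokens, never the token currently being processed.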