In this work we show that the size versus accuracy trade-off of neural network quantization can be significantly improved by increasing the quantization dimensionality. We propose GPTVQ, a new fast method for post-training vector quantization (VQ) that scales well to Large Language Models (LLMs). Our method interleaves quantization of one or more columns with updates to the remaining unquantized weights, using information from the Hessian of the per-layer output reconstruction MSE. Quantization codebooks are initialized using an efficient data-aware version of the EM algorithm. The codebooks are then updated and further compressed using integer quantization and SVD-based compression. GPTVQ establishes a new state of the art in the size versus accuracy trade-off on a wide range of LLMs such as Llama-v2 and Mistral. Furthermore, our method is efficient: on a single H100 it takes between 3 and 11 hours to process a Llama-v2-70B model, depending on the quantization setting. Lastly, with on-device timings for VQ decompression on a mobile CPU, we show that VQ leads to improved latency compared to using a 4-bit integer format.
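To make the notion of quantization dimensionality concrete, the sketch below quantizes a weight matrix by grouping consecutive weights into `dim`-dimensional vectors and replacing each vector with its nearest codebook entry. It is a minimal illustration only, under several assumptions: the codebook is fit with plain Lloyd's k-means rather than the data-aware EM initialization described above, there is no Hessian-based update of the remaining weights, and the function names (`kmeans_codebook`, `vq_quantize`, `bits_per_weight`) and the 16-bit codebook entry size are hypothetical choices made for this example.

```python
import numpy as np


def kmeans_codebook(vectors, k, iters=20, seed=0):
    """Fit a k-entry codebook with plain Lloyd's k-means (not data-aware EM)."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct data points.
    codebook = vectors[rng.choice(len(vectors), size=k, replace=False)].copy()
    for _ in range(iters):
        # Squared Euclidean distance from every vector to every centroid.
        dists = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
        assign = dists.argmin(axis=1)
        for j in range(k):
            members = vectors[assign == j]
            if len(members) > 0:  # empty clusters keep their old centroid
                codebook[j] = members.mean(axis=0)
    # Final assignment against the updated codebook.
    dists = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return codebook, dists.argmin(axis=1)


def vq_quantize(weights, dim=2, k=16):
    """Quantize a 2-D weight matrix with dim-dimensional vector quantization."""
    rows, cols = weights.shape
    assert cols % dim == 0, "number of columns must be divisible by dim"
    # Group every `dim` consecutive weights in a row into one vector.
    vectors = weights.reshape(-1, dim)
    codebook, assign = kmeans_codebook(vectors, k)
    # Decompression is a table lookup: each index is replaced by its centroid.
    dequantized = codebook[assign].reshape(rows, cols)
    return dequantized, codebook, assign


def bits_per_weight(num_weights, dim, k, codebook_bits=16):
    """Storage cost: index bits per weight plus amortised codebook overhead."""
    index_bits = np.log2(k) / dim
    codebook_overhead = k * dim * codebook_bits / num_weights
    return index_bits + codebook_overhead


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.standard_normal((64, 64))
    W_hat, codebook, assign = vq_quantize(W, dim=2, k=16)
    print("reconstruction MSE:", float(np.mean((W - W_hat) ** 2)))
    print("bits per weight:", bits_per_weight(W.size, dim=2, k=16))
```

With `dim=2` and `k=16`, each pair of weights is stored as one 4-bit index, so the index cost is 2 bits per weight; for the 64 by 64 example, the 16-entry codebook adds 0.125 bits per weight when its entries are assumed to be stored in 16 bits. Raising `dim` while holding the index bits per weight fixed requires a codebook that grows exponentially in `dim`, which is one reason codebook compression matters in practice.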