Due to their large size, generative Large Language Models (LLMs) require significant computing and storage resources. This paper introduces a new post-training quantization method, GPTQT, which reduces memory usage and improves processing speed by expressing LLM weights in 3 or 2 bits. Practice has shown that directly minimizing the quantization error of the weights is ineffective and prone to overfitting. GPTQT therefore employs a progressive two-step approach: it first quantizes the weights to a relatively high bit-width using linear quantization, then converts the obtained integer weights to lower-bit binary coding. A re-exploration strategy is proposed to optimize the initial scaling factor. During inference, the two steps are merged into pure binary coding, enabling efficient computation. Tests across various models and datasets confirm GPTQT's effectiveness. Compared with a strong 3-bit quantization baseline, GPTQT further reduces perplexity by 4.01 on OPT-66B and achieves a 1.24× speedup on OPT-30B. Results on Llama2 show that GPTQT is currently the best binary coding quantization method for this kind of LLM.
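The two-step pipeline described above can be sketched as follows. This is a minimal illustration under assumptions, not the paper's actual algorithm: it uses plain uniform (min-max) quantization for step one and a greedy residual binarization for step two, whereas GPTQT's exact quantizer, bit-widths, and re-exploration of the scaling factor are not specified in the abstract. All function names here are hypothetical.

```python
import numpy as np

def linear_quantize(w, n_bits=4):
    # Step 1 (sketch): uniform min-max quantization of the weights to a
    # relatively high bit-width, yielding integer weights plus scale/zero.
    scale = (w.max() - w.min()) / (2 ** n_bits - 1)
    zero = w.min()
    q = np.round((w - zero) / scale).astype(np.int32)
    return q, scale, zero

def binary_coding(q, n_codes=2):
    # Step 2 (sketch): greedily decompose the integer weights into a sum of
    # scaled {-1, +1} codes, i.e. q ≈ sum_i alpha_i * b_i.
    r = q.astype(np.float64)
    alphas, codes = [], []
    for _ in range(n_codes):
        b = np.where(r >= 0, 1.0, -1.0)   # sign of the current residual
        a = np.abs(r).mean()              # least-squares optimal scale for b
        alphas.append(a)
        codes.append(b)
        r = r - a * b                     # update residual
    return alphas, codes

rng = np.random.default_rng(0)
w = rng.standard_normal(64)
q, scale, zero = linear_quantize(w, n_bits=4)
alphas, codes = binary_coding(q, n_codes=2)

# At inference the two steps collapse into pure binary coding: the linear
# scale folds into the per-code scales, so w ≈ sum_i (scale * alpha_i) * b_i + zero.
w_hat = scale * sum(a * b for a, b in zip(alphas, codes)) + zero
```

Because each `b_i` is a ±1 vector, the merged form needs only sign flips and accumulations at inference time, which is the source of the speedup the abstract reports.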