As large language models (LLMs) continue to grow in size, compressing them without sacrificing accuracy has become a crucial challenge for deployment. While some quantization methods, such as GPTQ, have achieved acceptable 4-bit weight-only quantization, attempts at lower-bit quantization often result in severe performance degradation. In this paper, we introduce norm tweaking, a technique that can be used as a plugin for existing post-training quantization (PTQ) methods to achieve high accuracy while remaining cost-efficient. Our approach is inspired by the observation that rectifying the quantized activation distribution to match its float counterpart can readily restore accuracy for LLMs. To achieve this, we design a tweaking strategy that combines calibration data generation with a channel-wise distance constraint to update the weights of normalization layers for better generalization. We conduct extensive experiments on various datasets using several open-source LLMs. Our method demonstrates significant improvements in both weight-only quantization and joint quantization of weights and activations, surpassing existing PTQ methods. On GLM-130B and OPT-66B, our method at 2-bit quantization even achieves the same level of accuracy as the float models. Its simplicity and effectiveness make it practical for real-world applications.
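To make the idea concrete, the following is a minimal NumPy sketch of tweaking normalization parameters so that the channel-wise statistics of a quantized layer's output move toward those of its float counterpart. It is an illustration under stated assumptions, not the implementation used in the paper: the round-to-nearest quantizer stands in for a real PTQ method such as GPTQ, the loss (squared differences of per-channel means and variances) is only one plausible form of a channel-wise distance, random inputs replace the generated calibration data, and all function names (`quantize_rtn`, `channel_stats_loss`, `tweak_norm`) are hypothetical.

```python
import numpy as np

def quantize_rtn(w, bits=2):
    """Per-output-channel min-max round-to-nearest quantization.
    Assumption: a simple stand-in for a real PTQ method such as GPTQ."""
    levels = 2 ** bits - 1
    w_min, w_max = w.min(axis=0), w.max(axis=0)
    scale = np.maximum((w_max - w_min) / levels, 1e-8)
    return np.round((w - w_min) / scale) * scale + w_min

def forward(n, gamma, beta, w):
    """Affine part of a normalization layer followed by a linear layer."""
    return (n * gamma + beta) @ w

def channel_stats_loss(y_f, y_q):
    """Assumed channel-wise distance: squared gaps in per-channel mean and variance."""
    d_mu = y_q.mean(axis=0) - y_f.mean(axis=0)
    d_var = y_q.var(axis=0) - y_f.var(axis=0)
    return np.mean(d_mu ** 2 + d_var ** 2)

def loss_and_grads(n, gamma, beta, w_q, y_f):
    """Loss and its analytic gradients with respect to gamma and beta."""
    y_q = forward(n, gamma, beta, w_q)
    num_rows, num_ch = y_q.shape
    mu_q = y_q.mean(axis=0)
    d_mu = mu_q - y_f.mean(axis=0)
    d_var = y_q.var(axis=0) - y_f.var(axis=0)
    loss = np.mean(d_mu ** 2 + d_var ** 2)
    # dL/dy for the mean term and the variance term, per element.
    g_y = (2 * d_mu + 4 * d_var * (y_q - mu_q)) / (num_rows * num_ch)
    g_h = g_y @ w_q.T  # gradient at the normalization output
    return loss, (g_h * n).sum(axis=0), g_h.sum(axis=0)

def tweak_norm(n, gamma, beta, w_q, y_f, steps=200, lr=1e-2):
    """Gradient descent on gamma/beta only; the quantized weights stay frozen.
    A step is kept only if it lowers the loss, otherwise the step size is halved."""
    loss, g_gamma, g_beta = loss_and_grads(n, gamma, beta, w_q, y_f)
    for _ in range(steps):
        cand_gamma = gamma - lr * g_gamma
        cand_beta = beta - lr * g_beta
        cand = loss_and_grads(n, cand_gamma, cand_beta, w_q, y_f)
        if cand[0] < loss:
            gamma, beta = cand_gamma, cand_beta
            loss, g_gamma, g_beta = cand
        else:
            lr *= 0.5
    return gamma, beta, loss

rng = np.random.default_rng(0)
x = rng.normal(size=(256, 16))  # random stand-in for calibration data
n = (x - x.mean(axis=1, keepdims=True)) / np.sqrt(x.var(axis=1, keepdims=True) + 1e-5)
gamma, beta = np.ones(16), np.zeros(16)
w = rng.normal(size=(16, 8))
w_q = quantize_rtn(w, bits=2)

y_f = forward(n, gamma, beta, w)  # float reference output
loss_before = channel_stats_loss(y_f, forward(n, gamma, beta, w_q))
gamma_t, beta_t, loss_after = tweak_norm(n, gamma, beta, w_q, y_f)
print(f"channel-wise loss before: {loss_before:.4f}, after: {loss_after:.4f}")
```

Only `gamma` and `beta` are updated here, which mirrors the abstract's point that the normalization layers are adjusted while the quantized weights are left as produced by the underlying PTQ method. The accept-or-halve step rule is a design choice of this sketch that keeps the loss from increasing; it is not taken from the paper.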