As large language models (LLMs) become more prevalent, there is a growing need for new and improved quantization methods that can meet the computational demands of these modern architectures while maintaining accuracy. In this paper, we present TEQ, a trainable equivalent transformation that preserves the FP32 precision of the model output while taking advantage of low-precision quantization, especially 3-bit and 4-bit weight-only quantization. The training process is lightweight, requiring only 1K steps and fewer than 0.1 percent of the original model's trainable parameters. Furthermore, the transformation does not add any computational overhead during inference. Our results are on par with state-of-the-art (SOTA) methods on typical LLMs, and our approach can be combined with other methods to achieve even better performance. The code is available at https://github.com/intel/neural-compressor.
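The abstract does not spell out the form of the transformation. As a minimal sketch, assume it is a per-input-channel scaling of a linear layer, so that (x / s) @ (diag(s) @ W) equals x @ W in exact arithmetic. The scale vector `s` below is random and stands in for trained values; the training loop is omitted, and the helper names `quantize_weight_rtn` and `apply_equivalent_transform` are hypothetical and not taken from the released code. The quantizer is plain round-to-nearest, chosen only for illustration.

```python
import numpy as np


def quantize_weight_rtn(w, num_bits=4):
    """Asymmetric round-to-nearest quantize-dequantize, one range per output column."""
    qmax = 2 ** num_bits - 1
    w_min = w.min(axis=0, keepdims=True)
    w_max = w.max(axis=0, keepdims=True)
    # Guard against a zero range so the division below is always defined.
    step = np.maximum(w_max - w_min, 1e-8) / qmax
    q = np.clip(np.round((w - w_min) / step), 0, qmax)
    return q * step + w_min


def apply_equivalent_transform(x, w, s):
    """Scale input channel i of x by 1/s[i] and row i of w by s[i]."""
    return x / s, w * s[:, None]


rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))         # activations: (tokens, in_features)
w = rng.normal(size=(16, 32))        # weights: (in_features, out_features)
s = rng.uniform(0.5, 2.0, size=16)   # assumption: stand-in for trained scales

x_t, w_t = apply_equivalent_transform(x, w, s)

y_ref = x @ w                        # original full-precision output
y_equiv = x_t @ w_t                  # transformed output, same up to rounding error
w_q = quantize_weight_rtn(w_t, num_bits=4)
y_quant = x_t @ w_q                  # weight-only quantized output

print("max |y_ref - y_equiv|:", np.abs(y_ref - y_equiv).max())
print("output MSE after 4-bit weight-only quantization:", np.mean((y_ref - y_quant) ** 2))
```

Under this assumption, the division by `s` can be folded into whatever produces `x` (for example, a preceding normalization layer's weights), which is one way such a transformation could avoid extra work at inference time. Random scales are not expected to reduce quantization error; the sketch only shows that the transformation leaves the full-precision output unchanged.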