Large Language Models (LLMs) have demonstrated exceptional proficiency in language-related tasks, but their substantial memory and storage requirements make deployment challenging. Weight-only quantization has emerged as a promising way to reduce these requirements. Previous research has indicated that fine-tuning the up/down rounding of weights can improve the accuracy of the quantized model. In this study, we introduce SignRound, a method that uses signed gradient descent (SignSGD) to optimize rounding values and weight clipping within just 200 steps, combining the strengths of Quantization-Aware Training (QAT) and Post-Training Quantization (PTQ). SignRound outperforms recent methods at 2 to 4 bits while keeping tuning costs low and introducing no additional inference overhead. For instance, at 2 bits SignRound yields absolute average accuracy improvements ranging from 6.91\% to 33.22\%. It also generalizes robustly to a variety of recent models and achieves near-lossless quantization in most scenarios at 4 bits. The source code is publicly available at \url{https://github.com/intel/auto-round}.
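The core idea stated in the abstract, tuning a rounding offset with signed gradient descent for a small number of steps, can be illustrated with a toy NumPy sketch. This is not the authors' implementation: the function names (`fake_quant`, `signround_sketch`), the per-row asymmetric quantizer, the output-reconstruction loss, the straight-through gradient, and the learning-rate schedule (`1/steps` with linear decay) are all assumptions made for illustration, and the learnable weight clipping mentioned in the abstract is omitted for brevity.

```python
import numpy as np


def fake_quant(W, V, scale, zero, qmax):
    """Quantize then dequantize W, adding the rounding offset V before rounding."""
    q = np.clip(np.round(W / scale + V) + zero, 0, qmax)
    return (q - zero) * scale


def signround_sketch(W, X, bits=4, steps=200):
    """Toy sign-gradient tuning of a rounding offset V in [-0.5, 0.5].

    W: (out_features, in_features) weights; X: (n_samples, in_features) inputs.
    Returns (dequantized weights, best loss, round-to-nearest loss).
    """
    qmax = 2 ** bits - 1
    # Assumed quantizer: per-row asymmetric min/max scaling.
    w_min = W.min(axis=1, keepdims=True)
    w_max = W.max(axis=1, keepdims=True)
    scale = np.maximum((w_max - w_min) / qmax, 1e-8)
    zero = np.round(-w_min / scale)

    Y = X @ W.T                       # full-precision layer output
    V = np.zeros_like(W)              # V = 0 is plain round-to-nearest
    best_V, best_loss, rtn_loss = V.copy(), np.inf, None

    for step in range(steps + 1):
        err = X @ fake_quant(W, V, scale, zero, qmax).T - Y
        loss = float(np.mean(err ** 2))
        if rtn_loss is None:
            rtn_loss = loss           # loss of the untuned baseline
        if loss < best_loss:
            best_V, best_loss = V.copy(), loss
        if step == steps:
            break
        # Straight-through estimator (assumption): round and clip are treated
        # as identity in the backward pass, so d(Wq)/dV is approximated by scale.
        grad_V = (err.T @ X) * scale
        # Sign update: only the sign of the gradient is used, not its magnitude.
        lr = (1.0 / steps) * (1.0 - step / steps)
        V = np.clip(V - lr * np.sign(grad_V), -0.5, 0.5)

    return fake_quant(W, best_V, scale, zero, qmax), best_loss, rtn_loss


rng = np.random.default_rng(0)
W_demo = rng.normal(size=(8, 16))
X_demo = rng.normal(size=(64, 16))
_, tuned, rtn = signround_sketch(W_demo, X_demo, bits=2, steps=200)
print(f"round-to-nearest loss: {rtn:.4f}, tuned loss: {tuned:.4f}")
```

Because the sketch keeps the best offset seen so far and starts from `V = 0`, the returned loss can never be worse than round-to-nearest on the calibration inputs; how much it improves depends on the data and is not indicative of the results reported in the abstract.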