Large Language Models (LLMs) have demonstrated exceptional proficiency in language-related tasks, but their substantial memory and storage requirements make deployment challenging. Weight-only quantization has emerged as a promising way to address these challenges. Previous research suggests that fine-tuning the choice between rounding up and rounding down can enhance performance. In this study, we introduce SignRound, a method that uses signed gradient descent (SignSGD) to optimize rounding values and weight clipping within just 200 steps. SignRound combines the advantages of Quantization-Aware Training (QAT) and Post-Training Quantization (PTQ), achieving exceptional results across 2 to 4 bits while keeping tuning costs low and adding no inference overhead. For example, SignRound achieves absolute average accuracy improvements ranging from 6.91\% to 33.22\% at 2 bits. It also generalizes robustly to recent models and achieves near-lossless quantization in most scenarios at 4 bits. The source code is publicly available at \url{https://github.com/intel/auto-round}.
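To make the idea of tuning rounding with SignSGD concrete, the following is a minimal pure-Python sketch for a single bias-free linear layer. It is not the auto-round implementation. Everything beyond what the abstract states is an assumption made for illustration: the symmetric per-row scale, the learnable rounding offsets confined to [-0.5, 0.5], the straight-through treatment of the rounding step, the output reconstruction loss, the learning rate of 1/steps with linear decay, and all function names. Weight clipping, which the method also tunes, is left out for brevity.

```python
import random


def quant_dequant(weights, offsets, scales, qmin, qmax):
    """Quantize then dequantize each weight: s * clip(round(w / s + v), qmin, qmax)."""
    return [
        [max(qmin, min(qmax, round(w / s + v))) * s for w, v in zip(w_row, v_row)]
        for w_row, v_row, s in zip(weights, offsets, scales)
    ]


def layer_output(inputs, weights):
    """Compute X @ W^T for a bias-free linear layer stored as lists of rows."""
    return [[sum(x * w for x, w in zip(x_row, w_row)) for w_row in weights]
            for x_row in inputs]


def mse(a, b):
    """Mean squared error between two equally shaped matrices."""
    count = len(a) * len(a[0])
    return sum((p - q) ** 2 for ra, rb in zip(a, b) for p, q in zip(ra, rb)) / count


def sign_round_sketch(inputs, weights, bits=4, steps=200):
    """Tune rounding offsets with signed gradient steps (hypothetical helper).

    Returns (best_offsets, round_to_nearest_loss, best_loss).
    """
    qmax = 2 ** (bits - 1) - 1
    qmin = -qmax - 1
    # Assumed symmetric per-row scale; fall back to 1.0 for an all-zero row.
    scales = [(max(abs(w) for w in row) / qmax) or 1.0 for row in weights]
    # Zero offsets reproduce plain round-to-nearest.
    offsets = [[0.0] * len(row) for row in weights]
    reference = layer_output(inputs, weights)

    def current_output():
        return layer_output(inputs, quant_dequant(weights, offsets, scales, qmin, qmax))

    rtn_loss = mse(current_output(), reference)
    best_loss, best_offsets = rtn_loss, [row[:] for row in offsets]

    for step in range(steps):
        lr = (1.0 / steps) * (1.0 - step / steps)  # assumed linear decay
        output = current_output()
        for j, v_row in enumerate(offsets):
            for k in range(len(v_row)):
                # Straight-through estimator: rounding and clipping are treated
                # as identity, so the gradient w.r.t. the offset is a positive
                # multiple (the scale) of the gradient w.r.t. the weight.
                grad = sum((output[i][j] - reference[i][j]) * inputs[i][k]
                           for i in range(len(inputs)))
                sign = (grad > 0) - (grad < 0)
                # SignSGD update: only the sign of the gradient is used.
                v_row[k] = max(-0.5, min(0.5, v_row[k] - lr * sign))
        loss = mse(current_output(), reference)
        if loss < best_loss:
            best_loss, best_offsets = loss, [row[:] for row in offsets]

    return best_offsets, rtn_loss, best_loss


if __name__ == "__main__":
    random.seed(0)
    w = [[random.gauss(0, 1) for _ in range(8)] for _ in range(4)]
    x = [[random.gauss(0, 1) for _ in range(8)] for _ in range(16)]
    _, rtn, tuned = sign_round_sketch(x, w, bits=2)
    print("round-to-nearest loss:", rtn, "tuned loss:", tuned)
```

Because the offsets start at zero and the best iterate is kept, the returned loss can never exceed the round-to-nearest loss on the calibration inputs. The tuned offsets only change which integer each weight rounds to, so the deployed model stores ordinary quantized weights, consistent with the claim of no additional inference overhead.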