Large Language Models (LLMs) have demonstrated exceptional capabilities on language-related tasks. However, their considerable memory and storage requirements make deployment challenging. Weight-only quantization, particularly at 3 and 4 bits, has emerged as one of the most viable solutions. As the number of bits decreases, the quantization grid becomes coarser, so the choice between rounding up and rounding down matters more. Previous studies have shown that fine-tuning this choice by adding perturbations can improve accuracy in some scenarios. Our work is motivated by the observation that these perturbations have a precise and limited range, within which only the threshold that flips the rounding direction matters. Consequently, we propose a concise and highly effective approach for optimizing weight rounding. Our method, named SignRound, performs lightweight block-wise tuning using signed gradient descent and achieves strong results within 400 steps. SignRound is competitive with recent methods and introduces no additional inference overhead. The source code will be publicly available at \url{https://github.com/intel/neural-compressor} soon.
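To make the idea of tuning rounding with signed gradient descent concrete, the following is a minimal sketch and not the SignRound implementation. It makes several assumptions that the abstract does not state: a single weight vector stands in for a block, quantization is symmetric with a per-tensor scale, the rounding offset `v` is bounded to [-0.5, 0.5], the gradient passes through the rounding step via a straight-through estimator, and the step size is fixed at `1 / steps`. All function names are hypothetical.

```python
import math
import random


def quantize(w, scale, v, bits=4):
    """Quantize then dequantize w, shifting each weight by its offset v before rounding."""
    qmax = 2 ** (bits - 1) - 1
    qmin = -(2 ** (bits - 1))
    out = []
    for wi, vi in zip(w, v):
        q = math.floor(wi / scale + vi + 0.5)  # round to nearest after the shift
        q = max(qmin, min(qmax, q))            # clip to the integer grid
        out.append(q * scale)
    return out


def output_error(w, w_hat, xs):
    """Mean squared difference between x.w_hat and x.w over the calibration inputs."""
    total = 0.0
    for x in xs:
        diff = sum(xi * (a - b) for xi, a, b in zip(x, w_hat, w))
        total += diff * diff
    return total / len(xs)


def sign(g):
    return (g > 0) - (g < 0)


def sign_round(w, xs, bits=4, steps=400):
    """Tune rounding offsets with signed gradient descent (illustrative sketch)."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(wi) for wi in w) / qmax  # assumed: symmetric per-tensor scale
    lr = 1.0 / steps                         # assumed: fixed step size
    v = [0.0] * len(w)                       # v = 0 is plain round-to-nearest
    best_v = list(v)
    best_loss = output_error(w, quantize(w, scale, v, bits), xs)
    for _ in range(steps):
        w_hat = quantize(w, scale, v, bits)
        # Straight-through estimator: treat d(w_hat_i)/d(v_i) as `scale`.
        grad = [0.0] * len(w)
        for x in xs:
            diff = sum(xi * (a - b) for xi, a, b in zip(x, w_hat, w))
            for i, xi in enumerate(x):
                grad[i] += 2.0 * diff * xi * scale
        # Only the sign of the gradient is used; v stays inside [-0.5, 0.5].
        v = [max(-0.5, min(0.5, vi - lr * sign(g))) for vi, g in zip(v, grad)]
        loss = output_error(w, quantize(w, scale, v, bits), xs)
        if loss < best_loss:
            best_loss, best_v = loss, list(v)
    return best_v, best_loss, scale


# Usage on synthetic data (values are arbitrary, for illustration only).
random.seed(0)
weights = [random.uniform(-1.0, 1.0) for _ in range(8)]
calib = [[random.gauss(0.0, 1.0) for _ in range(8)] for _ in range(16)]
tuned_v, tuned_loss, used_scale = sign_round(weights, calib, steps=50)
rtn_loss = output_error(weights, quantize(weights, used_scale, [0.0] * 8), calib)
```

Because the search starts from `v = 0` and keeps the best offsets seen, `tuned_loss` is never worse than the round-to-nearest baseline `rtn_loss` on the calibration inputs. The offsets only change the stored integer weights, so inference cost is unchanged in this sketch.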