Large Language Models (LLMs) have demonstrated exceptional capabilities on language-related tasks. However, their deployment is challenging because of their considerable memory and storage requirements. Weight-only quantization, particularly at 3 and 4 bits, has emerged as one of the most viable solutions. As the number of bits decreases, the spacing of the quantization grid widens, so the choice between rounding up and rounding down matters more. Previous studies have shown that fine-tuning this choice by adding perturbations can improve accuracy in some scenarios. Our study is motivated by the fact that these perturbations have a precise and limited boundary: only the threshold at which the rounding value changes is significant. We therefore propose a concise and highly effective approach for optimizing weight rounding. Our method, named SignRound, uses lightweight block-wise tuning with signed gradient descent and achieves outstanding results within 400 steps. SignRound outperforms the established rounding-to-nearest (RTN) baseline and competes impressively against recent methods, without introducing additional inference overhead. The source code will be publicly available at https://github.com/intel/neural-compressor soon.
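To make the idea of bounded rounding perturbations concrete, the following is a minimal sketch, not the released implementation. It tunes a per-weight rounding offset in [-0.5, 0.5] with signed gradient descent for a single output channel of a linear layer, rather than for a full transformer block. The function names, the symmetric per-row scale, the constant step size of 1/steps, the straight-through gradient, and the random calibration data are all assumptions made for illustration.

```python
import math
import random

def fake_quant_row(w_row, v_row, scale, qmin, qmax):
    """Quantize-dequantize one weight row with per-weight rounding offsets v."""
    out = []
    for w, v in zip(w_row, v_row):
        q = math.floor(w / scale + v + 0.5)        # round to nearest after the offset
        out.append(scale * max(qmin, min(qmax, q)))  # clip to the integer grid
    return out

def output_error(w_row, wq_row, X):
    """Per-sample output error x . (wq - w) for one output channel."""
    return [sum(xi * (q - w) for xi, w, q in zip(x, w_row, wq_row)) for x in X]

def tune_row(w_row, X, bits=4, steps=200):
    """Signed-gradient tuning of rounding offsets; returns (rtn_loss, best_loss, best_v)."""
    qmax = 2 ** (bits - 1) - 1
    qmin = -(2 ** (bits - 1))
    # Symmetric per-row scale (assumption); fall back to 1.0 for an all-zero row.
    scale = (max(abs(w) for w in w_row) / qmax) or 1.0
    lr = 1.0 / steps                               # constant step size (assumption)
    v = [0.0] * len(w_row)                         # v = 0 reproduces RTN
    rtn_loss = best_loss = best_v = None
    for _ in range(steps + 1):
        wq = fake_quant_row(w_row, v, scale, qmin, qmax)
        err = output_error(w_row, wq, X)
        loss = sum(e * e for e in err)
        if rtn_loss is None:
            rtn_loss = loss                        # first pass is plain RTN
        if best_loss is None or loss < best_loss:
            best_loss, best_v = loss, list(v)      # keep the best offsets seen
        # Straight-through estimator (assumption): treat d(wq)/d(v) as scale > 0,
        # so only the sign of sum_n err[n] * x[n][i] is needed for the update.
        for i in range(len(v)):
            g = sum(e * x[i] for e, x in zip(err, X))
            step = lr * ((g > 0) - (g < 0))
            v[i] = max(-0.5, min(0.5, v[i] - step))  # offsets stay in [-0.5, 0.5]
    return rtn_loss, best_loss, best_v

# Hypothetical calibration data: 32 random inputs for a 16-weight row.
random.seed(0)
w_row = [random.gauss(0.0, 1.0) for _ in range(16)]
X = [[random.gauss(0.0, 1.0) for _ in range(16)] for _ in range(32)]
rtn_loss, tuned_loss, offsets = tune_row(w_row, X, bits=3)
print(rtn_loss, tuned_loss)
```

Because the offsets start at zero and the best candidate is retained, the returned loss can never be worse than the RTN loss on the calibration data. The offsets only change the rounding decision at tuning time, so the resulting weights are ordinary quantized values and inference is unchanged.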