Parameter quantization for Large Language Models (LLMs) has attracted increasing attention recently for reducing memory costs and improving computational efficiency. Although early quantization approaches have been widely adopted, existing methods suffer from poor performance in low-bit (e.g., 2- to 3-bit) scenarios. In this paper, we present a novel and effective Column-Level Adaptive weight Quantization (CLAQ) framework that introduces three types of adaptive strategies for LLM quantization. First, we propose a K-Means clustering based algorithm that dynamically generates quantization centroids for each column of a parameter matrix. Second, we design an outlier-guided adaptive precision search strategy that dynamically assigns varying bit-widths to different columns. Finally, we develop a dynamic outlier reservation scheme that retains a small fraction of parameters in their original floating-point precision in exchange for improved model performance. Experiments on mainstream open-source LLMs including LLaMA-1, LLaMA-2 and Yi demonstrate that our method achieves state-of-the-art results across different bit settings, especially in extremely low-bit scenarios. Code will be released soon.
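The three strategies above can be illustrated in miniature with NumPy. This is a hypothetical sketch, not the authors' released code: the per-column kurtosis heuristic for choosing bit-widths and the 1% outlier-reservation fraction are assumptions made for demonstration, and the K-Means here is a plain 1-D Lloyd's iteration.

```python
import numpy as np

def kmeans_1d(values, k, iters=20, seed=0):
    """Plain 1-D Lloyd's K-Means: returns k centroids for a weight vector."""
    rng = np.random.default_rng(seed)
    centroids = rng.choice(values, size=k, replace=False)
    for _ in range(iters):
        # Assign each value to its nearest centroid, then recompute means.
        idx = np.abs(values[:, None] - centroids[None, :]).argmin(axis=1)
        for j in range(k):
            members = values[idx == j]
            if members.size:
                centroids[j] = members.mean()
    return np.sort(centroids)

def quantize_column(col, bits, outlier_frac=0.0):
    """Quantize one column to 2**bits centroids; optionally keep the
    largest-magnitude fraction of entries in full precision
    (a toy stand-in for the paper's outlier reservation)."""
    col = col.astype(np.float64)
    keep = np.zeros(col.shape, dtype=bool)
    if outlier_frac > 0:
        n_keep = max(1, int(outlier_frac * col.size))
        keep[np.argsort(np.abs(col))[-n_keep:]] = True
    centroids = kmeans_1d(col[~keep], 2 ** bits)
    codes = np.abs(col[:, None] - centroids[None, :]).argmin(axis=1)
    deq = centroids[codes]
    deq[keep] = col[keep]  # reserved outliers stay in float precision
    return deq

def quantize_matrix(W, base_bits=2, hi_bits=3, hi_quantile=0.75):
    """Column-level adaptive precision: columns whose value distribution is
    heavier-tailed (higher kurtosis, an assumed proxy for outlier presence)
    get the larger bit-width."""
    kurt = ((W - W.mean(0)) ** 4).mean(0) / (W.var(0) ** 2 + 1e-12)
    thresh = np.quantile(kurt, hi_quantile)
    out = np.empty_like(W, dtype=np.float64)
    for j in range(W.shape[1]):
        bits = hi_bits if kurt[j] > thresh else base_bits
        out[:, j] = quantize_column(W[:, j], bits, outlier_frac=0.01)
    return out
```

With a random 64x8 weight matrix, each dequantized column contains at most `2**bits` distinct centroid values plus the handful of reserved full-precision outliers, and the mean squared reconstruction error is well below the variance of the original weights.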