Model quantization uses low bit-width values to represent the weight matrices of existing models, and is a promising approach to reducing both the storage and computational overheads of deploying LLMs. However, existing quantization methods suffer severe performance degradation when the bit-width is extremely reduced, and therefore mainly focus on quantizing models with 4-bit or 8-bit values. This paper boldly quantizes the weight matrices of LLMs to 1-bit, paving the way for extremely low bit-width deployment of LLMs. To this end, we introduce a 1-bit model compression framework named OneBit, comprising a novel 1-bit parameter representation method that better quantizes LLMs and an effective parameter initialization method based on matrix decomposition that improves the convergence speed of the quantization framework. Extensive experimental results show that OneBit achieves good performance (at least 81% of the non-quantized performance on LLaMA models) with a robust training process when using only 1-bit weight matrices.
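The core idea can be sketched as follows: each weight matrix is reduced to a sign matrix (the 1-bit part) plus two full-precision value vectors, and initialization decomposes the magnitude matrix into a rank-1 outer product. This is a minimal illustrative sketch, not the paper's exact implementation; the function names are hypothetical, and the rank-1 SVD initialization is an assumption standing in for the decomposition described in the abstract.

```python
import numpy as np

def onebit_decompose(W):
    """Sketch of a 1-bit weight representation: keep only the sign
    matrix plus a row-wise and a column-wise value vector.
    Initialization here uses the best rank-1 approximation of |W|
    (via SVD), an assumption for illustration."""
    S = np.sign(W)                                  # 1-bit part: -1 / 0 / +1
    # Rank-1 approximation of the magnitudes |W| ~= g h^T
    U, sigma, Vt = np.linalg.svd(np.abs(W), full_matrices=False)
    g = np.sqrt(sigma[0]) * U[:, 0]                 # row-wise scale vector
    h = np.sqrt(sigma[0]) * Vt[0, :]                # column-wise scale vector
    return S, g, h

def onebit_reconstruct(S, g, h):
    """Approximate the original matrix: W_hat = diag(g) @ S @ diag(h)."""
    return g[:, None] * S * h[None, :]

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 16)).astype(np.float32)
S, g, h = onebit_decompose(W)
W_hat = onebit_reconstruct(S, g, h)
```

Storing `S` at 1 bit per entry plus the two short vectors is what drives the compression: for an m-by-n matrix, the cost drops from 16mn bits (FP16) to roughly mn + 16(m + n) bits.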