Large language models (LLMs) have significantly advanced natural language processing, but their high memory and computation costs impede practical deployment. Quantization is one of the most effective methods for improving the computational efficiency of LLMs. However, existing ultra-low-bit quantization methods cause severe accuracy drops. In this paper, we empirically reveal the micro- and macro-level characteristics of ultra-low-bit quantization and present a novel dual-binarization method for LLMs, namely DB-LLM. At the micro level, we take both the accuracy advantage of 2-bit width and the efficiency advantage of binarization into account, introducing Flexible Dual Binarization (FDB). By splitting 2-bit quantized weights into two independent sets of binary weights, FDB preserves the accuracy of the representation and introduces flexibility, exploiting the efficient bitwise operations of binarization while retaining the inherent high sparsity of ultra-low-bit quantization. At the macro level, we find that the predictions of an LLM are distorted after quantization, and that this distortion takes the form of deviations related to the ambiguity of samples. We therefore propose Deviation-Aware Distillation (DAD), which enables the model to focus differently on different samples. Comprehensive experiments show that DB-LLM not only significantly surpasses the current state of the art (SOTA) in ultra-low-bit quantization (e.g., perplexity decreases from 9.64 to 7.23), but also achieves an additional 20% reduction in computational consumption compared to the SOTA method at the same bit-width. Our code will be released soon.
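The idea of splitting 2-bit quantized weights into two sets of binary weights can be illustrated with a minimal sketch. This is not the DB-LLM implementation: it assumes plain uniform asymmetric 2-bit quantization and a simple bit-plane split, and every name in it (`quantize_2bit`, `split_into_binaries`, `reconstruct`, `alpha_high`, `alpha_low`) is hypothetical. It only shows that a 2-bit code can be rewritten exactly as a weighted sum of two {0, 1} tensors, each with its own scale.

```python
from typing import List, Tuple


def quantize_2bit(weights: List[float]) -> Tuple[List[int], float, float]:
    """Uniform asymmetric 2-bit quantization (an assumption, not the paper's scheme).

    Returns integer codes in {0, 1, 2, 3}, the step size, and the offset.
    """
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 3 or 1.0  # 3 = 2**2 - 1 steps; fall back to 1.0 for constant input
    codes = [min(3, max(0, round((w - lo) / scale))) for w in weights]
    return codes, scale, lo


def split_into_binaries(codes: List[int]) -> Tuple[List[int], List[int]]:
    """Split each 2-bit code into its high and low bit, giving two {0, 1} lists."""
    high = [(q >> 1) & 1 for q in codes]
    low = [q & 1 for q in codes]
    return high, low


def reconstruct(high: List[int], low: List[int],
                alpha_high: float, alpha_low: float, offset: float) -> List[float]:
    """Rebuild weights as alpha_high * high + alpha_low * low + offset."""
    return [alpha_high * h + alpha_low * l + offset for h, l in zip(high, low)]


weights = [-0.9, -0.2, 0.1, 0.8, 0.3, -0.6]
codes, scale, offset = quantize_2bit(weights)
high, low = split_into_binaries(codes)

# With alpha_high = 2 * scale and alpha_low = scale, the dual-binary form
# reproduces the ordinary 2-bit dequantization scale * code + offset.
dual = reconstruct(high, low, 2 * scale, scale, offset)
direct = [scale * q + offset for q in codes]
max_gap = max(abs(a - b) for a, b in zip(dual, direct))
print("max |dual - direct| =", max_gap)
```

In this sketch the two scales start out tied (`2 * scale` and `scale`). One plausible reading of the "flexibility" mentioned above is that the two scales are then allowed to differ from this tied initialization, for example by being tuned separately, but that step is an assumption and is not shown here.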