This paper considers the problem of converting a given dense linear layer to low precision. The tradeoff between compressed length and output discrepancy is analyzed information-theoretically (IT). It is shown that the popular GPTQ algorithm may have an arbitrarily large gap to the IT limit. To alleviate this problem, a novel algorithm, termed "WaterSIC", is proposed and is shown to be within a rate gap of 0.255 bits of the IT limit, uniformly over all possible covariance matrices of input activations. The key innovation of WaterSIC is to allocate different quantization rates to different columns (in-features) of the weight matrix, mimicking the classical IT solution known as "waterfilling". Applying WaterSIC to the Llama and Qwen families of LLMs establishes new state-of-the-art performance for all quantization rates from 1 to 4 bits. Our code is available at https://github.com/egorlifar/watersic.
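For readers unfamiliar with the classical solution referred to above, the following is a minimal sketch of textbook reverse waterfilling for independent Gaussian components under mean-squared-error distortion. It is not the WaterSIC algorithm and is not taken from the authors' repository; the function name `reverse_waterfill`, its arguments, and the example variances are assumptions made purely for illustration. A common water level `theta` is found so that the per-component distortions `min(theta, variance)` sum to the distortion budget, and each component then receives `max(0, 0.5 * log2(variance / theta))` bits.

```python
import math


def reverse_waterfill(variances, total_distortion):
    """Textbook reverse waterfilling (illustrative sketch, not WaterSIC).

    variances: positive variances of independent Gaussian components.
    total_distortion: total MSE budget, with 0 < budget <= sum(variances).
    Returns (theta, rates), where theta is the water level and rates[i]
    is the number of bits assigned to component i.
    """
    if any(v <= 0 for v in variances):
        raise ValueError("all variances must be positive")
    if not 0 < total_distortion <= sum(variances):
        raise ValueError("distortion budget must lie in (0, sum(variances)]")

    # Bisection on the water level: sum(min(theta, v)) is non-decreasing
    # in theta, equals 0 at theta = 0 and sum(variances) at max(variances).
    lo, hi = 0.0, max(variances)
    for _ in range(100):
        theta = (lo + hi) / 2
        if sum(min(theta, v) for v in variances) < total_distortion:
            lo = theta
        else:
            hi = theta
    theta = (lo + hi) / 2

    # Components with variance below the water level get zero rate.
    rates = [max(0.0, 0.5 * math.log2(v / theta)) for v in variances]
    return theta, rates


if __name__ == "__main__":
    theta, rates = reverse_waterfill([4.0, 1.0, 0.25], total_distortion=0.75)
    print(f"water level: {theta:.4f}")
    print("rates (bits):", [round(r, 4) for r in rates])
```

In this toy example the water level settles at about 0.25, so the three components receive roughly 2, 1, and 0 bits: high-variance directions get more rate and the weakest direction is dropped. This unequal allocation is the behaviour that the abstract says WaterSIC mimics across columns of the weight matrix.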