This paper considers the problem of converting a given dense linear layer to low precision. The tradeoff between compressed length and output discrepancy is analyzed information theoretically (IT). It is shown that the popular GPTQ algorithm can have an arbitrarily large gap to the IT limit. To alleviate this problem, a novel algorithm, termed ''WaterSIC'', is proposed and is shown to be within a rate gap of 0.255 bits of the IT limit, uniformly over all possible covariance matrices of input activations. The key innovation of WaterSIC is to allocate different quantization rates to different columns (in-features) of the weight matrix, mimicking the classical IT solution known as ''waterfilling''. Applying WaterSIC to the Llama and Qwen families of LLMs establishes new state-of-the-art performance for all quantization rates from 1 to 4 bits.
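To illustrate the waterfilling idea referenced above, the following is a minimal sketch of classical reverse waterfilling for allocating rates across parallel Gaussian sources; here the per-column input-activation variances play the role of the source variances. The function name, interface, and bisection approach are illustrative assumptions, not the paper's actual WaterSIC implementation.

```python
import math

def waterfill_rates(variances, total_bits, iters=60):
    """Allocate per-column rates R_i = max(0, 0.5*log2(var_i / theta)) so that
    sum(R_i) == total_bits, via bisection on the water level theta.
    (Columns with variance below theta get zero bits.)"""
    lo, hi = 0.0, max(variances)
    for _ in range(iters):
        theta = 0.5 * (lo + hi)
        used = sum(max(0.0, 0.5 * math.log2(v / theta))
                   for v in variances if v > 0)
        if used > total_bits:
            lo = theta  # water level too low: too many bits spent, raise it
        else:
            hi = theta  # water level too high: bits to spare, lower it
    theta = 0.5 * (lo + hi)
    return [max(0.0, 0.5 * math.log2(v / theta)) if v > 0 else 0.0
            for v in variances]

# High-variance columns receive more bits; low-variance ones may get none.
rates = waterfill_rates([4.0, 1.0, 0.25, 0.01], total_bits=3.0)
print([round(r, 2) for r in rates])  # → [2.0, 1.0, 0.0, 0.0]
```

This mirrors the rate-distortion solution for parallel Gaussian sources, where a single water level determines how the bit budget is split across components of unequal variance.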