This is the second part of a work investigating quantized matrix multiplication (MatMul). In Part I we considered calibration-free quantization, whereas here we discuss the setting where the covariance matrix $Σ_X$ of the columns of the second factor is available. This setting arises in the ubiquitous task of weight-only post-training quantization of LLMs. Weight-only quantization is related to the problem of weighted mean squared error (WMSE) source coding, whose classical (reverse) waterfilling solution dictates how rate should be distributed among the coordinates of the vector. We show how waterfilling can be used to improve practical LLM quantization algorithms (GPTQ), which at present allocate rate equally. We analyze a recent scheme (known as ``WaterSIC'') that uses only scalar INT quantizers and show that its high-rate performance is (a) basis-free (i.e., characterized by the determinant of $Σ_X$ and thus, unlike that of existing schemes, unaffected by random rotations); and (b) within a multiplicative factor of $\frac{2πe}{12}$ (or 0.25 bit/entry) of the information-theoretic distortion limit. GPTQ's performance, in turn, is affected by the choice of basis, but for a random rotation and actual $Σ_X$ from Llama-3-8B we find it to be within 0.1 bit (depending on the layer type) of WaterSIC, suggesting that GPTQ with random rotation is also near-optimal, at least in the high-rate regime.
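To make the rate-allocation idea concrete, the following is a minimal sketch of classical reverse waterfilling. It is an illustration and not the WaterSIC or GPTQ algorithm. It assumes a Gaussian source whose independent components have variances equal to the eigenvalues of $Σ_X$ (our assumption for how the WMSE problem reduces to the classical one) and rates measured in bits. The names `reverse_waterfill` and `GAP_BITS` are hypothetical.

```python
import numpy as np

# Rate gap implied by the distortion factor 2*pi*e/12 quoted above:
# 0.5 * log2(2*pi*e/12) is roughly 0.25 bit per entry.
GAP_BITS = 0.5 * np.log2(2 * np.pi * np.e / 12)


def reverse_waterfill(eigvals, rate_per_coord, iters=100):
    """Reverse waterfilling for independent Gaussian components.

    eigvals        : component variances (assumed: eigenvalues of Sigma_X)
    rate_per_coord : average rate budget in bits per coordinate
    Returns (water level theta, per-coordinate rates, per-coordinate distortions).
    """
    lam = np.asarray(eigvals, dtype=float)
    target = lam.size * rate_per_coord  # total rate budget in bits

    def rates_at(theta):
        # Components with variance below the water level get zero rate.
        return np.maximum(0.0, 0.5 * np.log2(lam / theta))

    # Total rate is decreasing in theta, so bisect on the water level.
    lo, hi = 0.0, lam.max()
    for _ in range(iters):
        theta = 0.5 * (lo + hi)
        if rates_at(theta).sum() > target:
            lo = theta  # spending too much rate: raise the water level
        else:
            hi = theta  # within budget: lower the water level
    theta = 0.5 * (lo + hi)
    return theta, rates_at(theta), np.minimum(theta, lam)
```

At high rate every component is active and the water level equals $\det(Σ_X)^{1/n} 2^{-2R}$, which depends on $Σ_X$ only through its determinant; this is the sense in which the limit is basis-free. At low rate the weakest components receive no rate at all, in contrast to an equal allocation.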