This is the second part of a work investigating quantized matrix multiplication (MatMul). Part I considered calibration-free quantization, whereas here we discuss the setting where the covariance matrix $\Sigma_X$ of the columns of the second factor is available. This setting arises in the ubiquitous task of weight-only post-training quantization of LLMs. Weight-only quantization is related to the problem of weighted mean squared error (WMSE) source coding, whose classical (reverse) waterfilling solution dictates how rate should be distributed among the coordinates of the vector. We show how waterfilling can be used to improve practical LLM quantization algorithms (GPTQ), which at present allocate rate equally. A recent scheme (known as ``WaterSIC'') that uses only scalar INT quantizers is analyzed, and its high-rate performance is shown to be (a) basis free (i.e., characterized by the determinant of $\Sigma_X$ and thus, unlike existing schemes, immune to random rotations); and (b) within a multiplicative factor of $\frac{2\pi e}{12}$ (equivalently, about 0.25 bit/entry) of the information-theoretic distortion limit. GPTQ's performance, in turn, is affected by the choice of basis, but for a random rotation and actual $\Sigma_X$ from Llama-3-8B we find it to be within 0.1 bit (depending on the layer type) of WaterSIC, suggesting that GPTQ with random rotation is also near optimal, at least in the high-rate regime.
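The classical reverse waterfilling allocation and the $\frac{2\pi e}{12}$ gap mentioned above can be illustrated with a minimal sketch. It assumes independent Gaussian components whose variances stand in for the eigenvalues of $\Sigma_X$; the function name `reverse_waterfilling` and the constant `SCALAR_GAP_BITS` are hypothetical and chosen only for this illustration. This is the textbook allocation, not the WaterSIC or GPTQ algorithm itself.

```python
import math


def reverse_waterfilling(variances, total_rate_bits):
    """Textbook reverse waterfilling for independent Gaussian components.

    Finds the water level theta such that
        sum_i max(0, 0.5 * log2(var_i / theta)) == total_rate_bits,
    then returns (theta, per-component rates, per-component distortions).
    Components with variance below theta receive zero rate.
    """
    def rate_at(theta):
        return sum(0.5 * math.log2(v / theta) for v in variances if v > theta)

    # Bisection on the water level: the total rate decreases as theta grows.
    lo, hi = 0.0, max(variances)
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if rate_at(mid) > total_rate_bits:
            lo = mid  # too much rate spent, raise the water level
        else:
            hi = mid  # too little rate spent, lower the water level
    theta = 0.5 * (lo + hi)

    rates = [max(0.0, 0.5 * math.log2(v / theta)) for v in variances]
    distortions = [min(v, theta) for v in variances]
    return theta, rates, distortions


# Rate penalty (bits per entry) corresponding to a multiplicative
# distortion factor of 2*pi*e/12 at high rate: 0.5 * log2(2*pi*e/12).
SCALAR_GAP_BITS = 0.5 * math.log2(2 * math.pi * math.e / 12)

# Illustrative variances (not taken from any real model).
theta, rates, distortions = reverse_waterfilling([4.0, 1.0, 0.25], 6.0)
```

With the illustrative variances 4, 1 and 0.25 and a budget of 6 bits, all components are active, the water level is $1/16$, and the rates come out as 3, 2 and 1 bits, so the total distortion $3/16$ depends on the variances only through their product. `SCALAR_GAP_BITS` evaluates to roughly 0.25, matching the figure quoted above.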