This work investigates quantized matrix multiplication (MatMul), which has become crucial for the efficient deployment of large language models (LLMs). We consider two settings: 1) generic MatMul, where both matrices must be quantized (weight+activation quantization); and 2) weight-only quantization, where the second matrix is known only through the covariance matrix $\Sigma_X$ of its columns. For each setting, we first review the fundamental information-theoretic tradeoff between quantization rate and distortion (high-rate theory), and then analyze the performance of several popular quantization schemes against these fundamental limits. Specifically, we discuss the rate loss (relative to the information-theoretic optimum) of absmax INT and floating-point (FP) quantization, for which we also derive remarkably accurate heuristic approximations. Weight-only quantization is related to the problem of weighted mean squared error (WMSE) source coding, whose classical (reverse) waterfilling solution dictates how rate should be distributed across the coordinates of the vector. We show how waterfilling can be used to improve practical LLM quantization algorithms (GPTQ), which at present allocate rate equally. The new scheme (termed ``WaterSIC'') uses only scalar INT quantizers, yet its high-rate performance is basis-free: it depends only on the determinant of $\Sigma_X$ and is thus, unlike existing schemes, immune to random rotations, and it lies within a multiplicative factor of $\frac{2\pi e}{12}$ (or 0.25 bit/entry) of the information-theoretic distortion limit. GPTQ's performance does depend on the choice of basis, but for a random rotation and the actual $\Sigma_X$ from Llama-3-8B we find GPTQ to be within 0.1 bit (depending on the layer type) of WaterSIC, suggesting that GPTQ with a random rotation is also near optimal in the high-rate regime.
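The reverse-waterfilling allocation referenced above can be sketched numerically. The following is an illustrative sketch (not the paper's implementation): for parallel Gaussian sources with variances $\sigma_i^2$ (e.g. eigenvalues of $\Sigma_X$), each coordinate gets distortion $D_i = \min(\theta, \sigma_i^2)$ and rate $R_i = \frac{1}{2}\log_2(\sigma_i^2/D_i)$, with the water level $\theta$ found by bisection so the rates sum to the total budget. The function name and interface are hypothetical.

```python
import numpy as np

def reverse_waterfill(variances, total_rate):
    """Illustrative reverse-waterfilling rate allocation (hypothetical helper).

    Each coordinate with variance sigma_i^2 receives distortion
    D_i = min(theta, sigma_i^2) and rate R_i = 0.5*log2(sigma_i^2 / D_i).
    The water level theta is found by bisection so that sum(R_i) equals
    total_rate (in bits). Coordinates with sigma_i^2 <= theta get zero rate.
    """
    v = np.asarray(variances, dtype=float)
    lo, hi = 0.0, float(v.max())  # theta lies in (0, max variance]
    for _ in range(200):  # total rate is decreasing in theta; bisect
        theta = 0.5 * (lo + hi)
        rates = 0.5 * np.log2(v / np.minimum(theta, v))
        if rates.sum() > total_rate:
            lo = theta  # too much rate: raise the water level
        else:
            hi = theta  # too little rate: lower the water level
    return rates
```

With equal variances the allocation is uniform (matching GPTQ's equal split), while spread-out variances concentrate rate on the high-variance coordinates, which is the gap WaterSIC exploits.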