We study entrywise scalar quantization of two matrices prior to multiplication. Given $A\in\mathbb{R}^{m\times k}$ and $B\in\mathbb{R}^{k\times n}$, we quantize the entries of $A$ and $B$ independently using scalar quantizers with $K_X$ and $K_Y$ levels per entry, respectively, and form $\widehat C=\widehat A\,\widehat B$. The objective is to minimize the matrix multiplication mean-squared error (MSE) $\mathcal{E}=\mathbb{E}\bigl[\|AB-\widehat A\widehat B\|_F^2\bigr]$ under a pair-i.i.d.\ inner-product model. In the high-resolution regime $K_X,K_Y\to\infty$, we derive a sharp $K^{-2}$ asymptotic expansion for $\mathcal{E}$, identify the exact optimal leading constants, and characterize asymptotically optimal quantization center densities in terms of conditional second moments. We then specialize to correlated Gaussian multiplicative pairs, obtaining a closed-form optimal point density \[ \lambda^\star(u)\ \propto\ \exp\!\left(-\frac{u^2}{6}\right)\bigl((1-\rho^2)+\rho^2u^2\bigr)^{1/3}, \qquad u=\frac{x}{\sigma_X}, \] with the same form for $y/\sigma_Y$, and prove a correlation-driven phase transition: the density is unimodal with its peak at the origin for $|\rho|\leq 1/\sqrt{3}$ and becomes bimodal for $|\rho|>1/\sqrt{3}$, with peaks at $u_{\mathrm{peak}}=\pm\sqrt{3-1/\rho^2}$. We demonstrate the method on synthetic experiments, including quantized matrix multiplication and least-squares optimization, and on quantization of key and query activations in large language models.
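The phase transition stated above can be checked numerically. The following is a minimal sketch, not the authors' code: it evaluates the unnormalized density $\lambda^\star(u)$ on a grid and compares the grid maximizer with the closed-form peak $\sqrt{3-1/\rho^2}$. The function names, grid range, and grid resolution are illustrative assumptions.

```python
import numpy as np


def unnormalized_density(u, rho):
    """Unnormalized lambda*(u) from the abstract, with u = x / sigma_X."""
    return np.exp(-u**2 / 6.0) * ((1.0 - rho**2) + rho**2 * u**2) ** (1.0 / 3.0)


def predicted_peak(rho):
    """Closed-form nonnegative peak location; 0 in the unimodal regime."""
    if abs(rho) <= 1.0 / np.sqrt(3.0):
        return 0.0
    return float(np.sqrt(3.0 - 1.0 / rho**2))


def numerical_peak(rho, u_max=6.0, num=60001):
    """Grid argmax over u >= 0 (the density is symmetric in u).

    u_max and num are illustrative choices, not values from the source.
    """
    u = np.linspace(0.0, u_max, num)
    return float(u[np.argmax(unnormalized_density(u, rho))])


if __name__ == "__main__":
    for rho in (0.0, 0.5, 0.8, 1.0):
        print(f"rho={rho:.2f}  grid peak={numerical_peak(rho):.4f}  "
              f"closed form={predicted_peak(rho):.4f}")
```

The closed form follows from setting the derivative of $\log\lambda^\star$ to zero: $-u/3+\tfrac{2}{3}\rho^2u/\bigl((1-\rho^2)+\rho^2u^2\bigr)=0$ gives either $u=0$ or $u^2=3-1/\rho^2$, and the second solution is real and nonzero only when $|\rho|>1/\sqrt{3}$.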