Fast computation of the matrix product $W^\top X$ is a workhorse of modern LLMs. To make deployment more efficient, a popular approach is to use a low-precision approximation $\widehat W$ in place of the true $W$ (``weight-only quantization''). Information theory shows that the optimal algorithm for reducing the precision of $W$ depends on the (second-order) statistics of $X$ and requires careful alignment of the vector quantization codebook with the PCA directions of $X$ (a process known as ``waterfilling allocation''). Dependence of the codebook on the statistics of $X$, however, is highly impractical. This paper proves that there exists a universal codebook that is simultaneously near-optimal for all possible statistics of $X$, in the sense of being at least as good as an $X$-adapted waterfilling codebook with rate reduced by 0.11 bit per dimension. Such a universal codebook would be an ideal candidate for a low-precision storage format, a topic of active current research, but alas the existence proof is non-constructive. Equivalently, our result shows the existence of a net in $\mathbb{R}^n$ that is a nearly optimal covering of a sphere simultaneously with respect to all Hilbert norms.
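To make the notion of waterfilling allocation concrete, the following is a minimal sketch of the classical reverse waterfilling rule for a Gaussian source with independent components. It rests on assumptions that are not stated in the abstract: the entries of $W$ are treated as i.i.d. unit-variance Gaussians, and the hypothetical input `eigvals` holds the eigenvalues of the second-moment matrix of $X$, so that each PCA direction acts as a Gaussian component whose variance is the corresponding eigenvalue. The function name `reverse_waterfill` and its parameters are illustrative only and do not come from the paper.

```python
import math


def reverse_waterfill(eigvals, total_rate_bits, iters=200):
    """Classical reverse waterfilling (illustrative sketch, not the paper's method).

    eigvals: positive variances along each PCA direction (assumed input).
    total_rate_bits: total bit budget summed over all directions.
    Returns (theta, rates, distortions), where theta is the water level.
    """
    lam = [float(v) for v in eigvals]
    if not lam or min(lam) <= 0.0:
        raise ValueError("eigvals must be a non-empty list of positive numbers")
    if total_rate_bits < 0:
        raise ValueError("total_rate_bits must be non-negative")

    def rate(theta):
        # Directions with variance below the water level receive zero bits.
        return sum(0.5 * math.log2(max(v / theta, 1.0)) for v in lam)

    # rate(theta) is non-increasing in theta: it is 0 at theta = max(lam)
    # and grows without bound as theta -> 0, so bisection applies.
    lo, hi = 0.0, max(lam)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if rate(mid) > total_rate_bits:
            lo = mid  # spending too many bits: raise the water level
        else:
            hi = mid  # within budget: try a lower water level
    theta = hi

    distortions = [min(v, theta) for v in lam]
    rates = [0.5 * math.log2(v / d) for v, d in zip(lam, distortions)]
    return theta, rates, distortions
```

The allocation changes whenever `eigvals` changes, which is the dependence on the statistics of $X$ that a universal codebook is meant to avoid.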