Fast computation of a matrix product $W^\top X$ is a workhorse of modern LLMs. To make their deployment more efficient, a popular approach is to use a low-precision approximation $\widehat W$ in place of the true $W$ (``weight-only quantization''). Information theory shows that an optimal algorithm for reducing the precision of $W$ depends on the (second-order) statistics of $X$ and requires careful alignment of the vector quantization codebook with the PCA directions of $X$ (a process known as ``waterfilling allocation''). Dependence of the codebook on the statistics of $X$, however, is highly impractical. This paper proves that there exists a universal codebook that is simultaneously near-optimal for all possible statistics of $X$, in the sense of being at least as good as an $X$-adapted waterfilling codebook with rate reduced by 0.11 bit per dimension in the case when $W$ is Gaussian. Such a universal codebook would be an ideal candidate for a low-precision storage format, a topic of active research, but alas the existence proof is non-constructive. Equivalently, our result shows the existence of a net in $\mathbb{R}^n$ that is a nearly optimal covering of a sphere simultaneously with respect to all Hilbert norms.
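For readers unfamiliar with the $X$-adapted baseline, the following is a minimal sketch of the classical reverse waterfilling allocation for independent Gaussian components under squared-error distortion. It is an illustration under stated assumptions, not the construction used in this paper: we assume the entries of $W$ are i.i.d. Gaussian with variance $\sigma^2$ and that the distortion is measured in the eigenbasis of the covariance of $X$ with eigenvalues $\lambda_i$, so that the effective component variances are $\lambda_i \sigma^2$. The function name `reverse_waterfilling` and its parameters are hypothetical.

```python
import numpy as np

def reverse_waterfilling(variances, rate_bits):
    """Classical reverse waterfilling for independent Gaussian components.

    variances : strictly positive effective variances (assumed here to be
                lambda_i * sigma^2, see the lead-in; this is an assumption)
    rate_bits : total rate budget in bits, summed over all components
    Returns (theta, distortions), where theta is the water level and
    distortions[i] = min(theta, variances[i]).
    """
    v = np.asarray(variances, dtype=float)

    def total_rate(theta):
        # Components with variance below the water level get zero rate.
        d = np.minimum(theta, v)
        return 0.5 * np.sum(np.log2(v / d))

    # total_rate is non-increasing in theta, so bisection finds the level.
    lo, hi = 0.0, float(v.max())
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if total_rate(mid) > rate_bits:
            lo = mid  # rate too high: raise the water level
        else:
            hi = mid  # rate within budget: try a lower level
    theta = hi
    return theta, np.minimum(theta, v)

theta, dist = reverse_waterfilling([4.0, 0.25], rate_bits=1.0)
print(theta, dist)
```

In this example the second component has variance below the water level, so it receives no bits and its distortion equals its variance. The allocation depends on the spectrum of $X$, which is exactly the dependence a universal codebook would avoid.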