Vector quantization, a problem rooted in Shannon's source coding theory, aims to quantize high-dimensional Euclidean vectors while minimizing distortion in their geometric structure. We propose TurboQuant to address both mean-squared error (MSE) and inner product distortion, overcoming limitations of existing methods that fail to achieve optimal distortion rates. Our data-oblivious algorithms, suitable for online applications, achieve near-optimal distortion rates (within a small constant factor) across all bit-widths and dimensions. TurboQuant achieves this by randomly rotating input vectors, inducing a concentrated Beta distribution on coordinates, and leveraging the near-independence of distinct coordinates in high dimensions to simply apply an optimal scalar quantizer to each coordinate. Recognizing that MSE-optimal quantizers introduce bias in inner product estimation, we propose a two-stage approach: applying an MSE quantizer followed by a 1-bit Quantized JL (QJL) transform on the residual, resulting in an unbiased inner product quantizer. We also provide a formal proof of the information-theoretic lower bound on the best distortion rate achievable by any vector quantizer, demonstrating that TurboQuant closely matches this bound, differing only by a small constant ($\approx 2.7$) factor. Experimental results validate our theoretical findings, showing that for KV cache quantization, we achieve absolute quality neutrality with 3.5 bits per channel and marginal quality degradation with 2.5 bits per channel. Furthermore, in nearest neighbor search tasks, our method outperforms existing product quantization techniques in recall while reducing indexing time to virtually zero.
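The two-stage pipeline described above can be sketched in NumPy. This is a minimal illustration under simplifying assumptions, not the authors' implementation: a Haar-random rotation stands in for the paper's fast rotation, a uniform scalar quantizer stands in for the MSE-optimal scalar quantizer matched to the induced Beta distribution, and the residual is encoded with a 1-bit sign projection whose estimator uses the Gaussian identity $\mathbb{E}[\operatorname{sign}(\langle s, r\rangle)\langle s, u\rangle] = \sqrt{2/\pi}\,\langle u, r/\|r\|\rangle$ to debias the inner product.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_rotation(d):
    # Haar-random orthogonal matrix via QR of a Gaussian matrix
    # (sign fix on R's diagonal makes the distribution uniform).
    A = rng.standard_normal((d, d))
    Q, R = np.linalg.qr(A)
    return Q * np.sign(np.diag(R))

def stage1_quantize(y, bits):
    # Per-coordinate uniform B-bit scalar quantizer on the rotated vector
    # (a stand-in for the MSE-optimal scalar quantizer in the paper).
    levels = 2 ** bits
    lo, hi = y.min(), y.max()
    scale = (hi - lo) / (levels - 1)
    codes = np.round((y - lo) / scale)
    return codes * scale + lo  # dequantized values

d, m, bits = 64, 4096, 3
R = random_rotation(d)
S = rng.standard_normal((m, d))   # QJL projection directions (illustrative m)

x = rng.standard_normal(d); x /= np.linalg.norm(x)   # data vector
q = rng.standard_normal(d); q /= np.linalg.norm(q)   # query vector

y = R @ x
y1 = stage1_quantize(y, bits)     # stage 1: MSE-style quantizer (biased)
r = y - y1                        # residual
signs = np.sign(S @ r)            # stage 2: 1-bit QJL code of the residual
r_norm = np.linalg.norm(r)        # stored alongside the code

# Inner-product estimate: <q, x> = <Rq, y1> + <Rq, r>, with the residual
# term recovered (unbiasedly, in expectation over S) from the 1-bit code.
qr = R @ q
est = qr @ y1 + np.sqrt(np.pi / 2) * r_norm * (signs @ (S @ qr)) / m
true = q @ x
```

The first term uses only the stage-1 code; the second debiases it with the residual's 1-bit sketch, so the combined estimator tracks the true inner product closely even at a low bit-width.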