Binary quantization (BQ) compresses high-dimensional embeddings to one or two bits per coordinate, enabling extremely fast nearest-neighbor search. Yet a puzzle persists: BQ achieves competitive recall on contrastive embeddings but fails on others, and two leading systems adopt diametrically opposite strategies (random rotation vs. preserving coordinate axes) with no common theory explaining when each is appropriate. We resolve this puzzle by connecting the Gaussian structure recently established for InfoNCE-trained representations to a complete analytical framework for BQ quality. The central insight is that coordinate heterogeneity, the non-uniformity of per-coordinate variances, governs the main aspects of BQ performance. We derive closed-form expressions for ranking fidelity, prove that the magnitude bit carries information proportional to heterogeneity, and show that random rotation destroys precisely the signal that one paradigm exploits while creating the isotropy that the other requires. A two-parameter scaling law predicts fidelity across models and dimensions. Experiments on 13 datasets and 6 embedding families validate all predictions and provide the first principled design guide for binary quantization systems.
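As a rough illustration of the quantities named in the abstract, the sketch below builds synthetic Gaussian vectors with uneven per-coordinate scales, applies sign-bit quantization, and compares a heterogeneity score before and after a random rotation. Everything here is an assumption made for illustration: the function names, the synthetic data, and in particular the use of the coefficient of variation of per-coordinate variances as a stand-in for coordinate heterogeneity. It is not the paper's estimator or either system's implementation.

```python
import math
import random


def sign_bits(vec):
    """1-bit quantization: keep only the sign of each coordinate."""
    return [1 if x >= 0.0 else 0 for x in vec]


def hamming(a, b):
    """Number of positions where two bit vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def coordinate_variances(vectors):
    """Per-coordinate variance across a list of equal-length vectors."""
    n, d = len(vectors), len(vectors[0])
    means = [sum(v[j] for v in vectors) / n for j in range(d)]
    return [sum((v[j] - means[j]) ** 2 for v in vectors) / n for j in range(d)]


def heterogeneity(vectors):
    """Coefficient of variation of per-coordinate variances.

    Assumed proxy for "coordinate heterogeneity"; not the paper's definition.
    """
    var = coordinate_variances(vectors)
    mean = sum(var) / len(var)
    spread = math.sqrt(sum((v - mean) ** 2 for v in var) / len(var))
    return spread / mean


def random_rotation(d, rng):
    """Random orthogonal matrix via Gram-Schmidt on Gaussian rows."""
    rows = []
    while len(rows) < d:
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for r in rows:
            dot = sum(a * b for a, b in zip(v, r))
            v = [a - dot * b for a, b in zip(v, r)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-9:
            rows.append([a / norm for a in v])
    return rows


def rotate(matrix, vec):
    """Apply an orthogonal matrix (list of rows) to a vector."""
    return [sum(m * x for m, x in zip(row, vec)) for row in matrix]


rng = random.Random(0)
d, n = 16, 2000
# Synthetic Gaussian data with very uneven per-coordinate scales (hypothetical).
scales = [2.0 ** (-j / 3.0) for j in range(d)]
data = [[rng.gauss(0.0, s) for s in scales] for _ in range(n)]

rot = random_rotation(d, rng)
rotated = [rotate(rot, v) for v in data]

het_before = heterogeneity(data)
het_after = heterogeneity(rotated)
codes = [sign_bits(v) for v in data]
print("heterogeneity before rotation:", het_before)
print("heterogeneity after rotation: ", het_after)
print("Hamming distance between first two codes:", hamming(codes[0], codes[1]))
```

On data like this, the rotation spreads variance across coordinates, so the score after rotation should be lower than before. That matches the abstract's qualitative claim that rotation trades per-coordinate signal for isotropy, though this toy says nothing about recall or ranking fidelity.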