Projection and Quantisation: A Unifying View of Learning to Hash, from Random Projections to the RAG Era

Approximate nearest neighbour (ANN) search underpins large-scale retrieval, increasingly within the retrieval-augmented generation pipelines that ground large language models, yet the methods that address it have multiplied across communities until they are seldom read as a single field. We argue they form one field with three design choices, and develop the projection-quantisation-organisation (PQO) lens, under which locality-sensitive hashing, learned binary hashing, deep end-to-end hashing, product quantisation, graph-based indexes, and the binary embeddings of modern vector databases are all settings of three coupled questions: where to place the projections, where to place the quantisation thresholds, and how to organise the resulting codes. The projection-then-quantisation reading is established; our contribution is the third, co-equal organisation stage, a demonstration that the three stages run unbroken from the field's origins to the deep, product-quantisation, graph, and retrieval-augmented eras, and a reproducible measurement that turns the lens from a tool for classifying methods into one for predicting their behaviour. The measurement yields three findings. First, memory is won on the quantisation axis: a one-bit-per-dimension code is one thirty-second the size of its 32-bit float counterpart, and a single full-precision re-ranking pass over a short candidate list recovers uncompressed quality in full. Second, the trade-off orderings the lens anticipates recur unchanged as the embedding grows. Third, where supervision is available, an eight-byte code more than doubles the quality of the two-kilobyte float it replaces. We release these measurements as BitBudget, an extensible benchmark with a live leaderboard; recast generative retrieval's "semantic identifiers" as quantisation codes; and identify the open problems that follow as compact codes return to the centre of large-scale retrieval.
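To make the three stages and the first finding concrete, the following is a minimal sketch rather than the paper's method or the BitBudget code. It assumes random Gaussian hyperplanes for the projection stage, a zero threshold for quantisation, a flat linear scan as the simplest possible organisation, and squared Euclidean distance for the full-precision re-ranking pass. All names (`sign_code`, `hamming`, `search`, and the toy sizes) are hypothetical and chosen only for illustration.

```python
import random

def sign_code(vec, planes):
    """Projection + quantisation: one bit per hyperplane, thresholded at zero."""
    code = 0
    for plane in planes:
        dot = sum(v * p for v, p in zip(vec, plane))
        code = (code << 1) | (1 if dot >= 0 else 0)
    return code

def hamming(a, b):
    """Number of differing bits between two integer-packed codes."""
    return bin(a ^ b).count("1")

def sq_dist(a, b):
    """Full-precision squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def search(query, database, codes, planes, shortlist=10, k=1):
    q_code = sign_code(query, planes)
    # Organisation (assumed here): a flat scan over all codes by Hamming distance.
    candidates = sorted(range(len(codes)),
                        key=lambda i: hamming(q_code, codes[i]))[:shortlist]
    # One full-precision re-ranking pass over the short candidate list.
    reranked = sorted(candidates, key=lambda i: sq_dist(query, database[i]))
    return reranked[:k]

# Toy data (assumed sizes): 200 vectors, 32 dimensions, one bit per dimension.
rng = random.Random(0)
dim, n_bits, n_items = 32, 32, 200
planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]
database = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_items)]
codes = [sign_code(v, planes) for v in database]

# Memory arithmetic: 32-bit floats versus one bit per dimension.
float_bits = dim * 32
code_bits = n_bits
compression_ratio = float_bits // code_bits
```

In this sketch only the compact codes are scanned for every query, while the full-precision vectors are touched for the shortlisted candidates alone. Swapping the flat scan for a hash table or a graph index would change the organisation stage without altering the other two.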
