Modern semantic search and retrieval-augmented generation (RAG) systems rely predominantly on in-memory approximate nearest neighbor (ANN) indexes over high-precision floating-point vectors, resulting in escalating operational cost and inherent trade-offs between latency, throughput, and retrieval accuracy. This paper analyzes the architectural limitations of the dominant "HNSW + float32 + cosine similarity" stack and evaluates existing cost-reduction strategies, including storage disaggregation and lossy vector quantization, each of which sacrifices either performance or accuracy. We introduce and empirically evaluate an alternative information-theoretic architecture based on maximally informative binarization (MIB), efficient bitwise distance metrics, and an information-theoretic scoring (ITS) mechanism. Unlike conventional ANN systems, this approach enables exhaustive search over compact binary representations, which makes retrieval deterministic and eliminates accuracy degradation under high query concurrency. Using the MAIR benchmark across 14 datasets and 10,038 queries, we compare this architecture against Elasticsearch, Pinecone, PGVector, and Qdrant. The results show retrieval quality comparable to that of full-precision systems, with substantially lower latency and constant throughput at high request rates. We show that this architectural shift enables a truly serverless, cost-per-query deployment model, challenging the necessity of large in-memory ANN indexes for high-quality semantic search.
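The binarize-then-scan idea described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how MIB or ITS work, so everything below is an assumption made for illustration: per-dimension median thresholding stands in for the binarization step, plain Hamming distance stands in for the bitwise metric, and the ITS scoring stage is not modeled at all. The function names (`fit_thresholds`, `binarize`, `hamming`, `exhaustive_search`) and the toy vectors are hypothetical.

```python
from statistics import median


def fit_thresholds(vectors):
    """Per-dimension median of the corpus (an assumed stand-in for MIB)."""
    dims = len(vectors[0])
    return [median(v[d] for v in vectors) for d in range(dims)]


def binarize(vector, thresholds):
    """Pack a float vector into an int: bit d is set when vector[d] > thresholds[d]."""
    code = 0
    for d, (x, t) in enumerate(zip(vector, thresholds)):
        if x > t:
            code |= 1 << d
    return code


def hamming(a, b):
    """Number of differing bits between two packed codes."""
    return bin(a ^ b).count("1")


def exhaustive_search(query_code, corpus_codes, k):
    """Scan every code and return the indices of the k nearest.

    Ties are broken by index, so the same query always yields the same ranking.
    """
    order = sorted(
        range(len(corpus_codes)),
        key=lambda i: (hamming(query_code, corpus_codes[i]), i),
    )
    return order[:k]


# Toy 4-dimensional "embeddings" (made-up values, not from the paper).
corpus = [
    [0.9, 0.1, -0.3, 0.5],
    [-0.8, 0.2, 0.4, -0.6],
    [0.7, -0.2, -0.1, 0.6],
    [-0.5, 0.9, 0.8, -0.1],
]
thresholds = fit_thresholds(corpus)
corpus_codes = [binarize(v, thresholds) for v in corpus]

query = [0.8, 0.0, -0.2, 0.4]
query_code = binarize(query, thresholds)
top2 = exhaustive_search(query_code, corpus_codes, k=2)
print(top2)
```

Because every code is scanned and ties are broken by index, the ranking does not depend on graph traversal order or index state, which is the sense in which an exhaustive scan is deterministic. A production system would presumably use packed machine words and hardware popcount rather than Python integers.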