We provide a reproducible, end-to-end demonstration of vector search with OpenAI embeddings using Lucene on the popular MS MARCO passage ranking test collection. The main goal of our work is to challenge the prevailing narrative that a dedicated vector store is necessary to take advantage of recent advances in deep neural networks as applied to search. On the contrary, we show that hierarchical navigable small-world network (HNSW) indexes in Lucene are adequate to provide vector search capabilities in a standard bi-encoder architecture. This suggests that, from a simple cost-benefit analysis, there does not appear to be a compelling reason to introduce a dedicated vector store into a modern "AI stack" for search, since search applications have already received substantial investments in existing, widely deployed infrastructure.