Anserini is a Lucene-based Java toolkit for reproducible information retrieval research that has been gaining traction in the community. It supports both "traditional" bag-of-words retrieval models such as BM25 and retrieval with learned sparse representations such as SPLADE. Pyserini, which provides a Python interface to Anserini, gives users access to both sparse and dense retrieval models: it implements bindings to the Faiss vector search library alongside Lucene inverted indexes behind a uniform, consistent interface. Nevertheless, hybrid fusion techniques that combine sparse and dense retrieval models must stitch together results from two completely different "software stacks", which creates unnecessary complexity and inefficiency. The introduction of HNSW indexes for dense vector search in Lucene promises to bring dense and sparse retrieval together within a single software framework. We explore exactly this integration in the context of Anserini. Experiments on the MS MARCO passage and BEIR datasets show that our Anserini HNSW integration supports (reasonably) effective and (reasonably) efficient approximate nearest neighbor search for dense retrieval models, using only Lucene.