SPRINT: A Unified Toolkit for Evaluating and Demystifying Zero-shot Neural Sparse Retrieval

Traditionally, sparse retrieval systems that rely on lexical representations to retrieve documents, such as BM25, dominated information retrieval tasks. With the onset of pre-trained transformer models such as BERT, neural sparse retrieval has led to a new paradigm within retrieval. Despite this success, there has been limited software support for running different sparse retrievers in a unified, common environment. This hinders practitioners from fairly comparing different sparse models and obtaining realistic evaluation results. Another missing piece is that a majority of prior work evaluates sparse retrieval models on in-domain retrieval, i.e. on a single dataset: MS MARCO. However, a key requirement of practical retrieval systems is that models generalize well to unseen out-of-domain, i.e. zero-shot, retrieval tasks. In this work, we provide SPRINT, a unified Python toolkit based on Pyserini and Lucene, supporting a common interface for evaluating neural sparse retrieval. The toolkit currently includes five built-in models: uniCOIL, DeepImpact, SPARTA, TILDEv2 and SPLADEv2. Users can also easily add customized models by defining their term weighting method. Using our toolkit, we establish strong and reproducible zero-shot sparse retrieval baselines across the well-acknowledged benchmark, BEIR. Our results demonstrate that SPLADEv2 achieves the best average score of 0.470 nDCG@10 on BEIR amongst all neural sparse retrievers. We further uncover the reasons behind its performance gain. We show that SPLADEv2 produces sparse representations in which a majority of tokens lie outside the original query and document; this expansion is often crucial for its performance gains, and its absence is a limitation of the other sparse models. We make our SPRINT toolkit, models, and data used in our experiments publicly available at https://github.com/thakur-nandan/sprint.
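To make the role of term weighting and expansion tokens concrete, the following is a minimal sketch of sparse retrieval scoring over an inverted index. It is not the SPRINT or Pyserini API: the function names (`build_inverted_index`, `search`), the toy documents, and all weights are hypothetical and chosen only for illustration. The sketch assumes each query and document is represented as a mapping from terms to weights and that relevance is the sum of products of matching term weights.

```python
from collections import defaultdict

def build_inverted_index(doc_weights):
    """Map each term to a postings list of (doc_id, weight) pairs."""
    index = defaultdict(list)
    for doc_id, weights in doc_weights.items():
        for term, weight in weights.items():
            index[term].append((doc_id, weight))
    return index

def search(index, query_weights, k=10):
    """Score documents by the dot product over terms shared with the query."""
    scores = defaultdict(float)
    for term, q_weight in query_weights.items():
        for doc_id, d_weight in index.get(term, []):
            scores[doc_id] += q_weight * d_weight
    # Highest score first; ties broken by document id for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:k]

# Hypothetical weights for a document about "car engine repair".
# The first variant only weights terms that occur in the text.
docs_lexical = {
    "d1": {"car": 1.2, "engine": 1.0, "repair": 0.8},
    "d2": {"garden": 1.0, "maintenance": 0.5},
}
# The second variant also assigns weight to expansion terms that do not
# occur in the text ("automobile", "vehicle"), as an expansion model might.
docs_expanded = {
    "d1": {"car": 1.2, "engine": 1.0, "repair": 0.8,
           "automobile": 0.9, "vehicle": 0.7},
    "d2": {"garden": 1.0, "maintenance": 0.5},
}

query = {"automobile": 1.0, "maintenance": 1.0}

results_lexical = search(build_inverted_index(docs_lexical), query)
results_expanded = search(build_inverted_index(docs_expanded), query)
```

In this toy setting, the lexical-only representation cannot match `d1` because the query shares no term with it, while the expanded representation retrieves `d1` through the expansion term `automobile`.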
