Learned sparse retrieval (LSR) is a family of first-stage retrieval methods that are trained to generate sparse lexical representations of queries and documents for use with an inverted index. Many LSR methods have recently been introduced, with Splade models achieving state-of-the-art performance on MSMarco. Despite similarities in their model architectures, these methods differ substantially in effectiveness and efficiency. Differences in experimental setups and configurations make it difficult to compare the methods and derive insights. In this work, we analyze existing LSR methods and identify their key components to establish an LSR framework that unifies all LSR methods under the same perspective. We then reproduce all prominent methods using a common codebase and re-train them in the same environment, which allows us to quantify how each component of the framework affects effectiveness and efficiency. We find that (1) document term weighting is the most important component for a method's effectiveness, (2) query weighting has a small positive impact, and (3) document expansion and query expansion have a cancellation effect. As a result, we show that removing query expansion from a state-of-the-art model can reduce latency significantly while maintaining effectiveness on the MSMarco and TripClick benchmarks. Our code is publicly available at https://github.com/thongnt99/learned-sparse-retrieval
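To make the components named above concrete, the following is a minimal sketch of how sparse term-weight vectors are scored through an inverted index. It is an illustration only, not code from the released repository: the function names (`build_inverted_index`, `score`, `postings_touched`) and all term weights are hypothetical, and a real LSR model would produce the weights with a trained encoder. The sketch assumes the relevance score is the dot product of the query and document vectors over shared terms.

```python
from collections import defaultdict


def build_inverted_index(doc_vectors):
    """Map each term to a postings list of (doc_id, weight) pairs."""
    index = defaultdict(list)
    for doc_id, vector in doc_vectors.items():
        for term, weight in vector.items():
            if weight > 0:
                index[term].append((doc_id, weight))
    return index


def score(query_vector, index):
    """Dot-product scoring: sum query weight * document weight per shared term."""
    scores = defaultdict(float)
    for term, q_weight in query_vector.items():
        for doc_id, d_weight in index.get(term, []):
            scores[doc_id] += q_weight * d_weight
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def postings_touched(query_vector, index):
    """Count postings entries traversed, a rough proxy for query latency."""
    return sum(len(index.get(term, [])) for term in query_vector)


# Hypothetical document vectors. "index" in d1 stands in for an expansion
# term that does not occur in the original document text.
doc_vectors = {
    "d1": {"sparse": 1.5, "retrieval": 1.0, "index": 0.5},
    "d2": {"dense": 1.0, "retrieval": 0.5},
}
index = build_inverted_index(doc_vectors)

# Three hypothetical query encodings of the text "sparse retrieval".
binary_query = {"sparse": 1.0, "retrieval": 1.0}    # no weighting, no expansion
weighted_query = {"sparse": 2.0, "retrieval": 0.5}  # query weighting only
expanded_query = {"sparse": 2.0, "retrieval": 0.5,  # weighting plus expansion
                  "index": 1.0, "dense": 0.25}

for name, query in [("binary", binary_query),
                    ("weighted", weighted_query),
                    ("expanded", expanded_query)]:
    print(name, score(query, index), postings_touched(query, index))
```

In this toy setting the expanded query traverses more postings lists than the unexpanded ones, which is the kind of cost that removing query expansion avoids. The toy numbers say nothing about effectiveness on real benchmarks.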