Learned sparse retrieval systems aim to combine the effectiveness of contextualized language models with the scalability of conventional data structures such as inverted indexes. However, the indexes these systems generate differ significantly from those built with traditional retrieval models, so existing query optimizations, which were developed specifically for traditional indexes, no longer perform as expected on them. These differences stem from structural variations in query and document statistics, including sub-word tokenization, that lead to longer queries, smaller vocabularies, and different score distributions within posting lists. This paper introduces Block-Max Pruning (BMP), a dynamic pruning strategy tailored for indexes arising in learned sparse retrieval environments. BMP uses a block filtering mechanism that divides the document space into small, consecutive document ranges; these ranges are aggregated and sorted on the fly and fully processed only as necessary, guided either by a defined safe early termination criterion or by approximate retrieval requirements. Our experiments show that BMP substantially outperforms existing dynamic pruning strategies, improving efficiency in safe retrieval and offering better tradeoffs between precision and efficiency in approximate retrieval.
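The block filtering idea described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`build_block_max_index`, `bmp_search`), the list-of-dicts forward index, and the `alpha` parameter for approximate retrieval are all hypothetical choices made for illustration. The sketch also assumes that all term weights are non-negative, so that the per-block maximum weights give a valid upper bound on any document score inside a block.

```python
import heapq
from collections import defaultdict


def build_block_max_index(docs, block_size):
    """Record, for every term, its maximum weight in each document block.

    docs is a list of {term: weight} dicts; the doc id is the list position.
    Assumption: weights are non-negative.
    """
    num_blocks = (len(docs) + block_size - 1) // block_size
    block_max = defaultdict(lambda: [0.0] * num_blocks)
    for doc_id, vec in enumerate(docs):
        b = doc_id // block_size
        for term, w in vec.items():
            if w > block_max[term][b]:
                block_max[term][b] = w
    return dict(block_max), num_blocks


def bmp_search(docs, block_max, num_blocks, block_size, query, k, alpha=1.0):
    """Return (top-k list of (score, doc_id), number of blocks fully scored).

    alpha = 1.0 is the safe mode; alpha < 1.0 (a hypothetical knob) shrinks
    the upper bounds and terminates earlier, at the cost of exactness.
    """
    # Aggregate per-term block maxima into one score upper bound per block.
    upper = [0.0] * num_blocks
    for term, qw in query.items():
        for b, m in enumerate(block_max.get(term, ())):
            upper[b] += qw * m

    # Visit blocks from the most to the least promising.
    order = sorted(range(num_blocks), key=lambda b: upper[b], reverse=True)

    heap = []  # min-heap of (score, doc_id); heap[0] holds the current threshold
    scored_blocks = 0
    for b in order:
        if upper[b] <= 0.0:
            break  # no query term occurs in this or any later block
        if len(heap) == k and alpha * upper[b] <= heap[0][0]:
            break  # no remaining block can beat the current k-th score
        scored_blocks += 1
        start, end = b * block_size, min((b + 1) * block_size, len(docs))
        for doc_id in range(start, end):
            score = sum(qw * docs[doc_id].get(t, 0.0) for t, qw in query.items())
            if score <= 0.0:
                continue
            if len(heap) < k:
                heapq.heappush(heap, (score, doc_id))
            elif score > heap[0][0]:
                heapq.heapreplace(heap, (score, doc_id))
    return sorted(heap, key=lambda x: (-x[0], x[1])), scored_blocks


# Toy collection: 8 documents, blocks of 2 consecutive doc ids.
docs = [
    {"a": 1.0, "b": 0.5}, {"a": 0.5},
    {"c": 2.0}, {"b": 0.5, "c": 0.5},
    {"a": 2.0, "b": 2.0}, {"a": 1.5},
    {"d": 1.0}, {"d": 0.5},
]
block_max, num_blocks = build_block_max_index(docs, block_size=2)
query = {"a": 1.0, "b": 2.0}
results, scored = bmp_search(docs, block_max, num_blocks, 2, query, k=2)
print(results, scored)  # [(6.0, 4), (2.0, 0)] 2
```

In this toy run the safe mode scores only two of the four blocks before the termination test fires. Calling the same function with `alpha=0.5` stops after one block and returns document 5 in place of document 0, which shows the kind of precision and efficiency tradeoff the approximate setting involves.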