We explore corpus-specific vocabularies as a way to improve both the efficiency and the effectiveness of learned sparse retrieval systems. We find that pre-training the underlying BERT model on the target corpus, using vocabularies of different sizes that are also incorporated into the document expansion process, improves retrieval quality by up to 12% while, in some scenarios, decreasing latency by up to 50%. Our experiments show that adopting a corpus-specific vocabulary and increasing the vocabulary size decreases the average postings list length, which in turn reduces latency. Ablation studies reveal interactions between custom vocabularies, document expansion techniques, and the sparsification objectives of sparse models. The effectiveness and efficiency improvements transfer to different retrieval approaches, such as uniCOIL and SPLADE, and offer a simple yet effective way to obtain new efficiency-effectiveness trade-offs for learned sparse retrieval systems.
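The link between vocabulary size and postings list length can be illustrated with a toy inverted index. The sketch below is not the method used in this work: the four-document corpus, the greedy longest-match tokenizer, and the names `make_greedy_tokenizer` and `average_postings_length` are all hypothetical and chosen only for illustration. It compares a tiny generic vocabulary (single characters only) with a larger vocabulary that contains the corpus's own words.

```python
from collections import defaultdict


def make_greedy_tokenizer(vocab):
    """Return a greedy longest-match tokenizer (hypothetical helper).

    Any span not found in `vocab` falls back to single characters.
    """
    def tokenize(text):
        tokens = []
        for word in text.split():
            i = 0
            while i < len(word):
                # Try the longest candidate first; a single character always matches.
                for j in range(len(word), i, -1):
                    if word[i:j] in vocab or j == i + 1:
                        tokens.append(word[i:j])
                        i = j
                        break
        return tokens
    return tokenize


def average_postings_length(docs, tokenize):
    """Build an inverted index and return the mean postings list length."""
    postings = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in tokenize(text):
            postings[term].add(doc_id)
    return sum(len(ids) for ids in postings.values()) / len(postings)


# Toy corpus (assumed for illustration only).
docs = [
    "retrieval latency",
    "sparse retrieval",
    "sparse index",
    "index latency",
]

# Generic vocabulary: nothing but single characters.
generic_avg = average_postings_length(docs, make_greedy_tokenizer(set()))

# Corpus-specific vocabulary: whole words that occur in the corpus.
corpus_vocab = {"retrieval", "latency", "sparse", "index"}
corpus_avg = average_postings_length(docs, make_greedy_tokenizer(corpus_vocab))

print(f"generic vocabulary: {generic_avg:.2f} postings per term")
print(f"corpus vocabulary:  {corpus_avg:.2f} postings per term")
```

In this toy setting, the character-level terms are shared by most documents and so have long postings lists, while the corpus-specific terms are more selective. Real learned sparse retrievers add term weighting and document expansion, which this sketch deliberately leaves out.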