Much recent work has focused on sparse learned indexes that use deep neural architectures to significantly improve retrieval quality while retaining the efficiency benefits of the inverted index. While such sparse learned structures achieve effectiveness far beyond that of traditional inverted-index rankers, there is still an effectiveness gap to the best dense retrievers, and even to sparse methods that leverage more expensive optimizations such as query expansion and query term weighting. We focus on narrowing this gap by revisiting and optimizing DeepImpact, a sparse retrieval approach that uses DocT5Query for document expansion followed by a BERT language model to learn impact scores for document terms. We first reinvestigate the expansion process and find that the recently proposed Doc2Query-- query filtration does not enhance retrieval quality when used with DeepImpact. Instead, substituting T5 with a fine-tuned Llama 2 model for query prediction results in a considerable improvement. We then study training strategies that have proven effective for other models, in particular the use of hard negatives, distillation, and initialization from a pre-trained CoCondenser model. Our results substantially narrow the effectiveness gap with the most effective versions of SPLADE.