Much recent work has focused on sparse learned indexes that use deep neural architectures to significantly improve retrieval quality while keeping the efficiency benefits of the inverted index. While such sparse learned structures achieve effectiveness far beyond that of traditional inverted index-based rankers, there is still a gap in effectiveness relative to the best dense retrievers, and even relative to sparse methods that leverage more expensive optimizations such as query expansion and query term weighting. We focus on narrowing this gap by revisiting and optimizing DeepImpact, a sparse retrieval approach that uses DocT5Query for document expansion followed by a BERT language model to learn impact scores for document terms. We first reinvestigate the expansion process and find that the recently proposed Doc2Query query filtration does not enhance retrieval quality when used with DeepImpact. Instead, substituting T5 with a fine-tuned Llama 2 model for query prediction results in a considerable improvement. Subsequently, we study training strategies that have proven effective for other models, in particular the use of hard negatives, distillation, and initialization from a pre-trained CoCondenser model. Our results significantly narrow the effectiveness gap with the most effective versions of SPLADE.
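To make the notion of learned impact scores concrete, the following is a minimal sketch of how a query can be scored against an inverted index whose postings store one precomputed impact per (term, document) pair. It assumes the common formulation in which a document's score is the sum of the impacts of the query terms it contains. The function names `build_index` and `search`, the document ids, and all impact values are hypothetical and chosen only for illustration; they are not taken from the paper or from any DeepImpact implementation.

```python
from collections import defaultdict


def build_index(doc_impacts):
    """Build an inverted index mapping term -> list of (doc_id, impact).

    `doc_impacts` maps each doc_id to a dict of term -> impact score.
    In a learned sparse index these impacts would come from a trained
    model; here they are hand-written toy values (an assumption).
    """
    index = defaultdict(list)
    for doc_id, impacts in doc_impacts.items():
        for term, impact in impacts.items():
            index[term].append((doc_id, impact))
    return index


def search(index, query_terms, k=10):
    """Score documents by summing the impacts of matching query terms."""
    scores = defaultdict(float)
    for term in set(query_terms):  # count each distinct query term once
        for doc_id, impact in index.get(term, []):
            scores[doc_id] += impact
    # Highest score first; ties broken by doc_id for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


# Toy data: made-up impact scores for two documents.
doc_impacts = {
    "d1": {"sparse": 3.0, "retrieval": 2.0, "index": 1.0},
    "d2": {"dense": 4.0, "retrieval": 1.5},
}
index = build_index(doc_impacts)
results = search(index, ["sparse", "retrieval"])
print(results)
```

Because all model inference happens at indexing time in this setup, query processing reduces to posting-list lookups and additions, which is the efficiency benefit of the inverted index that the abstract refers to.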