Hybrid retrievers combine the strengths of sparse and dense retrievers. Previous hybrid retrievers rely on indexing-heavy dense retrievers. In this work, we ask: is it possible to reduce the indexing memory of hybrid retrievers without sacrificing performance? Driven by this question, we leverage an indexing-efficient dense retriever (i.e., DrBoost) and introduce a LITE retriever that further reduces the memory footprint of DrBoost. LITE is jointly trained with contrastive learning and knowledge distillation from DrBoost. We then integrate BM25, a sparse retriever, with either LITE or DrBoost to form light hybrid retrievers. Our Hybrid-LITE retriever reduces memory by 13× while maintaining 98.0% of the performance of the hybrid retriever of BM25 and DPR. In addition, we study the generalization capacity of our light hybrid retrievers on an out-of-domain dataset and a set of adversarial attack datasets. Experiments show that light hybrid retrievers generalize better than individual sparse and dense retrievers. Nevertheless, our analysis shows that there is substantial room to improve the robustness of retrievers, suggesting a new research direction.
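To make the idea of integrating a sparse retriever with a dense one concrete, the following is a minimal sketch of score-level fusion. The abstract does not state how the BM25 and LITE/DrBoost scores are combined, so the min-max normalization, the linear interpolation, and the names `min_max_normalize`, `hybrid_rank`, and `weight` are all assumptions made for illustration, not the method used in this work.

```python
from typing import Dict, List, Tuple


def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1]; if all scores are equal, map them to 0.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 0.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def hybrid_rank(
    sparse: Dict[str, float],
    dense: Dict[str, float],
    weight: float = 0.5,
    top_k: int = 10,
) -> List[Tuple[str, float]]:
    """Fuse sparse (e.g. BM25) and dense scores by linear interpolation.

    Assumption: a document missing from one retriever's candidate list
    contributes 0.0 for that retriever. `weight` is the share given to
    the sparse score and is a hypothetical hyperparameter.
    """
    sparse_n = min_max_normalize(sparse)
    dense_n = min_max_normalize(dense)
    fused = {
        doc_id: weight * sparse_n.get(doc_id, 0.0)
        + (1.0 - weight) * dense_n.get(doc_id, 0.0)
        for doc_id in set(sparse_n) | set(dense_n)
    }
    # Sort by fused score (descending), breaking ties by document id.
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
```

Normalizing before interpolation matters because BM25 scores and dense inner-product scores live on different scales; other fusion rules, such as reciprocal rank fusion, would fit the same interface.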