Dense retrieval overcomes the lexical gap and has shown great success in ad-hoc information retrieval (IR). Despite this success, dense retrievers are expensive to serve in practical use cases. For use cases that require searching over millions of documents, the dense index becomes bulky and requires a large amount of memory to store. More recently, learning-to-hash (LTH) techniques, e.g., BPR and JPQ, produce binary document vectors, thereby reducing the memory required to store the dense index. LTH techniques are supervised and fine-tune the retriever using a ranking loss. They outperform their counterparts, i.e., traditional out-of-the-box vector compression techniques such as PCA or PQ. A missing piece in prior work is that existing techniques have been evaluated only in-domain, i.e., on a single dataset such as MS MARCO. In our work, we evaluate LTH and vector compression techniques for improving the downstream zero-shot retrieval accuracy of the TAS-B dense retriever while maintaining efficiency at inference. Our results demonstrate that, unlike in prior work, LTH strategies when applied naively can underperform the zero-shot TAS-B dense retriever on average by up to 14% nDCG@10 on the BEIR benchmark. To address this limitation, we propose an easy yet effective solution: injecting domain adaptation into existing supervised LTH techniques. We experiment with two well-known unsupervised domain adaptation techniques: GenQ and GPL. Our domain adaptation injection technique improves the downstream zero-shot retrieval effectiveness of both the BPR and JPQ variants of the TAS-B model by on average 11.5% and 8.2% nDCG@10, respectively, while maintaining 32$\times$ memory efficiency and 14$\times$ and 2$\times$ speedups, respectively, in CPU retrieval latency on BEIR. All our code, models, and data are publicly available at https://github.com/thakur-nandan/income.
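To make the 32$\times$ memory figure concrete, the following is a minimal sketch of where such a ratio can come from when float32 document vectors are replaced by one bit per dimension. It is not the BPR or JPQ method: the sign-threshold hash, the 768-dimensional embedding size, the random stand-in vectors, and the helper names `binarize` and `hamming_search` are all assumptions made for illustration, whereas the actual LTH techniques learn their hash functions with a ranking loss.

```python
import numpy as np


def binarize(embeddings: np.ndarray) -> np.ndarray:
    """Hypothetical hash: keep the sign of each dimension, pack 8 bits per byte."""
    bits = (embeddings > 0).astype(np.uint8)
    return np.packbits(bits, axis=1)


def hamming_search(query_code: np.ndarray, doc_codes: np.ndarray, top_k: int) -> np.ndarray:
    """Return indices of the top_k documents closest to the query in Hamming distance."""
    xor = np.bitwise_xor(doc_codes, query_code)        # differing bits, still packed
    dists = np.unpackbits(xor, axis=1).sum(axis=1)     # count differing bits per document
    return np.argsort(dists, kind="stable")[:top_k]


# Random vectors stand in for dense document embeddings (assumed 768-dim float32).
rng = np.random.default_rng(0)
docs = rng.standard_normal((1000, 768)).astype(np.float32)

codes = binarize(docs)                  # shape (1000, 96), dtype uint8
ratio = docs.nbytes / codes.nbytes      # 32 bits per dimension vs. 1 bit per dimension

nearest = hamming_search(codes[0], codes, top_k=5)
```

Under these assumptions each document shrinks from 768 × 4 = 3072 bytes to 768 / 8 = 96 bytes, so `ratio` is 32. The sketch covers only the storage side; retrieval quality depends on how the hash is learned, which is what the LTH methods and the domain adaptation step address.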