Dense retrievers have achieved impressive performance, but their demand for abundant training data limits their application scenarios. Contrastive pre-training, which constructs pseudo-positive examples from unlabeled data, has shown great potential to solve this problem. However, the pseudo-positive pairs crafted by data augmentation can be irrelevant to each other. To address this, we propose relevance-aware contrastive learning. It uses the model at an intermediate stage of training as an imperfect oracle to estimate the relevance of positive pairs, and adaptively weights the contrastive loss of each pair according to its estimated relevance. Our method consistently improves the state-of-the-art unsupervised Contriever model on the BEIR and open-domain QA retrieval benchmarks. Further exploration shows that our method not only beats BM25 after further pre-training on the target corpus but also serves as a good few-shot learner. Our code is publicly available at https://github.com/Yibin-Lei/ReContriever.
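To make the idea of relevance-weighted contrastive loss concrete, the following is a minimal, framework-free sketch and not the authors' implementation. It assumes an in-batch InfoNCE loss, takes the model's own cosine similarity for each pseudo-positive pair as the relevance estimate, and turns those estimates into weights with a softmax over the batch, rescaled so the weights average to 1. The function name `relevance_weighted_info_nce` and the parameters `tau` and `tau_w` are hypothetical; the abstract does not specify the exact weighting scheme. In a real training loop the weights would be treated as constants (no gradient flows through them).

```python
import math


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _normalize(v):
    norm = math.sqrt(_dot(v, v))
    return [x / norm for x in v]


def _log_softmax(xs):
    # Numerically stable log-softmax over a list of floats.
    m = max(xs)
    lse = m + math.log(sum(math.exp(x - m) for x in xs))
    return [x - lse for x in xs]


def relevance_weighted_info_nce(queries, docs, tau=0.05, tau_w=1.0):
    """Sketch of a relevance-weighted in-batch contrastive loss.

    queries[i] and docs[i] form a pseudo-positive pair; every other
    document in the batch acts as a negative for queries[i].
    tau is the contrastive temperature; tau_w (an assumption) controls
    how sharply the weights follow the estimated relevance.
    """
    q = [_normalize(v) for v in queries]
    d = [_normalize(v) for v in docs]
    n = len(q)

    # Cosine similarity between every query and every document.
    sim = [[_dot(qi, dj) for dj in d] for qi in q]

    # Standard InfoNCE loss of each positive pair.
    per_pair = [-_log_softmax([s / tau for s in sim[i]])[i] for i in range(n)]

    # Assumed relevance estimate: the model's own score for the pair.
    relevance = [sim[i][i] for i in range(n)]

    # Softmax over the batch, rescaled so the weights sum to n.
    weights = [n * math.exp(lw)
               for lw in _log_softmax([r / tau_w for r in relevance])]

    loss = sum(w * l for w, l in zip(weights, per_pair)) / n
    return loss, weights, per_pair
```

With a very large `tau_w` the weights become uniform and the loss reduces to the plain average InfoNCE loss, which is a convenient sanity check for the sketch.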