Text embedding models have become popular for information retrieval applications such as semantic search and Question-Answering systems based on Retrieval-Augmented Generation (RAG). These models are typically Transformer models fine-tuned with contrastive learning objectives. One of the challenging aspects of fine-tuning embedding models is the selection of high-quality hard-negative passages for contrastive learning. In this paper we introduce a family of positive-aware mining methods that use the positive relevance score as an anchor for effective false-negative removal, leading to faster training and more accurate retrieval models. We provide an ablation study of hard-negative mining methods across their configurations, exploring different teacher and base models. We further demonstrate the efficacy of our proposed mining methods at scale with the NV-Retriever-v1 model, which scores 60.9 on the MTEB Retrieval (BEIR) benchmark and placed 1st on the MTEB Retrieval leaderboard when published in July 2024.
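The core idea of positive-aware mining can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: candidate negatives scored by a teacher model are discarded as likely false negatives when their relevance score comes too close to the positive passage's score. The function name, the threshold value, and the top-k selection details here are assumptions for illustration.

```python
def mine_hard_negatives(pos_score, candidates, k=4, max_frac_of_pos=0.95):
    """Hypothetical sketch of positive-aware hard-negative mining.

    pos_score: teacher relevance score of the known positive passage.
    candidates: list of (passage_id, teacher_score) pairs for candidate negatives.
    k: number of hard negatives to keep per query.
    max_frac_of_pos: illustrative ceiling; candidates scoring above
        this fraction of the positive's score are treated as likely
        false negatives and removed.
    """
    ceiling = pos_score * max_frac_of_pos
    # Drop candidates scoring at or above the ceiling: passages this
    # close to the positive are likely relevant, i.e. false negatives.
    filtered = [(pid, s) for pid, s in candidates if s < ceiling]
    # Among the survivors, keep the hardest (highest-scoring) negatives,
    # which provide the most informative contrastive signal.
    filtered.sort(key=lambda pair: pair[1], reverse=True)
    return filtered[:k]
```

Using the positive score as the anchor makes the filter adapt per query: queries with a strongly matched positive tolerate higher-scoring negatives, while queries with a weak positive filter more aggressively.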