Text embedding models have become popular for information retrieval applications such as semantic search and Question-Answering systems based on Retrieval-Augmented Generation (RAG). These models are typically Transformer models fine-tuned with contrastive learning objectives. Many papers have introduced new embedding model architectures and training approaches; however, one of the key ingredients, the process of mining negative passages, remains poorly explored and described. One of the challenging aspects of fine-tuning embedding models is the selection of high-quality hard-negative passages for contrastive learning. In this paper we propose a family of positive-aware mining methods that leverage the positive relevance score for more effective false negative removal. We also provide a comprehensive ablation study of hard-negative mining methods over their configurations, exploring different teacher and base models. We demonstrate the efficacy of the proposed methods by introducing the NV-Retriever-v1 model, which scores 60.9 on the MTEB Retrieval (BEIR) benchmark, 0.65 points higher than previous methods. The model placed 1st on the MTEB Retrieval leaderboard when it was published on July 7, 2024.
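The core positive-aware mining idea described above can be sketched as follows. This is a minimal illustration, not the paper's exact implementation: the function name, the default threshold ratio, and the candidate format are hypothetical, and it assumes candidate passages have already been scored against the query by a teacher embedding model.

```python
def mine_hard_negatives(pos_score, candidates, max_neg_ratio=0.95, k=4):
    """Positive-aware hard-negative mining sketch.

    pos_score:   teacher relevance score of the known positive passage.
    candidates:  list of (passage_id, teacher_score) pairs retrieved for the query.
    max_neg_ratio: candidates scoring above max_neg_ratio * pos_score are
                   treated as likely false negatives and discarded (hypothetical
                   default, for illustration only).
    k:           number of hard negatives to keep.
    """
    # The positive score anchors the filtering threshold: anything nearly as
    # relevant as the positive is probably a mislabeled (false) negative.
    threshold = max_neg_ratio * pos_score
    filtered = [(pid, s) for pid, s in candidates if s < threshold]
    # Keep the hardest (highest-scoring) surviving candidates as negatives.
    filtered.sort(key=lambda pair: pair[1], reverse=True)
    return filtered[:k]


# Example: a candidate scoring 0.9 (above 0.95 * 0.8 = 0.76) is dropped
# as a likely false negative; the rest are kept as hard negatives.
negs = mine_hard_negatives(0.8, [("d1", 0.9), ("d2", 0.7), ("d3", 0.5)])
```

A score-based variant like this contrasts with naive top-k mining, which would keep `d1` and thereby push a likely relevant passage away from the query during contrastive training.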