Modern knowledge-intensive systems, such as retrieval-augmented generation (RAG), rely on effective retrievers to establish the performance ceiling for downstream modules. However, retriever training has been bottlenecked by sparse, single-positive annotations, which lead to false-negative noise and suboptimal supervision. While the advent of large language models (LLMs) makes it feasible to collect comprehensive multi-positive relevance labels at scale, the optimal strategy for incorporating these dense signals into training remains poorly understood. In this paper, we present a systematic study of multi-positive optimization objectives for retriever training. We unify representative objectives, including Joint Likelihood (JointLH), Summed Marginal Likelihood (SumMargLH), and Log-Sum-Exp Pairwise (LSEPair) loss, under a shared contrastive learning framework. Our theoretical analysis characterizes their distinct gradient behaviors, revealing how each allocates probability mass across positive document sets. Empirically, we conduct extensive evaluations on Natural Questions, MS MARCO, and the BEIR benchmark across two realistic regimes: homogeneous LLM-annotated data and heterogeneous mixtures of human and LLM labels. Our results show that LSEPair consistently achieves superior robustness and performance across settings, while JointLH and SumMargLH exhibit high sensitivity to the quality of positives. Furthermore, we find that the simple strategy of random sampling (Rand1LH) serves as a reliable baseline. By aligning theoretical insights with empirical findings, we provide practical design principles for leveraging dense, LLM-augmented supervision to enhance retriever effectiveness.
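The abstract does not give the exact loss definitions, but the three objectives can be illustrated with one plausible formulation each, commonly seen in the multi-positive contrastive literature: JointLH as the sum of per-positive softmax cross-entropies, SumMargLH as the negative log of the total probability mass on the positive set, and LSEPair as a log-sum-exp over all (positive, negative) score gaps. The sketch below is an assumption-laden illustration, not the paper's actual implementation; `scores` are query-document similarity logits and `pos_idx` indexes the positive documents.

```python
import math

def logsumexp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def joint_lh_loss(scores, pos_idx):
    # One plausible JointLH: sum of per-positive InfoNCE terms,
    # i.e., -log of the product of each positive's softmax probability.
    lz = logsumexp(scores)
    return sum(lz - scores[i] for i in pos_idx)

def sum_marg_lh_loss(scores, pos_idx):
    # One plausible SumMargLH: -log of the summed softmax mass
    # assigned to the whole positive set.
    lz = logsumexp(scores)
    return lz - logsumexp([scores[i] for i in pos_idx])

def lse_pair_loss(scores, pos_idx):
    # One plausible LSEPair: log(1 + sum over (pos, neg) pairs of
    # exp(s_neg - s_pos)), penalizing any negative scored above a positive.
    neg_idx = [i for i in range(len(scores)) if i not in pos_idx]
    pair_terms = [scores[n] - scores[p] for p in pos_idx for n in neg_idx]
    return logsumexp([0.0] + pair_terms)
```

Under these forms, SumMargLH is always upper-bounded by JointLH (set mass exceeds each member's mass), and all three decrease when positive scores rise relative to negatives, consistent with the gradient-allocation contrast the abstract describes.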