In open-domain Question Answering (QA), dense retrieval is crucial for finding relevant passages for answer generation. Typically, contrastive learning is used to train a retrieval model that maps passages and queries into the same semantic space, with the objective of pulling similar pairs closer together and pushing dissimilar pairs further apart. However, training such a system is challenging because of the false negative issue: relevant passages may be missed during data annotation and are then treated as negatives. Hard negative sampling, which is commonly used to improve contrastive learning, can introduce more noise into training, because hard negatives are the passages closer to a given query and are therefore more likely to be false negatives. To address this issue, we propose a novel contrastive confidence regularizer for the Noise Contrastive Estimation (NCE) loss, a commonly used loss for dense retrieval. Our analysis shows that the regularizer makes dense retrieval models more robust against false negatives, with a theoretical guarantee. Additionally, we propose a model-agnostic method to filter out noisy negative passages in the dataset, which can improve any downstream dense retrieval model. Through experiments on three datasets, we demonstrate that our method achieves better retrieval performance than existing state-of-the-art dense retrieval systems.
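
To make the role of hard negatives concrete, the following is a minimal sketch of a standard NCE-style contrastive loss for a single query, assuming dot-product similarity and a softmax over one positive and a set of negatives. It does not implement the proposed contrastive confidence regularizer or the filtering method, whose forms are not given here; the function names, the `temperature` parameter, and the toy two-dimensional embeddings are all illustrative assumptions.

```python
import math


def dot(u, v):
    """Dot-product similarity between two embedding vectors (assumed scoring function)."""
    return sum(a * b for a, b in zip(u, v))


def nce_loss(query, positive, negatives, temperature=1.0):
    """Negative log-probability of the positive passage under a softmax
    over the positive and all negative passages."""
    scores = [dot(query, positive)] + [dot(query, n) for n in negatives]
    scaled = [s / temperature for s in scores]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
    return log_z - scaled[0]


# Toy embeddings (hypothetical values, chosen only for illustration).
query = [1.0, 0.0]
positive = [0.9, 0.1]
easy_negative = [-1.0, 0.0]  # far from the query
hard_negative = [0.8, 0.2]   # close to the query

loss_easy = nce_loss(query, positive, [easy_negative])
loss_hard = nce_loss(query, positive, [hard_negative])
print(loss_easy, loss_hard)
```

In this sketch the hard negative yields the larger loss, so it dominates the training signal. If that passage is in fact an unlabeled relevant passage, the loss pushes it away from the query, which is the noise that the abstract attributes to hard negative sampling.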