Sparse annotation poses persistent challenges to training dense retrieval models. One such challenge is the problem of false negatives, i.e., unlabeled relevant documents that are spuriously used as negatives in contrastive learning and thereby distort the training signal. To alleviate this problem, we introduce evidence-based label smoothing, a computationally efficient method that avoids penalizing the model for assigning high relevance to false negatives. The target relevance distribution over candidate documents within the ranking context of a given query is computed by assigning a non-zero relevance probability to the candidates most similar to the ground truth, based on the degree of their similarity to the ground-truth document(s). As a relevance estimate, we leverage an improved similarity metric based on reciprocal nearest neighbors, which can also be used independently to rerank candidates in post-processing. Through extensive experiments on two large-scale ad hoc text retrieval datasets, we demonstrate that both methods can improve the ranking effectiveness of dense retrieval models.
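To make the idea concrete, the following is a minimal sketch of how a smoothed target distribution could be built from reciprocal-nearest-neighbor evidence. It is an illustration under stated assumptions, not the exact formulation evaluated in this work: the helper names (`k_nearest`, `reciprocal_neighbors`, `rnn_similarity`, `smoothed_targets`), the Jaccard overlap of reciprocal neighbor sets as the similarity, the parameters `k` and `eps`, and the toy similarity matrix are all hypothetical choices made for the example.

```python
def k_nearest(sim, i, k):
    """Indices of the k items most similar to item i (item i itself included)."""
    order = sorted(range(len(sim)), key=lambda j: sim[i][j], reverse=True)
    return set(order[:k])


def reciprocal_neighbors(sim, i, k):
    """Items j in i's top-k that also have i in their own top-k."""
    return {j for j in k_nearest(sim, i, k) if i in k_nearest(sim, j, k)}


def rnn_similarity(sim, i, j, k):
    """Assumed metric: Jaccard overlap of the two reciprocal neighbor sets."""
    a = reciprocal_neighbors(sim, i, k)
    b = reciprocal_neighbors(sim, j, k)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def smoothed_targets(sim, gt, k=3, eps=0.3):
    """Target distribution over candidates for one query.

    The ground-truth candidate `gt` keeps mass 1 - eps; the remaining eps is
    shared among the other candidates according to their evidence, i.e. their
    similarity to the ground truth. Both k and eps are hypothetical settings.
    """
    n = len(sim)
    evidence = [0.0 if j == gt else rnn_similarity(sim, gt, j, k) for j in range(n)]
    total = sum(evidence)
    if total == 0.0:
        # No candidate resembles the ground truth: fall back to a one-hot label.
        return [1.0 if j == gt else 0.0 for j in range(n)]
    return [(1.0 - eps) if j == gt else eps * evidence[j] / total for j in range(n)]


# Toy symmetric similarity matrix over five candidates (made-up values).
# Candidate 0 is the labeled ground truth; 1 and 2 are close to it, 3 and 4 are not.
sim = [
    [1.0, 0.9, 0.8, 0.1, 0.2],
    [0.9, 1.0, 0.7, 0.2, 0.1],
    [0.8, 0.7, 1.0, 0.3, 0.1],
    [0.1, 0.2, 0.3, 1.0, 0.9],
    [0.2, 0.1, 0.1, 0.9, 1.0],
]
targets = smoothed_targets(sim, gt=0, k=3, eps=0.3)
print(targets)
```

In this toy setting, candidates 1 and 2 share the ground truth's reciprocal neighborhood and therefore receive non-zero target relevance, while candidates 3 and 4 remain ordinary negatives. A model trained against such targets (for instance with a KL-divergence loss) would not be penalized for scoring the likely false negatives highly. The same `rnn_similarity` scores could in principle be used on their own to rerank candidates in post-processing.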