Sparse annotation poses persistent challenges to training dense retrieval models; for example, it distorts the training signal when unlabeled relevant documents are incorrectly used as negatives in contrastive learning. To alleviate this problem, we introduce evidence-based label smoothing, a novel, computationally efficient method that avoids penalizing the model for assigning high relevance to false negatives. To compute the target relevance distribution over candidate documents within the ranking context of a given query, we assign non-zero relevance probabilities to the candidates most similar to the ground-truth document(s), according to their degree of similarity. To estimate relevance, we use an improved similarity metric based on reciprocal nearest neighbors, which can also be used independently to rerank candidates in post-processing. Through extensive experiments on two large-scale ad hoc text retrieval datasets, we demonstrate that reciprocal nearest neighbors can improve the ranking effectiveness of dense retrieval models, both when used for label smoothing and when used for reranking. This indicates that, by considering relationships between documents and queries beyond simple geometric distance, we can effectively enhance the ranking context.
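As an illustration only, the following Python sketch shows one way the two ideas above could fit together: a similarity based on reciprocal nearest neighbors, and a smoothed target distribution that moves some probability mass from the labeled positive to its most similar candidates. It is a simplified sketch under stated assumptions, not the implementation evaluated in this work. The function names, the Jaccard overlap of reciprocal-neighbor sets, the parameters `k`, `top_m`, and `mass`, and the toy embeddings are all hypothetical choices made for this example.

```python
import numpy as np


def reciprocal_neighbors(sim: np.ndarray, k: int) -> np.ndarray:
    """Return a boolean matrix R where R[i, j] is True iff i and j
    appear in each other's top-k lists (each item counts itself)."""
    n = sim.shape[0]
    topk = np.argsort(-sim, axis=1)[:, :k]      # k most similar items per row
    knn = np.zeros((n, n), dtype=bool)
    knn[np.arange(n)[:, None], topk] = True
    return knn & knn.T                          # keep mutual neighbors only


def jaccard_similarity(recip: np.ndarray) -> np.ndarray:
    """Jaccard overlap between reciprocal-neighbor sets (assumed metric)."""
    r = recip.astype(float)
    inter = r @ r.T                             # pairwise intersection sizes
    sizes = r.sum(axis=1)
    union = sizes[:, None] + sizes[None, :] - inter
    return np.divide(inter, union, out=np.zeros_like(inter), where=union > 0)


def smoothed_targets(rnn_sim: np.ndarray, positive_idx: int,
                     top_m: int, mass: float) -> np.ndarray:
    """Spread `mass` over the top_m candidates most similar to the labeled
    positive, weighted by similarity; the positive keeps the remainder."""
    scores = rnn_sim[positive_idx].copy()
    scores[positive_idx] = -np.inf              # exclude the positive itself
    chosen = np.argsort(-scores)[:top_m]
    chosen = chosen[scores[chosen] > 0]         # only candidates with evidence
    target = np.zeros(rnn_sim.shape[0])
    if chosen.size == 0:                        # no evidence: one-hot label
        target[positive_idx] = 1.0
        return target
    target[chosen] = mass * scores[chosen] / scores[chosen].sum()
    target[positive_idx] = 1.0 - mass
    return target


# Toy ranking context: candidates 0-2 point in similar directions, 3-4 do not.
emb = np.array([[1.0, 0.0], [0.99, 0.1], [0.95, 0.2], [0.0, 1.0], [0.1, 0.99]])
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
cosine = emb @ emb.T

rnn_sim = jaccard_similarity(reciprocal_neighbors(cosine, k=3))
target = smoothed_targets(rnn_sim, positive_idx=0, top_m=2, mass=0.2)
```

In this toy setup, candidate 0 plays the role of the labeled positive. Candidates that share its reciprocal neighborhood receive part of the relevance mass, while the remaining candidates keep a target of zero and continue to act as negatives. The same `rnn_sim` matrix could, under the same assumptions, be used to reorder candidates after retrieval.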