Recently, dense passage retrieval has become a mainstream approach to finding relevant information in various natural language processing tasks. A number of studies have been devoted to improving the widely adopted dual-encoder architecture. However, most previous studies consider only the query-centric similarity relation when learning the dual-encoder retriever. To capture more comprehensive similarity relations, we propose a novel approach that leverages both query-centric and PAssage-centric sImilarity Relations (called PAIR) for dense passage retrieval. To implement our approach, we make three major technical contributions: introducing formal formulations of the two kinds of similarity relations, generating high-quality pseudo-labeled data via knowledge distillation, and designing an effective two-stage training procedure that incorporates a passage-centric similarity relation constraint. Extensive experiments show that our approach significantly outperforms previous state-of-the-art models on both the MSMARCO and Natural Questions datasets.
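To make the distinction between the two similarity relations concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not state the formal definitions, so everything here is an assumption: dot-product similarity, a softmax-style contrastive loss for each relation, and a hypothetical mixing weight `alpha`. The query-centric term asks that the query score its positive passage above the negatives; the passage-centric term asks that the positive passage be closer to the query than to the negative passages.

```python
import numpy as np

def nll_first(scores):
    """Negative log-probability of index 0 under a softmax over `scores`."""
    scores = np.asarray(scores, dtype=float)
    shifted = scores - scores.max()  # subtract the max for numerical stability
    return -(shifted[0] - np.log(np.exp(shifted).sum()))

def pair_style_loss(q, p_pos, p_negs, alpha=0.1):
    """Combine a query-centric and a passage-centric contrastive loss.

    q, p_pos: 1-D embeddings; p_negs: 2-D array with one negative per row.
    `alpha` is a hypothetical weight on the passage-centric term.
    """
    q = np.asarray(q, dtype=float)
    p_pos = np.asarray(p_pos, dtype=float)
    p_negs = np.asarray(p_negs, dtype=float)

    s_q_pos = q @ p_pos           # s(q, p+), shared by both terms
    s_q_neg = p_negs @ q          # s(q, p-): query-centric negatives
    s_pos_neg = p_negs @ p_pos    # s(p+, p-): passage-centric negatives

    loss_q = nll_first(np.concatenate(([s_q_pos], s_q_neg)))
    loss_p = nll_first(np.concatenate(([s_q_pos], s_pos_neg)))
    total = (1.0 - alpha) * loss_q + alpha * loss_p
    return total, loss_q, loss_p

# A negative that is far from the query but close to the positive passage:
# only the passage-centric term penalizes it strongly.
total, loss_q, loss_p = pair_style_loss(
    q=[1.0, 0.0], p_pos=[1.0, 1.0], p_negs=[[0.0, 1.0]], alpha=0.1
)
```

In this toy case the negative passage is orthogonal to the query, so the query-centric loss is small, while the passage-centric loss is larger because the negative overlaps with the positive passage. This is the kind of signal a purely query-centric objective would miss.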