Recent methods improve dense passage retrieval through context-supervised pre-training. These methods treat any two passages from the same document as relevant to each other, without accounting for pairs that are only weakly correlated. To alleviate this issue, this paper proposes query-as-context pre-training, a simple yet effective pre-training technique. Query-as-context pre-training assumes that a query derived from a passage is more likely to be relevant to that passage than a second passage from the same document is, and it pairs the two to form a passage-query pair. These passage-query pairs are then used in contrastive or generative context-supervised pre-training. The pre-trained models are evaluated on large-scale passage retrieval benchmarks and on out-of-domain zero-shot benchmarks. Experimental results show that query-as-context pre-training yields considerable gains while also speeding up training, demonstrating both its effectiveness and its efficiency. Our code will be available at https://github.com/caskcsg/ir/tree/main/cotmae-qc.
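As an illustration of the contrastive variant described in the abstract, the following is a minimal sketch of how passage-query pairs might be formed and scored. The abstract does not specify the query generator, the encoder, or the exact loss, so everything here is an assumption: `generate_query` is a hypothetical callable standing in for whatever model derives a query from a passage, the embeddings are plain lists of floats standing in for encoder outputs, and the loss is a standard in-batch softmax (InfoNCE-style) objective rather than the authors' confirmed formulation.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def build_passage_query_pairs(
    passages: Sequence[str],
    generate_query: Callable[[str], str],  # hypothetical query generator
) -> List[Tuple[str, str]]:
    """Pair each passage with a query derived from that same passage."""
    return [(passage, generate_query(passage)) for passage in passages]


def dot(u: Vector, v: Vector) -> float:
    """Dot-product similarity between two embeddings."""
    return sum(a * b for a, b in zip(u, v))


def in_batch_contrastive_loss(
    passage_vecs: Sequence[Vector],
    query_vecs: Sequence[Vector],
    temperature: float = 1.0,
) -> float:
    """Mean softmax cross-entropy where passage i should match query i.

    The queries belonging to the other passages in the batch serve as
    negatives (an assumed, commonly used setup).
    """
    if not passage_vecs or len(passage_vecs) != len(query_vecs):
        raise ValueError("need equally sized, non-empty batches")
    total = 0.0
    for i, passage_vec in enumerate(passage_vecs):
        logits = [dot(passage_vec, q) / temperature for q in query_vecs]
        # Log-sum-exp with the max subtracted for numerical stability.
        peak = max(logits)
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_norm - logits[i]
    return total / len(passage_vecs)
```

Under these assumptions, a batch whose passage and query embeddings are well aligned produces a lower loss than one in which the queries are shuffled, which is the signal a contrastive pre-training step would minimize.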