In this paper, we systematically study the potential of pre-training with Large Language Model (LLM)-based document expansion for dense passage retrieval. Concretely, we use LLMs for document expansion, i.e., query generation, and transfer the expanded knowledge to retrievers through pre-training strategies tailored for passage retrieval: contrastive learning and bottlenecked query generation. Furthermore, we incorporate a curriculum learning strategy to reduce the reliance on LLM inference. Experimental results demonstrate that pre-training with LLM-based document expansion significantly boosts retrieval performance on large-scale web-search tasks. The resulting retrievers also show strong zero-shot and out-of-domain retrieval ability, which makes the approach more widely applicable when retrievers are initialized without human-labeled data.
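To make the contrastive pre-training objective concrete, the following is a minimal sketch of an in-batch contrastive (InfoNCE-style) loss over pairs of LLM-generated queries and their source passages. The abstract does not specify the exact loss, so the dot-product similarity, the `temperature` parameter, the function name `in_batch_contrastive_loss`, and the toy embeddings are all illustrative assumptions rather than the method's actual formulation.

```python
import math


def dot(u, v):
    """Dot-product similarity between two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def in_batch_contrastive_loss(query_vecs, passage_vecs, temperature=1.0):
    """Mean InfoNCE-style loss with in-batch negatives (assumed formulation).

    query_vecs[i] is the embedding of a generated query and passage_vecs[i]
    the embedding of the passage it was generated from; every other passage
    in the batch serves as a negative for query i.
    """
    assert len(query_vecs) == len(passage_vecs)
    total = 0.0
    for i, q in enumerate(query_vecs):
        scores = [dot(q, p) / temperature for p in passage_vecs]
        # Numerically stable log-sum-exp over all passages in the batch.
        m = max(scores)
        log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
        # Negative log-probability of the positive passage.
        total += log_norm - scores[i]
    return total / len(query_vecs)


# Toy (hypothetical) embeddings: query i matches passage i.
queries = [[1.0, 0.0], [0.0, 1.0]]
passages = [[1.0, 0.0], [0.0, 1.0]]

aligned = in_batch_contrastive_loss(queries, passages)
shuffled = in_batch_contrastive_loss(queries, passages[::-1])
```

In this toy batch the loss is lower when each query is paired with its own passage (`aligned`) than when the pairing is reversed (`shuffled`), which is the behavior a contrastive objective is meant to reward.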