This paper introduces query2doc, a simple yet effective query expansion approach that improves both sparse and dense retrieval systems. The method first generates pseudo-documents by few-shot prompting a large language model (LLM), and then expands the query with the generated pseudo-documents. Because LLMs are trained on web-scale text corpora and are adept at memorizing knowledge, their pseudo-documents often contain highly relevant information that helps disambiguate the query and guide the retriever. Experimental results show that query2doc improves the performance of BM25 by 3% to 15% on ad-hoc IR datasets such as MS-MARCO and TREC DL, without any model fine-tuning. The method also benefits state-of-the-art dense retrievers, both in-domain and out-of-domain.
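The two steps in the abstract (few-shot prompting an LLM for a pseudo-document, then expanding the query with it) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the prompt wording, the function names, the `generate` callable standing in for an LLM, and the `query_repeats` parameter are all hypothetical, and the abstract does not specify how the query and pseudo-document are combined.

```python
from typing import Callable, List, Tuple


def build_few_shot_prompt(examples: List[Tuple[str, str]], query: str) -> str:
    """Format (query, passage) demonstrations, then the new query to complete.

    The instruction text and layout are assumptions made for illustration.
    """
    parts = ["Write a passage that answers the given query:", ""]
    for ex_query, ex_passage in examples:
        parts.append(f"Query: {ex_query}")
        parts.append(f"Passage: {ex_passage}")
        parts.append("")
    parts.append(f"Query: {query}")
    parts.append("Passage:")
    return "\n".join(parts)


def expand_query(query: str, pseudo_doc: str, query_repeats: int = 1) -> str:
    """Concatenate the query with the pseudo-document.

    `query_repeats` is a hypothetical knob: repeating the short query keeps
    its terms from being drowned out by the longer pseudo-document when the
    result is fed to a term-weighting retriever such as BM25.
    """
    if query_repeats < 1:
        raise ValueError("query_repeats must be at least 1")
    return " ".join([query] * query_repeats + [pseudo_doc.strip()])


def query2doc_expand(
    query: str,
    examples: List[Tuple[str, str]],
    generate: Callable[[str], str],
    query_repeats: int = 1,
) -> str:
    """Run both steps: prompt the LLM, then expand the query with its output.

    `generate` is any callable mapping a prompt string to generated text;
    no particular LLM API is assumed.
    """
    prompt = build_few_shot_prompt(examples, query)
    pseudo_doc = generate(prompt)
    return expand_query(query, pseudo_doc, query_repeats)


if __name__ == "__main__":
    # Stub standing in for a real LLM call, so the sketch runs offline.
    def fake_llm(prompt: str) -> str:
        return " Zebras are striped African equids. "

    demos = [("what is bm25", "BM25 is a ranking function used in search.")]
    print(query2doc_expand("what is a zebra", demos, fake_llm, query_repeats=2))
```

The expanded string would then be passed to an existing retriever in place of the original query, which is why no retriever fine-tuning is needed in the sparse setting.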