This paper introduces a simple yet effective query expansion approach, denoted as query2doc, to improve both sparse and dense retrieval systems. The proposed method first generates pseudo-documents by few-shot prompting large language models (LLMs), and then expands the query with generated pseudo-documents. LLMs are trained on web-scale text corpora and are adept at knowledge memorization. The pseudo-documents from LLMs often contain highly relevant information that can aid in query disambiguation and guide the retrievers. Experimental results demonstrate that query2doc boosts the performance of BM25 by 3% to 15% on ad-hoc IR datasets, such as MS-MARCO and TREC DL, without any model fine-tuning. Furthermore, our method also benefits state-of-the-art dense retrievers in terms of both in-domain and out-of-domain results.
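The two steps described in the abstract (few-shot prompting an LLM for a pseudo-document, then expanding the query with it) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the abstract: the instruction wording in the prompt, the `generate` callable standing in for an LLM, the `query_repeats` parameter, and the plain whitespace concatenation used to form the expanded query. The stub `fake_llm` returns a fixed string so the example runs without any model.

```python
from typing import Callable, List, Sequence, Tuple


def build_few_shot_prompt(query: str, examples: Sequence[Tuple[str, str]]) -> str:
    """Assemble a few-shot prompt from (query, passage) demonstration pairs.

    The instruction wording is a hypothetical choice for this sketch.
    """
    lines: List[str] = ["Write a passage that answers the given query.", ""]
    for ex_query, ex_passage in examples:
        lines += [f"Query: {ex_query}", f"Passage: {ex_passage}", ""]
    # The final, unanswered slot is what the LLM is asked to complete.
    lines += [f"Query: {query}", "Passage:"]
    return "\n".join(lines)


def expand_query(
    query: str,
    examples: Sequence[Tuple[str, str]],
    generate: Callable[[str], str],
    query_repeats: int = 1,
) -> str:
    """Return the query concatenated with an LLM-generated pseudo-document.

    `generate` is any function mapping a prompt to generated text.
    `query_repeats` (an assumption of this sketch) repeats the original
    query so that its terms are not drowned out by the longer pseudo-document.
    """
    prompt = build_few_shot_prompt(query, examples)
    pseudo_doc = generate(prompt).strip()
    return " ".join([query] * query_repeats + [pseudo_doc])


def fake_llm(prompt: str) -> str:
    """Stand-in for a real LLM call; returns a fixed pseudo-document."""
    return "Zebras are African equines with black-and-white striped coats."


examples = [
    ("what is the capital of france",
     "Paris is the capital and largest city of France."),
]
expanded = expand_query("what do zebras look like", examples, fake_llm, query_repeats=2)
print(expanded)
```

The expanded string would then be passed to an unmodified retriever such as BM25 in place of the original query, which is consistent with the abstract's claim that no model fine-tuning is required.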