Large Language Models (LLMs) have emerged as a pivotal force in language technology. Their strong reasoning capabilities and broad knowledge have enabled exceptional zero-shot generalization across many areas of natural language processing, including information retrieval (IR). In this paper, we investigate in depth the utility of LLM-generated documents for IR. We introduce a simple yet effective framework, Multi-Text Generation Integration (MuGI), to augment existing IR methods. Specifically, we prompt LLMs to generate multiple pseudo references and integrate them with the query for retrieval. The training-free MuGI model surpasses existing query expansion strategies, setting a new standard in sparse retrieval. It outperforms supervised counterparts such as ANCE and DPR, improving BM25 by more than 18% on the TREC DL dataset and by 7.5% on BEIR. Building on MuGI, we construct a fast, high-fidelity re-ranking pipeline that allows a relatively small 110M-parameter retriever to surpass larger 3B models in in-domain evaluations while also bridging the gap in out-of-distribution settings. We release our code and all generated references at https://github.com/lezhang7/Retrieval_MuGI.
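The core idea in the abstract, generating several pseudo references and integrating them with the query before sparse retrieval, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the released code: the pseudo references are hard-coded stand-ins for LLM outputs, integration is plain string concatenation, the corpus is a three-document toy, and `bm25_scores` is a textbook BM25 written from scratch rather than the retriever used in the paper.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (a simplification for this sketch)."""
    return text.lower().split()


def bm25_scores(query_tokens, docs_tokens, k1=1.2, b=0.75):
    """Score every document against the query with a textbook BM25 formula."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs_tokens for term in set(d))
    query_tf = Counter(query_tokens)
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term, qf in query_tf.items():
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            # Terms repeated in the (expanded) query are weighted by their count.
            score += qf * idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def expand_query(query, pseudo_references):
    """Hypothetical integration step: concatenate query and pseudo references."""
    return " ".join([query] + list(pseudo_references))


def top_index(scores):
    """Index of the highest-scoring document."""
    return max(range(len(scores)), key=scores.__getitem__)


corpus = [
    "a python is a large snake",
    "python is a programming language that frees unused objects with a garbage collector",
    "bm25 is a ranking function used by search engines",
]
docs_tokens = [tokenize(doc) for doc in corpus]

query = "how does python manage memory"

# Stand-ins for LLM-generated pseudo references (hard-coded, not real model output).
pseudo_references = [
    "python uses a garbage collector and reference counting to free unused objects",
    "the language runtime frees memory when objects are no longer referenced",
]

raw_scores = bm25_scores(tokenize(query), docs_tokens)
expanded_scores = bm25_scores(
    tokenize(expand_query(query, pseudo_references)), docs_tokens
)

print("top document, raw query:     ", corpus[top_index(raw_scores)])
print("top document, expanded query:", corpus[top_index(expanded_scores)])
```

In this toy corpus the raw query matches only on the word "python", so the shorter snake document ranks first. The expanded query adds vocabulary such as "garbage", "collector", and "objects", which moves the programming-language document to the top. The example only shows the mechanism; it says nothing about the effect sizes reported in the abstract.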