Large Language Models (LLMs) have emerged as a pivotal force in language technology. Their strong reasoning capabilities and broad knowledge have enabled exceptional zero-shot generalization across many areas of natural language processing, including information retrieval (IR). In this paper, we investigate in depth how useful LLM-generated documents are for IR. We introduce a simple yet effective framework, Multi-Text Generation Integration (MuGI), to augment existing IR methods. Specifically, we prompt LLMs to generate multiple pseudo references and integrate them with the query for retrieval. The training-free MuGI model surpasses existing query expansion strategies, setting a new standard in sparse retrieval. It also outperforms supervised counterparts such as ANCE and DPR, improving BM25 by over 18% on the TREC DL dataset and by 7.5% on BEIR. With MuGI, we also build a fast, high-fidelity re-ranking pipeline. This allows a relatively small 110M-parameter retriever to surpass larger 3B-parameter models in in-domain evaluations, while also bridging the gap in out-of-distribution settings. We release our code and all generated references at https://github.com/lezhang7/Retrieval_MuGI.
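To make the idea of integrating pseudo references with the query concrete, the following is a minimal sketch, not the released implementation. It assumes the pseudo references have already been produced by an LLM (here they are hard-coded strings), uses a textbook BM25 written from scratch, and merges query and references by plain concatenation with the query repeated a fixed number of times. The function names, the `query_repeats` parameter, and the toy corpus are all hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase alphanumeric tokenizer; a stand-in for a real analyzer."""
    return re.findall(r"[a-z0-9]+", text.lower())


def expand_query(query, pseudo_refs, query_repeats=3):
    """Concatenate the query with LLM-generated pseudo references.

    Repeating the query keeps it from being drowned out by the longer
    references. The repeat count is an assumption made for this sketch.
    """
    return " ".join([query] * query_repeats + list(pseudo_refs))


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document against the query with a textbook BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    query_tf = Counter(tokenize(query))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term, qtf in query_tf.items():
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = 1 - b + b * len(toks) / avg_len
            term_weight = tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
            # A term repeated in the expanded query contributes once per occurrence.
            score += qtf * idf * term_weight
        scores.append(score)
    return scores


def rank(query, docs):
    """Return document indices sorted by descending BM25 score."""
    scores = bm25_scores(query, docs)
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)


# Toy corpus: the relevant document shares no words with the raw query.
docs = [
    "aspirin and ibuprofen relieve headache pain",
    "the stock market closed higher today",
    "bm25 is a ranking function used by search engines",
]
query = "what helps a migraine"
# Hard-coded stand-ins for what an LLM might generate for this query.
pseudo_refs = [
    "Migraine pain is often treated with ibuprofen or aspirin.",
    "Over the counter pain relievers such as aspirin can ease a headache.",
]

raw_ranking = rank(query, docs)
expanded_ranking = rank(expand_query(query, pseudo_refs), docs)
```

In this toy corpus the raw query matches only the stop word "a", so BM25 puts the unrelated third document first. The expanded query picks up "aspirin", "ibuprofen", "headache", and "pain" from the pseudo references, which moves the relevant first document to the top. Generating several references rather than one gives the sparse retriever more chances to hit the vocabulary of the relevant documents.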