Recently, InPars introduced a method to efficiently use large language models (LLMs) in information retrieval tasks: via few-shot examples, an LLM is induced to generate relevant queries for documents. These synthetic query-document pairs can then be used to train a retriever. However, InPars and, more recently, Promptagator, rely on proprietary LLMs such as GPT-3 and FLAN to generate such datasets. In this work we introduce InPars-v2, a dataset generator that uses open-source LLMs and existing powerful rerankers to select synthetic query-document pairs for training. A simple BM25 retrieval pipeline followed by a monoT5 reranker finetuned on InPars-v2 data achieves new state-of-the-art results on the BEIR benchmark. To allow researchers to further improve our method, we open source the code, synthetic data, and finetuned models: https://github.com/zetaalphavector/inPars/tree/master/tpu
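The selection step described above, in which a reranker decides which synthetic query-document pairs are kept for training, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the released implementation: `overlap_score` is a toy token-overlap stand-in for a real reranker such as monoT5, and `select_pairs` with its `keep` parameter is a hypothetical "keep the highest-scoring pairs" rule that the abstract does not specify.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (synthetic query, source document)


def overlap_score(query: str, document: str) -> float:
    """Toy stand-in for a reranker: fraction of query tokens found in the document."""
    q_tokens = query.lower().split()
    d_tokens = set(document.lower().split())
    if not q_tokens:
        return 0.0
    return sum(t in d_tokens for t in q_tokens) / len(q_tokens)


def select_pairs(
    pairs: List[Pair],
    scorer: Callable[[str, str], float],
    keep: int,
) -> List[Pair]:
    """Hypothetical selection rule: keep the `keep` highest-scoring pairs."""
    ranked = sorted(pairs, key=lambda p: scorer(p[0], p[1]), reverse=True)
    return ranked[:keep]


# Illustrative data only: three synthetic queries for the same document.
doc = "bm25 is a ranking function used by search engines"
pairs = [
    ("what is bm25", doc),
    ("how tall is everest", doc),
    ("ranking function for search", doc),
]
selected = select_pairs(pairs, overlap_score, keep=2)
```

Passing the scorer as a function keeps the selection logic independent of the model, so in practice the toy scorer could be swapped for a neural reranker's relevance score without changing `select_pairs`.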