Recent work has explored using Large Language Models (LLMs) to overcome the lack of training data for Information Retrieval (IR) tasks. The generalization abilities of these models make it possible to create synthetic in-domain data by providing instructions and a few examples in a prompt. InPars and Promptagator pioneered this approach, and both demonstrated the potential of LLMs as synthetic data generators for IR, which makes them an attractive solution for tasks that suffer from a lack of annotated data. However, the reproducibility of these methods has been limited: InPars' training scripts are based on TPUs, which are not widely accessible, and the code for Promptagator was not released and its proprietary LLM is not publicly accessible. To fully realize the potential of these methods and broaden their impact in the research community, the resources need to be accessible and easy to reproduce by researchers and practitioners. Our main contribution is a unified toolkit for end-to-end reproducible synthetic data generation research, covering generation, filtering, training, and evaluation. Additionally, we provide an interface to IR libraries widely used by the community and support for GPUs. Our toolkit not only reproduces the InPars method and partially reproduces Promptagator, but also offers plug-and-play functionality for using different LLMs, exploring filtering methods, and finetuning various reranker models on the generated data. We also release all the synthetic data generated in this work for the 18 datasets in the BEIR benchmark, which took more than 2,000 GPU hours to generate, as well as the reranker models finetuned on the synthetic data. Code and data are available at https://github.com/zetaalphavector/InPars
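
As a rough illustration of the generation and filtering steps mentioned above, the following minimal Python sketch assembles a few-shot prompt from an instruction and example document-query pairs, then keeps only the highest-scoring synthetic pairs. It is not the toolkit's API: the names `Example`, `build_prompt`, and `keep_top_k`, the prompt wording, and the assumption that each generated pair already carries a numeric score are all hypothetical and chosen only for illustration. No LLM is called here; the prompt string would be passed to whichever model is in use.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Example:
    """One hand-written (document, query) pair used as a few-shot example."""
    document: str
    query: str


def build_prompt(
    examples: List[Example],
    document: str,
    instruction: str = "Write a relevant query for each document.",  # hypothetical wording
) -> str:
    """Assemble instruction + few-shot examples + the target document.

    The prompt ends with an open 'Relevant Query:' slot for the LLM to complete.
    """
    parts = [instruction]
    for i, ex in enumerate(examples, start=1):
        parts.append(
            f"Example {i}:\nDocument: {ex.document}\nRelevant Query: {ex.query}"
        )
    parts.append(
        f"Example {len(examples) + 1}:\nDocument: {document}\nRelevant Query:"
    )
    return "\n\n".join(parts)


def keep_top_k(
    scored_pairs: List[Tuple[str, str, float]], k: int
) -> List[Tuple[str, str, float]]:
    """Filter synthetic (query, document, score) triples, keeping the k best.

    The score is assumed to come from elsewhere (for instance a generation
    log-probability or a reranker score); higher is treated as better.
    """
    return sorted(scored_pairs, key=lambda pair: pair[2], reverse=True)[:k]
```

In this sketch, the filtered triples would serve as positive training examples for a reranker; how scores are computed and how negatives are sampled is left open, since the abstract does not specify it.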