We investigate the usefulness of generative Large Language Models (LLMs) for generating training data for cross-encoder re-rankers in a novel direction: generating synthetic documents instead of synthetic queries. We introduce a new dataset, ChatGPT-RetrievalQA, and compare the effectiveness of models fine-tuned on LLM-generated and human-generated data. Data generated with generative LLMs can be used to augment training data, especially in domains with smaller amounts of labeled data. We build ChatGPT-RetrievalQA on an existing dataset, the Human ChatGPT Comparison Corpus (HC3), which consists of public question collections with human responses and answers from ChatGPT. We fine-tune a range of cross-encoder re-rankers on either human-generated or ChatGPT-generated data. Our evaluation on MS MARCO DEV, TREC DL'19, and TREC DL'20 demonstrates that cross-encoder re-ranking models trained on ChatGPT responses are statistically significantly more effective zero-shot re-rankers than those trained on human responses. In a supervised setting, the human-trained re-rankers outperform the LLM-trained re-rankers. Our findings suggest that generative LLMs have high potential for generating training data for neural retrieval models. Further work is needed to determine the effect of factually wrong information in the generated responses and to test the generalizability of our findings with open-source LLMs. We release our data, code, and cross-encoder checkpoints for future work.
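As a rough illustration of the setup described above, the following sketch builds two parallel training sets for a cross-encoder re-ranker from HC3-style records: one using human responses as the relevant documents and one using ChatGPT responses. It is a minimal sketch under stated assumptions, not the authors' pipeline. The field names (`question`, `human_answers`, `chatgpt_answers`), the helper `build_triples`, and the random sampling of negatives from other questions' answers are all assumptions made for illustration; the abstract does not say how negatives are chosen.

```python
import random
from typing import Dict, List, Tuple

# Assumed record layout (hypothetical, for illustration only):
# {"question": str, "human_answers": [str, ...], "chatgpt_answers": [str, ...]}
Record = Dict[str, object]
Triple = Tuple[str, str, str]  # (query, relevant document, non-relevant document)


def build_triples(
    records: List[Record],
    answer_field: str,
    num_negatives: int = 1,
    seed: int = 0,
) -> List[Triple]:
    """Pair each question with its own answers as positives and with
    answers to other questions as randomly sampled negatives.

    Random negatives are an assumption of this sketch, not a detail
    taken from the abstract.
    """
    rng = random.Random(seed)
    # Pool of (record index, answer) pairs for the chosen answer source.
    pool = [(i, a) for i, rec in enumerate(records) for a in rec[answer_field]]
    triples: List[Triple] = []
    for i, rec in enumerate(records):
        # Candidate negatives: answers that belong to a different question.
        candidates = [a for j, a in pool if j != i]
        if not candidates:
            continue
        k = min(num_negatives, len(candidates))
        for positive in rec[answer_field]:
            for negative in rng.sample(candidates, k):
                triples.append((rec["question"], positive, negative))
    return triples


# Toy records standing in for the real collection.
records = [
    {
        "question": "Why is the sky blue?",
        "human_answers": ["Air scatters blue light more.", "Rayleigh scattering."],
        "chatgpt_answers": ["Sunlight is scattered by molecules in the atmosphere."],
    },
    {
        "question": "What is a cross-encoder?",
        "human_answers": ["A model that scores a query and a document jointly."],
        "chatgpt_answers": ["A cross-encoder encodes the query-document pair together."],
    },
]

# One training set per answer source; each would be used to fine-tune
# a separate re-ranker so the two can be compared.
human_triples = build_triples(records, "human_answers")
chatgpt_triples = build_triples(records, "chatgpt_answers")
```

The two lists share the same queries and differ only in the source of the documents, which is what makes the comparison between human-trained and ChatGPT-trained re-rankers possible. Rebuilding the candidate list for every record is quadratic in the collection size, so a real pipeline would need a more efficient negative sampler.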