Generating Synthetic Documents for Cross-Encoder Re-Rankers: A Comparative Study of ChatGPT and Human Experts

We investigate the usefulness of generative Large Language Models (LLMs) in generating training data for cross-encoder re-rankers in a novel direction: generating synthetic documents instead of synthetic queries. We introduce a new dataset, ChatGPT-RetrievalQA, and compare the effectiveness of models fine-tuned on LLM-generated and human-generated data. Data generated with generative LLMs can be used to augment training data, especially in domains with smaller amounts of labeled data. We build ChatGPT-RetrievalQA on an existing dataset, the Human ChatGPT Comparison Corpus (HC3), which consists of public question collections with human responses and answers from ChatGPT. We fine-tune a range of cross-encoder re-rankers on either human-generated or ChatGPT-generated data. Our evaluation on MS MARCO DEV, TREC DL'19, and TREC DL'20 demonstrates that cross-encoder re-ranking models trained on ChatGPT responses are statistically significantly more effective zero-shot re-rankers than those trained on human responses. In a supervised setting, the human-trained re-rankers outperform the LLM-trained re-rankers. Our findings suggest that generative LLMs have high potential for generating training data for neural retrieval models. Further work is needed to determine the effect of factually wrong information in the generated responses and to test the generalizability of our findings with open-source LLMs. We release our data, code, and cross-encoder checkpoints for future work.
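As a rough illustration of the setup described above, the following sketch builds two parallel training sets of (query, document, label) pairs from HC3-style records, one using human responses as relevant documents and one using ChatGPT responses. The field names `question`, `human_answers`, and `chatgpt_answers`, the helper `build_pairs`, and the toy records are assumptions made for this example. Sampling negatives at random from answers to other questions is also an assumption, not necessarily the procedure used to build ChatGPT-RetrievalQA.

```python
import random
from typing import Dict, List, Tuple

# (query, document, relevance label); 1 = relevant, 0 = non-relevant
Pair = Tuple[str, str, int]


def build_pairs(
    records: List[Dict],
    source: str,
    negatives_per_query: int = 1,
    seed: int = 0,
) -> List[Pair]:
    """Build cross-encoder training pairs from HC3-style records.

    `source` is "human" or "chatgpt" and selects which answers act as
    documents. Negatives are drawn at random from answers to OTHER
    questions (an assumption made for this sketch).
    """
    rng = random.Random(seed)
    key = f"{source}_answers"
    # Pool of (record index, answer) used for negative sampling.
    pool = [(i, ans) for i, rec in enumerate(records) for ans in rec[key]]

    pairs: List[Pair] = []
    for i, rec in enumerate(records):
        # Every answer to this question is a positive document.
        for ans in rec[key]:
            pairs.append((rec["question"], ans, 1))
        # Sample negatives from answers that belong to other questions.
        candidates = [ans for j, ans in pool if j != i]
        k = min(negatives_per_query, len(candidates))
        for neg in rng.sample(candidates, k):
            pairs.append((rec["question"], neg, 0))
    return pairs


# Toy records, invented for illustration only.
records = [
    {
        "question": "What causes tides?",
        "human_answers": ["Mostly the moon's gravity."],
        "chatgpt_answers": [
            "Tides are caused by the gravitational pull of the moon and the sun."
        ],
    },
    {
        "question": "Why is the sky blue?",
        "human_answers": ["Air scatters blue light more than red."],
        "chatgpt_answers": [
            "The sky appears blue because of Rayleigh scattering."
        ],
    },
]

human_pairs = build_pairs(records, "human")
chatgpt_pairs = build_pairs(records, "chatgpt")
```

Each list could then be passed to any cross-encoder training loop, so that two otherwise identical re-rankers differ only in whether their training documents were written by humans or by ChatGPT.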
