This paper explores the use of large language models (LLMs) to annotate document utility for training retrieval and retrieval-augmented generation (RAG) systems, with the aim of reducing dependence on costly human annotations. These utility annotations address the gap between retrieval relevance and generative utility. To exploit multiple positive samples per query effectively, we introduce a novel loss that maximizes their summed marginal likelihood. Using the Qwen-2.5-32B model, we annotate utility on the MS MARCO dataset and conduct retrieval experiments on MS MARCO and BEIR, as well as RAG experiments on MS MARCO QA, NQ, and HotpotQA. Our results show that LLM-generated annotations improve out-of-domain retrieval performance and RAG outcomes compared with models trained solely on human annotations or on downstream QA metrics. Moreover, combining LLM annotations with only 20% of the human labels achieves performance comparable to using the full human annotations. Our study offers a comprehensive approach to using LLM annotations to initialize QA systems on new corpora.
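Read at face value, the objective over multiple positives is a contrastive likelihood whose numerator sums over all positive documents rather than selecting a single one. A minimal sketch of such a summed-marginal-likelihood loss, assuming an InfoNCE-style setup with a similarity score $s(q,d)$, temperature $\tau$, positive set $P_q$, and negative set $N_q$ (our notation, not necessarily the paper's exact formulation):

$$
\mathcal{L}(q) \;=\; -\log \frac{\sum_{d^{+} \in P_q} \exp\!\big(s(q, d^{+})/\tau\big)}{\sum_{d \in P_q \cup N_q} \exp\!\big(s(q, d)/\tau\big)} .
$$

Minimizing $\mathcal{L}(q)$ maximizes the summed softmax probability assigned to the positives, so every LLM-annotated useful document contributes to the gradient instead of only one designated positive per query.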