The application of large language models to provide relevance assessments presents exciting opportunities to advance information retrieval, natural language processing, and beyond, but to date many unknowns remain. This paper reports on the results of a large-scale evaluation (the TREC 2024 RAG Track) where four different relevance assessment approaches were deployed in situ: the "standard" fully manual process that NIST has implemented for decades and three different alternatives that take advantage of LLMs to different extents using the open-source UMBRELA tool. This setup allows us to correlate system rankings induced by the different approaches to characterize tradeoffs between cost and quality. We find that in terms of nDCG@20, nDCG@100, and Recall@100, system rankings induced by automatically generated relevance assessments from UMBRELA correlate highly with those induced by fully manual assessments across a diverse set of 77 runs from 19 teams. Our results suggest that automatically generated UMBRELA judgments can replace fully manual judgments to accurately capture run-level effectiveness. Surprisingly, we find that LLM assistance does not appear to increase correlation with fully manual assessments, suggesting that costs associated with human-in-the-loop processes do not bring obvious tangible benefits. Overall, human assessors appear to be stricter than UMBRELA in applying relevance criteria. Our work validates the use of LLMs in academic TREC-style evaluations and provides the foundation for future studies.
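The core of the evaluation methodology summarized above is comparing run-level rankings induced by different judgment sources. A minimal sketch of that comparison follows; the run names and scores are hypothetical, and Kendall's tau (here tau-a, which ignores ties) is the rank-correlation statistic conventionally used in such TREC leaderboard-agreement studies:

```python
from itertools import combinations

def kendall_tau_a(xs, ys):
    """Kendall tau-a: (concordant - discordant) / total pairs, no tie correction."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        s = (x1 - x2) * (y1 - y2)  # same sign => pair ordered consistently
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs

# Hypothetical per-run nDCG@20 scores under two judgment sources
# (e.g., fully manual NIST qrels vs. automatic UMBRELA qrels).
manual  = {"run_a": 0.61, "run_b": 0.58, "run_c": 0.49, "run_d": 0.44, "run_e": 0.37}
umbrela = {"run_a": 0.66, "run_b": 0.64, "run_c": 0.50, "run_d": 0.51, "run_e": 0.40}

runs = sorted(manual)  # fixed run order so scores align pairwise
tau = kendall_tau_a([manual[r] for r in runs], [umbrela[r] for r in runs])
print(f"Kendall tau = {tau:.3f}")  # high tau => the two judgment sources rank runs similarly
```

In this toy example only one run pair (run_c vs. run_d) is ordered differently by the two judgment sources, so tau = (9 - 1) / 10 = 0.8; the paper's finding is that real manual-vs-UMBRELA correlations are high enough that the rankings are effectively interchangeable at the run level.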