This report provides an initial look at partial results from the TREC 2024 Retrieval-Augmented Generation (RAG) Track. We have identified RAG evaluation as a barrier to continued progress in information access (and more broadly, natural language processing and artificial intelligence), and it is our hope that we can contribute to tackling the many challenges in this space. The central hypothesis we explore in this work is that the nugget evaluation methodology, originally developed for the TREC Question Answering Track in 2003, provides a solid foundation for evaluating RAG systems. As such, our efforts have focused on "refactoring" this methodology, specifically applying large language models to both automatically create nuggets and to automatically assign nuggets to system answers. We call this the AutoNuggetizer framework. Within the TREC setup, we are able to calibrate our fully automatic process against a manual process whereby nuggets are created by human assessors semi-manually and then assigned manually to system answers. Based on initial results across 21 topics from 45 runs, we observe a strong correlation between scores derived from a fully automatic nugget evaluation and a (mostly) manual nugget evaluation by human assessors. This suggests that our fully automatic evaluation process can be used to guide future iterations of RAG systems.
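The nugget assignment step described above can be sketched as a simple scoring function. This is a minimal illustrative sketch, not the track's official implementation: the importance labels ("vital"/"okay"), assignment labels ("support"/"partial_support"/"not_support"), and the averaging scheme below are assumptions made for illustration.

```python
# Hypothetical sketch of nugget-based scoring for a single topic.
# Assumes each nugget carries an importance label ("vital" or "okay"), and
# that a judge (human or LLM) assigns each nugget to a system answer with
# one of: "support", "partial_support", "not_support". All of these label
# names and weights are illustrative assumptions.

ASSIGN_VALUE = {"support": 1.0, "partial_support": 0.5, "not_support": 0.0}

def all_nuggets_score(nuggets, assignments):
    """Mean assignment value over all nuggets for one answer."""
    vals = [ASSIGN_VALUE[assignments[nid]] for nid, _ in nuggets]
    return sum(vals) / len(vals) if vals else 0.0

def vital_nuggets_score(nuggets, assignments):
    """Mean assignment value over vital nuggets only."""
    vals = [ASSIGN_VALUE[assignments[nid]]
            for nid, importance in nuggets if importance == "vital"]
    return sum(vals) / len(vals) if vals else 0.0

# Toy example: three nuggets for a topic, judged against one system answer.
nuggets = [("n1", "vital"), ("n2", "vital"), ("n3", "okay")]
assignments = {"n1": "support", "n2": "not_support", "n3": "partial_support"}
print(all_nuggets_score(nuggets, assignments))    # (1.0 + 0.0 + 0.5) / 3 = 0.5
print(vital_nuggets_score(nuggets, assignments))  # (1.0 + 0.0) / 2 = 0.5
```

Under this kind of scheme, the correlation reported in the abstract would be computed between per-run scores from LLM-produced nuggets/assignments and the same scores from human-produced ones.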