Have you tried out the new Bing Search? Or maybe you have fiddled around with Google AI Overviews? These might sound familiar because the modern-day search stack has recently evolved to include retrieval-augmented generation (RAG) systems. They allow searching and incorporating real-time data into large language models (LLMs) to provide a well-informed, attributed, concise summary, in contrast to the traditional search paradigm that relies on displaying a ranked list of documents. Given these recent advancements, it is crucial to have an arena to build, test, visualize, and systematically evaluate RAG-based search systems. With this in mind, we propose the TREC 2024 RAG Track to foster innovation in evaluating RAG systems. In this work, we lay out the steps we have taken towards making this track a reality: we describe the details of our reusable framework, Ragnarök, explain the curation of the new MS MARCO V2.1 collection, release the development topics for the track, and standardize the I/O definitions to assist the end user. Next, using Ragnarök, we identify and provide key industrial baselines such as OpenAI's GPT-4o and Cohere's Command R+. Further, we introduce a web-based user interface for an interactive arena that allows pairwise benchmarking of RAG systems through crowdsourcing. We open-source our Ragnarök framework and baselines to achieve a unified standard for future RAG systems.