Retrieval-Augmented Generation (RAG) enhances Large Language Models with external knowledge. In response to the recent popularity of generative LLMs, many RAG approaches have been proposed, involving an intricate combination of configurations such as evaluation datasets, collections, metrics, retrievers, and LLMs. Inconsistent benchmarking poses a major challenge in comparing approaches and understanding the impact of each component in the pipeline. In this work, we study best practices that lay the groundwork for a systematic evaluation of RAG and present BERGEN, an end-to-end library for reproducible research that standardizes RAG experiments. In an extensive study focusing on QA, we benchmark different state-of-the-art retrievers, rerankers, and LLMs. Additionally, we analyze existing RAG metrics and datasets. Our open-source library BERGEN is available at \url{https://github.com/naver/bergen}.