In the rapidly evolving landscape of Natural Language Processing (NLP) and text generation, Retrieval Augmented Generation (RAG) has emerged as a promising avenue for improving the quality and reliability of generated text by grounding it in information retrieved from a user-specified database. Benchmarking is essential to evaluate and compare the performance of different RAG configurations, in terms of both retriever and generator, providing insights into their effectiveness, scalability, and suitability for a specific domain and its applications. In this paper, we present a comprehensive framework for generating domain-relevant RAG benchmarks. Our framework is based on automatic question-answer generation with human (domain expert)-AI teaming using Large Language Models (LLMs). As a case study, we demonstrate the framework by introducing WeQA, a first-of-its-kind benchmark for the wind energy domain, comprising multiple scientific documents and reports on the environmental impact of wind energy projects. Our framework systematically evaluates RAG performance using diverse metrics and multiple question types of varying complexity. We also report the performance of different models on our benchmark.
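The three-stage pipeline the abstract describes (an LLM drafts question-answer pairs per question type, a domain expert filters them, and the surviving pairs score a RAG configuration) can be pictured with the minimal sketch below. This is an illustration, not the paper's implementation: `call_llm` is a hypothetical stand-in for whatever LLM API is used, `rag_pipeline` is any retriever-plus-generator callable, and the Q:/A: output format and LLM-as-judge scoring are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class QAPair:
    question: str
    reference_answer: str
    question_type: str  # e.g. "closed", "open", "comparison", "multi-hop"


def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for an LLM API call; replace with a real client."""
    raise NotImplementedError


def parse_qa(raw: str, question_type: str) -> list[QAPair]:
    """Parse alternating 'Q: ...' / 'A: ...' lines from the LLM output."""
    pairs, question = [], None
    for line in raw.splitlines():
        if line.startswith("Q:"):
            question = line[2:].strip()
        elif line.startswith("A:") and question is not None:
            pairs.append(QAPair(question, line[2:].strip(), question_type))
            question = None
    return pairs


def generate_qa_pairs(document: str, question_type: str) -> list[QAPair]:
    """Stage 1: the LLM drafts QA pairs answerable only from the document."""
    prompt = (
        f"From the document below, write three {question_type} questions that "
        "are answerable only from the text. Format each as 'Q: ...' followed "
        f"by 'A: ...' giving a reference answer.\n\n{document}"
    )
    return parse_qa(call_llm(prompt), question_type)


def expert_review(pairs: list[QAPair]) -> list[QAPair]:
    """Stage 2: human-in-the-loop filter; keep only expert-approved pairs."""
    return [p for p in pairs if input(f"Keep? {p.question} [y/n] ").lower() == "y"]


def evaluate_rag(benchmark: list[QAPair], rag_pipeline) -> float:
    """Stage 3: score one retriever+generator configuration on the benchmark."""
    correct = 0
    for pair in benchmark:
        candidate = rag_pipeline(pair.question)  # retrieve, then generate
        verdict = call_llm(
            f"Question: {pair.question}\n"
            f"Reference answer: {pair.reference_answer}\n"
            f"Candidate answer: {candidate}\n"
            "Does the candidate agree with the reference? Answer y or n."
        )
        correct += verdict.strip().lower().startswith("y")
    return correct / len(benchmark)
```

Because the benchmark is fixed once the expert review is done, `evaluate_rag` can be re-run over many retriever and generator combinations to compare configurations on equal footing.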