Retrieval-Augmented Generation (RAG) has recently emerged as a promising solution to alleviate Large Language Models' (LLMs) lack of knowledge. Existing RAG datasets, however, do not adequately represent the diverse and dynamic nature of real-world Question Answering (QA) tasks. To bridge this gap, we introduce the Comprehensive RAG Benchmark (CRAG), a factual question answering benchmark of 4,409 question-answer pairs and mock APIs to simulate web and Knowledge Graph (KG) search. CRAG is designed to encapsulate a diverse array of questions across five domains and eight question categories, reflecting varied entity popularity from popular to long-tail, and temporal dynamism ranging from years to seconds. Our evaluation of this benchmark highlights the gap to fully trustworthy QA: the most advanced LLMs achieve <=34% accuracy on CRAG, and adding RAG in a straightforward manner improves accuracy only to 44%. State-of-the-art industry RAG solutions answer only 63% of questions without any hallucination. CRAG also reveals much lower accuracy in answering questions regarding facts with higher dynamism, lower popularity, or higher complexity, suggesting future research directions. The CRAG benchmark laid the groundwork for a KDD Cup 2024 challenge and attracted thousands of participants and submissions. We commit to maintaining CRAG to serve research communities in advancing RAG solutions and general QA solutions. CRAG is available at https://github.com/facebookresearch/CRAG/.