Retrieval-augmented generation (RAG) is a promising approach to addressing the limitations of fixed knowledge in large language models (LLMs). However, current benchmarks for evaluating RAG systems suffer from two key deficiencies: (1) they fail to adequately measure LLMs' capability in handling \emph{long-context retrieval}, owing to a lack of datasets that reflect the characteristics of retrieved documents, and (2) they lack a comprehensive evaluation method for assessing LLMs' ability to generate \emph{long-form responses} that effectively exploit retrieved information. To address these shortcomings, we introduce the \textsc{Long$^2$RAG} benchmark and the Key Point Recall (\textit{KPR}) metric. \textsc{Long$^2$RAG} comprises 280 questions spanning 10 domains and 8 question categories, each associated with 5 retrieved documents with an average length of 2,444 words. \textit{KPR} measures the extent to which LLMs incorporate key points extracted from the retrieved documents into their generated responses, providing a more nuanced assessment of their ability to exploit retrieved information. Our dataset and scripts are available at https://github.com/QZH-777/longrag.