Retrieval Augmented Generation (RAG) has become a popular application of large language models. Successful RAG systems should provide accurate answers that are grounded in a passage, free of hallucinations. While considerable work is required to build a full RAG pipeline, the ability to benchmark performance is equally necessary. We present ClapNQ, a benchmark Long-form Question Answering dataset for the full RAG pipeline. ClapNQ includes long answers with grounded gold passages from Natural Questions (NQ) and a corpus for performing retrieval, generation, or the full RAG pipeline. ClapNQ answers are concise, 3x smaller than the full passage, and cohesive: each answer is composed fluently, often by integrating multiple non-contiguous pieces of the passage. RAG models must adapt to these properties to be successful at ClapNQ. We present baseline experiments and analysis for ClapNQ that highlight areas where there is still significant room for improvement in grounded RAG. ClapNQ is publicly available at https://github.com/primeqa/clapnq