Retrieval Augmented Generation (RAG) has become a popular application for large language models. A successful RAG system should provide accurate answers that are grounded in a passage, without hallucinations. While building a full RAG pipeline requires considerable work, being able to benchmark its performance is just as necessary. We present ClapNQ, a benchmark Long-form Question Answering dataset for the full RAG pipeline. ClapNQ includes long answers with grounded gold passages from Natural Questions (NQ) and a corpus that supports retrieval, generation, or the full RAG pipeline. ClapNQ answers are concise, 3x smaller than the full passage, and cohesive, combining multiple non-contiguous pieces of the passage. RAG models must adapt to these properties to succeed on ClapNQ. We present baseline experiments and analysis for ClapNQ that highlight areas where there is still significant room for improvement in grounded RAG. ClapNQ is publicly available at https://github.com/primeqa/clapnq