Real-world use cases often present RAG systems with complex queries for which relevant information is missing from the corpus or incomplete. In these settings, RAG systems must be able to reject unanswerable, out-of-scope queries and to identify failures of retrieval and multi-hop reasoning. Despite this, existing RAG benchmarks rarely reflect realistic task complexity for multi-hop or out-of-scope questions, which can often be cheated via disconnected reasoning (i.e., solved without genuine multi-hop inference) or require only simple factual recall. This limits the ability of such benchmarks to uncover the limitations of existing RAG systems. To address this gap, we present the first pipeline for automatic, difficulty-controlled creation of un$\underline{c}$heatable, $\underline{r}$ealistic, $\underline{u}$nanswerable, and $\underline{m}$ulti-hop $\underline{q}$uerie$\underline{s}$ (CRUMQs), adaptable to any corpus and domain. We use our pipeline to create CRUMQs over two popular RAG datasets and demonstrate its effectiveness via benchmark experiments on leading retrieval-augmented LLMs. Results show that, compared with prior RAG benchmarks, CRUMQs are highly challenging for RAG systems, achieving up to an 81.0\% reduction in cheatability scores. More broadly, our pipeline offers a simple way to increase benchmark difficulty and drive the development of more capable RAG systems.