Real-world use cases often present RAG systems with complex queries for which relevant information is missing from the corpus or is incomplete. In these settings, RAG systems must be able to reject unanswerable, out-of-scope queries and to identify failures of retrieval and multi-hop reasoning. Yet existing RAG benchmarks rarely reflect realistic task complexity for multi-hop or out-of-scope questions, which can often be cheated via disconnected reasoning (i.e., solved without genuine multi-hop inference) or require only simple factual recall. This limits the ability of such benchmarks to uncover the limitations of existing RAG systems. To address this gap, we present the first pipeline for the automatic, difficulty-controlled creation of un$\underline{c}$heatable, $\underline{r}$ealistic, $\underline{u}$nanswerable, and $\underline{m}$ulti-hop $\underline{q}$uerie$\underline{s}$ (CRUMQs), adaptable to any corpus and domain. We use our pipeline to create CRUMQs over two popular RAG datasets and demonstrate its effectiveness via benchmark experiments on leading retrieval-augmented LLMs. Results show that, compared to prior RAG benchmarks, CRUMQs are highly challenging for RAG systems, achieving up to an 81.0\% reduction in cheatability scores. More broadly, our pipeline offers a simple way to enhance benchmark difficulty and realism and to drive the development of more capable RAG systems.