Retrieval-Augmented Generation (RAG) has been used in question answering (QA) systems to improve performance when the relevant information lies in one passage (single-hop) or several passages (multi-hop). However, many real-life scenarios (e.g., working with financial, legal, or medical reports) require checking every document for relevant information, with no clear stopping condition. We term these pluri-hop questions and formalize them through three conditions: recall sensitivity, exhaustiveness, and exactness. To study this setting, we introduce PluriHopWIND, a multilingual diagnostic benchmark of 48 pluri-hop questions over 191 real wind-industry reports, whose high repetitiveness reflects the challenge of distractors in real-world datasets. Naive, graph-based, and multimodal RAG methods reach at most 40% statement-wise F1 on PluriHopWIND. Motivated by this, we propose PluriHopRAG, which learns from synthetic examples to decompose queries according to corpus-specific document structure and employs a cross-encoder filter at the document level to minimize costly LLM reasoning. We evaluate PluriHopRAG on PluriHopWIND and on the Loong benchmark, which is built on financial, legal, and scientific reports. On PluriHopWIND, our method improves F1 by 18-52% across base LLMs; on Loong, it improves by 33% over long-context reasoning and by 52% over naive RAG.
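The document-level filtering idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the PluriHopRAG implementation: the abstract does not say how passage scores are aggregated per document or how the cutoff is chosen, so the max-over-passages rule, the `threshold` value, and the names `filter_documents` and `toy_overlap_score` are all assumptions made for illustration. The toy scorer merely stands in for a real cross-encoder, which would score each (query, passage) pair jointly.

```python
from typing import Callable, Dict, List

# Hypothetical scorer type: (query, passage) -> relevance score in [0, 1].
# In practice this slot would be filled by a cross-encoder model.
Scorer = Callable[[str, str], float]


def filter_documents(
    query: str,
    docs: Dict[str, List[str]],
    score: Scorer,
    threshold: float = 0.5,  # assumed cutoff, not taken from the paper
) -> List[str]:
    """Return ids of documents whose best-scoring passage meets the threshold.

    Aggregating by the maximum passage score is an assumption; documents
    that fail the check are never passed on to the (costly) LLM step.
    """
    kept = []
    for doc_id, passages in docs.items():
        best = max((score(query, p) for p in passages), default=0.0)
        if best >= threshold:
            kept.append(doc_id)
    return kept


def toy_overlap_score(query: str, passage: str) -> float:
    """Stand-in scorer: fraction of query words that appear in the passage."""
    q_words = set(query.lower().split())
    p_words = set(passage.lower().split())
    return len(q_words & p_words) / len(q_words) if q_words else 0.0


if __name__ == "__main__":
    corpus = {
        "report_a": ["blade crack found on turbine 3", "weather was clear"],
        "report_b": ["invoice total and payment terms"],
    }
    print(filter_documents("blade crack turbine", corpus, toy_overlap_score))
```

Because every document is scored before any LLM call, a filter of this shape keeps the exhaustive pass over the corpus cheap while reserving LLM reasoning for the documents that survive.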