Retrieval-augmented question answering over heterogeneous corpora requires connecting evidence across text, tables, and graph nodes. While entity-level knowledge graphs support structured access, they are costly to construct and maintain, and inefficient to traverse at query time. In contrast, standard retriever-reader pipelines use flat similarity search over independently chunked text, missing multi-hop evidence chains across modalities. We propose SAGE (Structure Aware Graph Expansion), a framework that (i) constructs a chunk-level graph offline using metadata-driven similarities with percentile-based pruning, and (ii) performs online retrieval by running an initial baseline retriever to obtain k seed chunks, expanding their first-hop neighbors, and then filtering those neighbors with dense+sparse retrieval to select k' additional chunks. We instantiate the initial retriever with hybrid dense+sparse retrieval for implicit cross-modal corpora and with SPARK (Structure Aware Planning Agent for Retrieval over Knowledge Graphs), an agentic retriever, for explicit schema graphs. On OTT-QA and STaRK, SAGE improves retrieval recall by 5.7 and 8.5 points over baselines, respectively.
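The online retrieval stage described above (seed, expand, filter) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the offline chunk graph is assumed to be an adjacency dict, and the function names (`score_dense`, `score_sparse`) and the mixing weight `alpha` are hypothetical placeholders for the dense and sparse scorers.

```python
def sage_expand(query, graph, seed_chunks, k_prime,
                score_dense, score_sparse, alpha=0.5):
    """Expand the first-hop neighbors of the k seed chunks and keep
    the top-k' neighbors by a hybrid dense+sparse score (sketch)."""
    # Collect first-hop neighbors of the seed chunks, excluding
    # chunks that were already retrieved as seeds.
    seeds = set(seed_chunks)
    candidates = set()
    for chunk in seed_chunks:
        candidates.update(graph.get(chunk, ()))
    candidates -= seeds

    # Hybrid relevance: a convex combination of dense and sparse
    # scores (alpha is an assumed hyperparameter, not from the paper).
    def hybrid(chunk):
        return (alpha * score_dense(query, chunk)
                + (1 - alpha) * score_sparse(query, chunk))

    # Keep the k' highest-scoring neighbors as additional evidence.
    ranked = sorted(candidates, key=hybrid, reverse=True)
    return list(seed_chunks) + ranked[:k_prime]
```

In this sketch the expansion is restricted to one hop, matching the abstract's description; the filtering step simply re-ranks the neighbor set with the same hybrid scorer used for seeding.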