Retrieval-augmented generation (RAG) for document-based open-domain question answering (ODQA) on large-scale industrial corpora faces two critical bottlenecks: routing failure, where the correct document is not located, and evidence fragmentation, where scattered information is not integrated. Existing approaches that rely on flat text chunks or page-level images struggle to (i) pinpoint the target document among thousands of candidates and (ii) connect multimodal evidence, such as tables and figures, within a limited token budget. To address these challenges, we propose HiKEY, a hierarchical tree-based multimodal retrieval framework that treats document hierarchy as a first-class retrieval signal. Instead of simple chunking, HiKEY reconstructs a logical heterogeneous graph via Document Hierarchical Parsing (DHP), explicitly encoding parent-child relationships. Following a coarse-to-fine strategy, the framework (1) performs global routing, using hierarchical indexing to rapidly prune the search space, and (2) conducts fine-grained retrieval, ranking sections with a multimodal fusion strategy that captures the most discriminative evidence. Finally, HiKEY assembles a token-efficient evidence subgraph via a hybrid structural-semantic packing strategy. Experiments on ODQA benchmarks show that HiKEY significantly outperforms page- and chunk-based baselines, improving retrieval recall by up to 12.9% and end-to-end QA performance by up to 6.8%.
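To make the coarse-to-fine flow concrete, the following is a minimal text-only sketch of the three stages the abstract names: routing to a document, ranking its sections, and packing evidence under a token budget. It is not the HiKEY implementation. The `Node` class, the function names, the token-overlap score (a stand-in for the multimodal fusion scoring), and the greedy packing rule (a stand-in for the hybrid structural-semantic packing) are all assumptions introduced here for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical tree node: a document root or one of its sections."""
    title: str
    text: str = ""
    children: list["Node"] = field(default_factory=list)


def overlap(query: str, text: str) -> int:
    """Toy relevance score: count of shared lowercase tokens (assumed stand-in)."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def subtree_text(node: Node) -> str:
    """Concatenate the titles and text of a node and all of its descendants."""
    return " ".join([node.title, node.text] + [subtree_text(c) for c in node.children])


def route(query: str, docs: list[Node], k: int = 1) -> list[Node]:
    """Coarse step: keep the k documents whose whole subtree best matches the query."""
    return sorted(docs, key=lambda d: overlap(query, subtree_text(d)), reverse=True)[:k]


def rank_sections(query: str, doc: Node) -> list[tuple[Node, Node]]:
    """Fine step: score every (parent, section) pair inside one routed document."""
    pairs, stack = [], [doc]
    while stack:
        parent = stack.pop()
        for child in parent.children:
            pairs.append((parent, child))
            stack.append(child)
    return sorted(pairs, key=lambda p: overlap(query, p[1].title + " " + p[1].text), reverse=True)


def pack(ranked: list[tuple[Node, Node]], budget: int) -> list[str]:
    """Greedy packing: add sections in rank order, with parent title as context, while they fit."""
    out, used = [], 0
    for parent, section in ranked:
        snippet = f"{parent.title} > {section.title}: {section.text}"
        cost = len(snippet.split())  # whitespace tokens as a crude token count
        if used + cost <= budget:
            out.append(snippet)
            used += cost
    return out


# Toy corpus (invented for this example only).
docs = [
    Node("Pump manual", children=[
        Node("Installation", "mount the pump on a level base"),
        Node("Maintenance", "replace the seal every 2000 hours"),
    ]),
    Node("Valve manual", children=[
        Node("Installation", "fit the valve inline with the flow arrow"),
    ]),
]
query = "how often to replace the pump seal"
top = route(query, docs, k=1)
evidence = pack(rank_sections(query, top[0]), budget=12)
print(evidence)
```

In this toy run the router keeps the pump manual, the ranker places "Maintenance" above "Installation", and the 12-token budget admits only the first snippet. Prefixing each snippet with its parent title is one simple way to carry the parent-child relationship into the packed context; a real system would presumably use learned retrievers and a richer structure than this sketch.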