Retrieval over visually rich documents is essential for tasks such as legal discovery, scientific search, and enterprise knowledge management. Existing approaches fall into two paradigms: single-vector retrieval, which is efficient but coarse, and multi-vector retrieval, which is accurate but computationally expensive. To address this trade-off, we propose HEAVEN, a two-stage hybrid-vector framework. In the first stage, HEAVEN efficiently retrieves candidate pages using a single-vector method over Visually-Summarized Pages (VS-Pages), each of which assembles representative visual layouts from multiple pages. In the second stage, it reranks the candidates with a multi-vector method, filtering query tokens by linguistic importance to reduce redundant computation. To evaluate retrieval systems under realistic conditions, we also introduce ViMDOC, the first benchmark for visually rich, multi-document, and long-document retrieval. Across four benchmarks, HEAVEN attains on average 99.87% of the Recall@1 performance of multi-vector models while reducing per-query computation by 99.82%, achieving both efficiency and accuracy. Our code and datasets are available at: https://github.com/juyeonnn/HEAVEN
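The two-stage flow described above can be illustrated with a minimal sketch. This is not the HEAVEN implementation from the repository: it assumes dot-product scoring for the single-vector stage, a MaxSim-style late-interaction score for the multi-vector stage, and a precomputed boolean `keep_mask` standing in for the linguistic-importance filter, whose actual criterion the abstract does not specify. VS-Page construction is omitted, and all function names are hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product between two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def stage1_candidates(query_vec: Vector, page_vecs: List[Vector], k: int) -> List[int]:
    """Stage 1 (assumed form): rank pages by single-vector similarity, keep top-k indices."""
    order = sorted(range(len(page_vecs)),
                   key=lambda i: dot(query_vec, page_vecs[i]),
                   reverse=True)
    return order[:k]


def maxsim(query_tokens: List[Vector], page_tokens: List[Vector]) -> float:
    """Late-interaction score: for each query token, take its best-matching page token."""
    return sum(max((dot(q, p) for p in page_tokens), default=0.0)
               for q in query_tokens)


def stage2_rerank(query_tokens: List[Vector], keep_mask: List[bool],
                  page_token_vecs: List[List[Vector]], candidates: List[int]) -> List[int]:
    """Stage 2 (assumed form): rerank candidates using only the query tokens marked as kept."""
    # keep_mask is a hypothetical stand-in for the linguistic-importance filter.
    kept = [q for q, keep in zip(query_tokens, keep_mask) if keep]
    return sorted(candidates,
                  key=lambda i: maxsim(kept, page_token_vecs[i]),
                  reverse=True)


def two_stage_retrieve(query_vec: Vector, query_tokens: List[Vector], keep_mask: List[bool],
                       page_vecs: List[Vector], page_token_vecs: List[List[Vector]],
                       k: int) -> List[int]:
    """Cheap candidate retrieval followed by multi-vector reranking of those candidates."""
    candidates = stage1_candidates(query_vec, page_vecs, k)
    return stage2_rerank(query_tokens, keep_mask, page_token_vecs, candidates)
```

Under these assumptions, the expensive token-level scoring touches only the `k` candidates from the first stage and only the retained query tokens, which is the general source of the computational savings the abstract refers to.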