Visual Document Retrieval (VDR), which aims to retrieve relevant pages from large corpora of visually rich documents, is central to current multimodal retrieval applications. The state-of-the-art multi-vector paradigm achieves strong retrieval performance but incurs prohibitive overhead. Existing efficiency methods such as pruning and merging address this problem only partially, forcing a difficult trade-off between compression rate and feature fidelity. To overcome this dilemma, we introduce Prune-then-Merge, a novel two-stage framework that combines these complementary approaches. Our method first employs an adaptive pruning stage to filter out low-information patches, yielding a refined, high-signal set of embeddings. A hierarchical merging stage then compresses this pre-filtered set, summarizing semantic content without the noise-induced feature dilution seen in single-stage methods. Extensive experiments on 29 VDR datasets demonstrate that our framework consistently outperforms existing methods, significantly extending the near-lossless compression range and providing robust performance at high compression ratios.
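The two-stage idea described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the pruning criterion or the merging algorithm, so everything below is an assumption made for illustration: the per-patch importance `scores` (for example, attention weights), the `mean + k * std` threshold, the greedy agglomerative merge by cosine similarity, and all function and parameter names (`adaptive_prune`, `hierarchical_merge`, `merge_factor`) are hypothetical and not taken from the paper.

```python
import math
from statistics import mean, pstdev


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0 or nb == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def adaptive_prune(embeddings, scores, k=0.0):
    """Stage 1 (assumed rule): keep patches scoring at least mean + k * std.

    The threshold is computed per page, so the number of kept patches
    adapts to how the scores are distributed on that page.
    """
    threshold = mean(scores) + k * pstdev(scores)
    kept = [e for e, s in zip(embeddings, scores) if s >= threshold]
    if not kept:
        # Never leave a page empty: fall back to the single best patch.
        best = max(range(len(scores)), key=lambda i: scores[i])
        kept = [embeddings[best]]
    return kept


def hierarchical_merge(embeddings, merge_factor=2):
    """Stage 2 (assumed rule): greedy agglomerative merging.

    Repeatedly merges the two most cosine-similar clusters until about
    len(embeddings) / merge_factor clusters remain. Each cluster is stored
    as the sum of its unit-normalized members, and cosine similarity is
    scale-invariant, so comparing sums is the same as comparing centroids.
    """
    target = max(1, math.ceil(len(embeddings) / merge_factor))
    clusters = [l2_normalize(e) for e in embeddings]
    while len(clusters) > target:
        best_i, best_j, best_sim = 0, 1, -2.0
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = cosine(clusters[i], clusters[j])
                if sim > best_sim:
                    best_i, best_j, best_sim = i, j, sim
        merged = [x + y for x, y in zip(clusters[best_i], clusters[best_j])]
        clusters[best_i] = merged
        del clusters[best_j]
    return [l2_normalize(c) for c in clusters]


def prune_then_merge(embeddings, scores, k=0.0, merge_factor=2):
    """Run both stages: prune low-score patches, then merge the survivors."""
    return hierarchical_merge(adaptive_prune(embeddings, scores, k), merge_factor)
```

In this sketch, pruning runs first so that low-score patches never contribute to a merged centroid, which is one way to read the abstract's claim about avoiding noise-induced feature dilution. The pairwise search is cubic in the number of patches and is written for readability, not speed.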