Visual Document Retrieval (VDR), the task of retrieving relevant pages from large corpora of visually rich documents, is central to current multimodal retrieval applications. The state-of-the-art multi-vector paradigm achieves strong retrieval performance but incurs prohibitive overhead. Existing efficiency methods, such as pruning and merging, address this problem only partially and force a difficult trade-off between compression rate and feature fidelity. To resolve this dilemma, we introduce Prune-then-Merge, a two-stage framework that combines these complementary approaches. An adaptive pruning stage first filters out low-information patches, leaving a refined, high-signal set of embeddings. A hierarchical merging stage then compresses this pre-filtered set, summarizing semantic content without the noise-induced feature dilution seen in single-stage methods. Extensive experiments on 29 VDR datasets show that our framework consistently outperforms existing methods, significantly extending the near-lossless compression range and remaining robust at high compression ratios.
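To make the two-stage idea concrete, the following is a minimal sketch of a prune-then-merge pipeline over the patch embeddings of a single page. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify how patches are scored, how the pruning threshold adapts, or which merging criterion is used. Here the per-patch `scores` are supplied by the caller, the threshold is assumed to be the mean plus `k` standard deviations of that page's scores, and merging is assumed to be greedy agglomerative clustering on cosine similarity with mean-pooled centroids. All function and parameter names (`adaptive_prune`, `hierarchical_merge`, `prune_then_merge`, `k`, `merge_factor`) are hypothetical.

```python
import numpy as np


def adaptive_prune(embeddings, scores, k=0.0):
    """Keep patches whose score exceeds mean + k * std of this page's scores.

    Assumption of this sketch: the threshold adapts per page through the
    page's own score statistics; the abstract does not state the actual rule.
    """
    threshold = scores.mean() + k * scores.std()
    keep = scores > threshold
    if not keep.any():
        # Never return an empty page: fall back to the top-scoring patch.
        keep[np.argmax(scores)] = True
    return embeddings[keep]


def hierarchical_merge(embeddings, merge_factor=2):
    """Greedily merge the most similar clusters until ceil(n / merge_factor) remain.

    Assumption of this sketch: similarity is cosine similarity between
    cluster centroids, and each cluster is summarized by its mean vector.
    """
    n = len(embeddings)
    target = max(1, -(-n // merge_factor))  # ceiling division
    clusters = [[i] for i in range(n)]
    while len(clusters) > target:
        cents = np.stack([embeddings[c].mean(axis=0) for c in clusters])
        cents /= np.linalg.norm(cents, axis=1, keepdims=True) + 1e-12
        sim = cents @ cents.T
        np.fill_diagonal(sim, -np.inf)  # ignore self-similarity
        a, b = np.unravel_index(np.argmax(sim), sim.shape)
        a, b = min(a, b), max(a, b)
        clusters[a] += clusters[b]  # absorb cluster b into cluster a
        del clusters[b]
    return np.stack([embeddings[c].mean(axis=0) for c in clusters])


def prune_then_merge(embeddings, scores, k=0.0, merge_factor=2):
    """Stage 1: drop low-score patches. Stage 2: merge the survivors."""
    kept = adaptive_prune(embeddings, scores, k)
    return hierarchical_merge(kept, merge_factor)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    page = rng.normal(size=(16, 8))  # 16 patch embeddings of dimension 8
    scores = rng.random(16)          # placeholder importance scores
    compact = prune_then_merge(page, scores, k=0.0, merge_factor=2)
    print(page.shape, "->", compact.shape)
```

Because merging only sees the patches that survive pruning, low-score patches never contribute to a merged centroid, which is the intuition behind avoiding feature dilution. The greedy loop recomputes all centroid similarities at each step, so it is suitable only for illustration on small inputs.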