Harnessing the full potential of visually rich documents requires retrieval systems that understand not just text but also intricate layouts, a core challenge in Visual Document Retrieval (VDR). The prevailing multi-vector architectures, while powerful, face a crucial storage bottleneck that current optimization strategies, such as embedding merging, pruning, or abstract tokens, fail to resolve without compromising performance or discarding vital layout cues. To address this, we introduce ColParse, a novel paradigm that leverages a document parsing model to generate a small set of layout-informed sub-image embeddings, which are then fused with a global page-level vector to create a compact and structurally aware multi-vector representation. Extensive experiments demonstrate that our method reduces storage requirements by over 95% while simultaneously yielding significant performance gains across numerous benchmarks and base models. ColParse thus bridges the critical gap between the fine-grained accuracy of multi-vector retrieval and the practical demands of large-scale deployment, offering a new path towards efficient and interpretable multimodal information systems.
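To make the representation concrete, the sketch below illustrates the general idea in a hedged way: a page is represented by a handful of sub-image embeddings plus one global page-level vector (K+1 vectors instead of hundreds of patch tokens), and retrieval uses ColBERT-style late interaction (MaxSim). This is a minimal illustration of the multi-vector scoring scheme common to this line of work, not ColParse's actual parsing or fusion pipeline; all dimensions and names here are hypothetical.

```python
import numpy as np

def late_interaction_score(query_vecs, page_vecs):
    # ColBERT-style MaxSim: for each query-token vector, take the maximum
    # cosine similarity over the page's vectors, then sum over query tokens.
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    p = page_vecs / np.linalg.norm(page_vecs, axis=1, keepdims=True)
    sim = q @ p.T                      # (num_query_tokens, num_page_vecs)
    return float(sim.max(axis=1).sum())

# Hypothetical compact page representation: K layout-informed sub-image
# embeddings combined with a single global page-level vector. Random
# vectors stand in for real encoder outputs.
rng = np.random.default_rng(0)
d, K = 128, 8
sub_image_vecs = rng.standard_normal((K, d))
global_vec = rng.standard_normal((1, d))
page_vecs = np.concatenate([sub_image_vecs, global_vec], axis=0)  # (K+1, d)

query_vecs = rng.standard_normal((16, d))  # 16 query-token embeddings
score = late_interaction_score(query_vecs, page_vecs)
```

Storing K+1 vectors per page rather than one vector per visual patch token (often several hundred) is where the order-of-magnitude storage reduction claimed above comes from.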