With the rapid proliferation of multimodal information, Visual Document Retrieval (VDR) has emerged as a critical frontier in bridging the gap between unstructured, visually rich data and precise information acquisition. Unlike traditional natural image retrieval, visual documents exhibit unique characteristics defined by dense textual content, intricate layouts, and fine-grained semantic dependencies. This paper presents the first comprehensive survey of the VDR landscape, viewed specifically through the lens of the Multimodal Large Language Model (MLLM) era. We begin by examining the benchmark landscape, then delve into the methodological evolution, categorizing approaches into three primary aspects: multimodal embedding models, multimodal reranker models, and the integration of Retrieval-Augmented Generation (RAG) and agentic systems for complex document intelligence. Finally, we identify persistent challenges and outline promising research directions, aiming to provide a clear roadmap for future multimodal document intelligence.