With the rapid proliferation of multimodal information, Visual Document Retrieval (VDR) has emerged as a critical frontier in bridging the gap between unstructured, visually rich data and precise information acquisition. Unlike the natural images targeted by traditional image retrieval, visual documents are characterized by dense textual content, intricate layouts, and fine-grained semantic dependencies. This paper presents the first comprehensive survey of the VDR landscape, specifically through the lens of the Multimodal Large Language Model (MLLM) era. We begin by examining the benchmark landscape and then trace the methodological evolution, grouping approaches into three categories: multimodal embedding models, multimodal reranker models, and the integration of Retrieval-Augmented Generation (RAG) and agentic systems for complex document intelligence. Finally, we identify persistent challenges and outline promising future directions, aiming to provide a clear roadmap for multimodal document intelligence.