Document parsing (DP) transforms unstructured or semi-structured documents into structured, machine-readable representations, enabling downstream applications such as knowledge base construction and retrieval-augmented generation (RAG). This survey provides a comprehensive and timely review of document parsing research. We propose a systematic taxonomy that organizes existing approaches into modular pipeline-based systems and unified models driven by Vision-Language Models (VLMs). We provide a detailed review of key components in pipeline systems, including layout analysis and the recognition of heterogeneous content such as text, tables, mathematical expressions, and visual elements, and then systematically track the evolution of specialized VLMs for document parsing. Additionally, we summarize widely adopted evaluation metrics and high-quality benchmarks that establish current standards for parsing quality. Finally, we discuss key open challenges, including robustness to complex layouts, reliability of VLM-based parsing, and inference efficiency, and outline directions for building more accurate and scalable document intelligence systems.