Document content extraction is crucial in computer vision, especially for meeting the high-quality data needs of large language models (LLMs) and retrieval-augmented generation (RAG) technologies. However, current document parsing methods suffer from significant limitations in diversity and comprehensive evaluation. To address these challenges, we introduce OmniDocBench, a novel multi-source benchmark designed to advance automated document content extraction. OmniDocBench includes a meticulously curated and annotated high-quality evaluation dataset comprising nine diverse document types, including academic papers, textbooks, and slides. Our benchmark provides a flexible and comprehensive evaluation framework with 19 layout category labels and 14 attribute labels, enabling multi-level assessments across entire datasets, individual modules, or specific data types. Using OmniDocBench, we perform an exhaustive comparative analysis of existing modular pipelines and multimodal end-to-end methods, highlighting their limitations in handling document diversity and ensuring fair evaluation. OmniDocBench establishes a robust, diverse, and fair evaluation standard for document content extraction, offering crucial insights for future advancements and fostering the development of document parsing technologies. The code and dataset are available at https://github.com/opendatalab/OmniDocBench.