Document Structured Extraction (DSE) aims to extract structured content from raw documents. Despite the emergence of numerous DSE systems, their unified evaluation remains inadequate, which significantly hinders progress in the field. This problem is largely attributable to existing benchmark paradigms, which are fragmented and localized. To address these limitations and offer a thorough evaluation of DSE systems, we introduce a novel benchmark named READoc, which defines DSE as a realistic task of converting unstructured PDFs into semantically rich Markdown. The READoc dataset is derived from 2,233 diverse, real-world documents from arXiv and GitHub. In addition, we develop a DSE Evaluation S$^3$uite, comprising Standardization, Segmentation, and Scoring modules, to conduct a unified evaluation of state-of-the-art DSE approaches. By evaluating a range of pipeline tools, expert visual models, and general VLMs, we identify, for the first time, the gap between current work and the unified, realistic DSE objective. We hope that READoc will catalyze future research in DSE, fostering more comprehensive and practical solutions.