Reliably extracting tables from PDFs is essential for large-scale scientific data mining and knowledge base construction, yet existing evaluation approaches rely on rule-based metrics that fail to capture the semantic equivalence of table content. We present a benchmarking framework based on synthetically generated PDFs with precise LaTeX ground truth, using tables sourced from arXiv to ensure realistic complexity and diversity. As our central methodological contribution, we apply LLM-as-a-judge to semantic table evaluation and integrate it into a matching pipeline that accommodates inconsistencies in parser outputs. In a human validation study comprising over 1,500 quality judgments on extracted table pairs, LLM-based evaluation correlates substantially better with human judgment (Pearson r=0.93) than Tree Edit Distance-based Similarity (TEDS, r=0.68) or Grid Table Similarity (GriTS, r=0.70). Evaluating 21 contemporary PDF parsers on 100 synthetic documents containing 451 tables reveals significant performance disparities. Our results offer practical guidance for selecting parsers for tabular data extraction and establish a reproducible, scalable evaluation methodology for this critical task.
Code and data: https://github.com/phorn1/pdf-parse-bench
Metric study and human evaluation: https://github.com/phorn1/table-metric-study
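The agreement figures quoted in the abstract are Pearson correlations between a metric's scores and human quality judgments over the same table pairs. The following is a minimal sketch of that computation, not the authors' evaluation code: the function name `pearson_r` and all score lists are hypothetical placeholders invented for illustration, and the study's actual data and scripts are in the repositories linked above.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Placeholder values only (assumption): one human rating and one score per
# metric for each extracted table pair. These are NOT data from the study.
human_ratings = [1, 3, 5, 7, 9]
metric_a_scores = [2, 4, 6, 8, 10]   # tracks the human ratings exactly
metric_b_scores = [5, 1, 4, 2, 3]    # tracks them poorly

for name, scores in [("metric A", metric_a_scores), ("metric B", metric_b_scores)]:
    print(f"{name}: r = {pearson_r(human_ratings, scores):.2f}")
```

A metric whose scores rise and fall with the human ratings yields r close to 1, which is the sense in which the LLM-based evaluation is reported to agree more closely with human judgment than TEDS or GriTS.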