Correctly parsing mathematical formulas from PDFs is critical for training large language models and building scientific knowledge bases from academic literature, yet existing benchmarks either exclude formulas entirely or lack semantically aware evaluation metrics. We introduce a benchmarking framework centered on synthetically generated PDFs with precise LaTeX ground truth, enabling systematic control over layout, formula, and content characteristics. For evaluation, we apply LLM-as-a-judge to assess the semantic equivalence of parsed formulas, capturing mathematical meaning beyond surface-level notation differences. We validate this approach through a human study (250 formula pairs, 750 ratings from 30 evaluators), which shows a Pearson correlation of r=0.78 with human judgment, compared to r=0.34 for character-level matching (CDM) and r≈0 for text similarity. A two-stage matching pipeline that combines LLM-based extraction with fuzzy validation aligns parsed formulas with ground truth despite format inconsistencies across parsers. Evaluating 20+ contemporary PDF parsers on 100 synthetic documents containing 2,000+ formulas reveals significant performance disparities and provides actionable guidance for practitioners selecting parsers for downstream applications. Code and benchmark data: https://github.com/phorn1/pdf-parse-bench and https://github.com/phorn1/formula-metric-study
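To illustrate what the fuzzy-validation stage of a two-stage matching pipeline might look like, here is a minimal sketch. It is not the implementation from the linked repositories: the function names `normalize` and `fuzzy_validate`, the delimiter list, and the default threshold are all assumptions made for illustration, and the first stage (LLM-based extraction) is represented only by a candidate string that such a model might return. The idea is to accept a candidate formula only if it appears, exactly or approximately, in the parser's output.

```python
from difflib import SequenceMatcher

# Hypothetical list of math delimiters a parser might wrap formulas in.
DELIMITERS = [("$$", "$$"), ("\\[", "\\]"), ("\\(", "\\)"), ("$", "$")]


def normalize(latex: str) -> str:
    """Drop one pair of outer math delimiters and all whitespace."""
    s = latex.strip()
    for opening, closing in DELIMITERS:
        if (
            len(s) >= len(opening) + len(closing)
            and s.startswith(opening)
            and s.endswith(closing)
        ):
            s = s[len(opening):len(s) - len(closing)]
            break
    return "".join(s.split())


def fuzzy_validate(candidate: str, parsed_text: str, threshold: float = 0.9) -> bool:
    """Return True if `candidate` occurs, at least approximately, in `parsed_text`.

    `candidate` stands in for a formula returned by an extraction step
    (assumed here); `threshold` is an illustrative similarity cutoff.
    """
    cand = normalize(candidate)
    text = "".join(parsed_text.split())
    if not cand:
        return False
    if cand in text:
        return True  # exact match after normalization

    # Slide a window of the candidate's length over the text and keep
    # the best similarity ratio found.
    window = len(cand)
    best = 0.0
    for start in range(max(1, len(text) - window + 1)):
        ratio = SequenceMatcher(None, cand, text[start:start + window]).ratio()
        best = max(best, ratio)
    return best >= threshold


if __name__ == "__main__":
    parsed = "The energy is $E = mc^2$ as shown."
    print(fuzzy_validate("$$E=mc^2$$", parsed))
    print(fuzzy_validate("\\frac{a}{b}", parsed))
```

A check of this kind guards against an extraction step returning a formula that is not actually present in the parser output. The sliding-window comparison is quadratic in the worst case, so a real pipeline would likely restrict the search region or use a dedicated fuzzy-matching library.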