ExtractBench: A Benchmark and Evaluation Methodology for Complex Structured Extraction

Unstructured documents like PDFs contain valuable structured information, but downstream systems require this data in reliable, standardized formats. LLMs are increasingly deployed to automate this extraction, making accuracy and reliability paramount. However, progress is bottlenecked by two gaps. First, no end-to-end benchmark evaluates PDF-to-JSON extraction under enterprise-scale schema breadth. Second, no principled methodology captures the semantics of nested extraction, where fields demand different notions of correctness (exact match for identifiers, tolerance for quantities, semantic equivalence for names), arrays require alignment, and omission must be distinguished from hallucination. We address both gaps with ExtractBench, an open-source benchmark and evaluation framework for PDF-to-JSON structured extraction. The benchmark pairs 35 PDF documents with JSON Schemas and human-annotated gold labels across economically valuable domains, yielding 12,867 evaluatable fields spanning schema complexities from tens to hundreds of fields. The evaluation framework treats the schema as an executable specification: each field declares its scoring metric. Baseline evaluations reveal that frontier models (GPT-5/5.2, Gemini-3 Flash/Pro, Claude 4.5 Opus/Sonnet) remain unreliable on realistic schemas. Performance degrades sharply with schema breadth, culminating in 0% valid output on a 369-field financial reporting schema across all tested models. We release ExtractBench at https://github.com/ContextualAI/extract-bench.
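To make the idea of a schema acting as an executable specification concrete, the following is a minimal sketch in which each field declares its own scoring metric. It is not ExtractBench's actual API: the `x-metric` and `x-rel-tol` keywords, the field names, and the `score_field` and `score_record` helpers are all hypothetical, and whitespace and case normalization is only a crude stand-in for semantic equivalence.

```python
import math

# Hypothetical schema: "x-metric" and "x-rel-tol" are assumed keywords for
# illustration only, not ExtractBench's real annotation syntax.
SCHEMA = {
    "type": "object",
    "properties": {
        "invoice_id": {"type": "string", "x-metric": "exact"},
        "total": {"type": "number", "x-metric": "tolerance", "x-rel-tol": 0.01},
        "vendor": {"type": "string", "x-metric": "normalized"},
        "po_number": {"type": "string", "x-metric": "exact"},
    },
}


def _normalize(value):
    """Lowercase and collapse whitespace (crude proxy for semantic equivalence)."""
    return " ".join(str(value).lower().split())


def score_field(spec, gold, pred):
    """Return 'correct', 'incorrect', 'omitted', or 'hallucinated' for one field."""
    if gold is None and pred is None:
        return "correct"       # both agree the field is absent
    if pred is None:
        return "omitted"       # gold has a value, the model returned nothing
    if gold is None:
        return "hallucinated"  # the model produced a value with no gold support

    metric = spec.get("x-metric", "exact")
    if metric == "exact":
        ok = gold == pred
    elif metric == "tolerance":
        numeric = all(isinstance(v, (int, float)) for v in (gold, pred))
        ok = numeric and math.isclose(gold, pred, rel_tol=spec.get("x-rel-tol", 0.0))
    elif metric == "normalized":
        ok = _normalize(gold) == _normalize(pred)
    else:
        raise ValueError(f"unknown metric: {metric}")
    return "correct" if ok else "incorrect"


def score_record(schema, gold, pred):
    """Score every top-level field with the metric that the schema declares for it."""
    return {
        name: score_field(spec, gold.get(name), pred.get(name))
        for name, spec in schema["properties"].items()
    }


gold = {"invoice_id": "INV-001", "total": 1200.0, "vendor": "Acme Corp", "po_number": None}
pred = {"invoice_id": "INV-001", "total": 1205.0, "vendor": "  acme   corp", "po_number": "PO-9"}
result = score_record(SCHEMA, gold, pred)
```

In this sketch the identifier must match exactly, the total passes because it falls within the declared 1% relative tolerance, the vendor name passes after normalization, and the spurious `po_number` is reported as a hallucination rather than as an ordinary mismatch. Nested objects and array alignment, which the framework also addresses, are left out for brevity.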
