AI agents are changing the requirements for document parsing. What matters is \emph{semantic correctness}: parsed output must preserve the structure and meaning needed for autonomous decisions, including correct table structure, precise chart data, semantically meaningful formatting, and visual grounding. Existing benchmarks do not fully capture this setting for enterprise automation: they rely on narrow document distributions and text-similarity metrics that miss agent-critical failures. We introduce \textbf{ParseBench}, a benchmark of ${\sim}2{,}000$ human-verified pages from enterprise documents spanning insurance, finance, and government, organized around five capability dimensions: tables, charts, content faithfulness, semantic formatting, and visual grounding. Evaluating 14 methods spanning vision-language models, specialized document parsers, and LlamaParse, the benchmark reveals a fragmented capability landscape: no method is consistently strong across all five dimensions. LlamaParse Agentic achieves the highest overall score at \agenticoverall\%, and the benchmark highlights the remaining capability gaps across current systems. Dataset and evaluation code are available at https://huggingface.co/datasets/llamaindex/ParseBench and https://github.com/run-llama/ParseBench.