DTBench: A Synthetic Benchmark for Document-to-Table Extraction

Document-to-table (Doc2Table) extraction derives structured tables from unstructured documents under a target schema, enabling reliable and verifiable SQL-based data analytics. Although large language models (LLMs) have shown promise in flexible information extraction, their ability to produce precisely structured tables remains poorly understood, particularly for indirect extraction, which requires complex capabilities such as reasoning and conflict resolution. Existing benchmarks neither explicitly distinguish nor comprehensively cover the diverse capabilities that Doc2Table extraction requires. We argue that a capability-aware benchmark is essential for systematic evaluation. However, constructing such a benchmark from human-annotated document-table pairs is costly, difficult to scale, and limited in capability coverage. To address this, we adopt a reverse Table2Doc paradigm and design a multi-agent synthesis workflow that generates documents from ground-truth tables. Based on this approach, we present DTBench, a synthetic benchmark organized around our proposed two-level taxonomy of Doc2Table capabilities, which covers 5 major categories and 13 subcategories. Evaluating several mainstream LLMs on DTBench reveals substantial performance gaps across models, as well as persistent challenges in reasoning, faithfulness, and conflict resolution. DTBench provides a comprehensive testbed for data generation and evaluation, facilitating future research on Doc2Table extraction. The benchmark is publicly available at https://github.com/ZJU-DAILY/DTBench.
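To make the task concrete, the following is a minimal sketch of one Doc2Table instance, assuming a toy schema and a toy document. The table name `revenue`, the example data, and the helpers `cell_accuracy` and `total_revenue` are hypothetical and are not taken from DTBench or its evaluation code. The sketch shows a ground-truth table, a document in which one value is stated only indirectly, and how an extracted table can then be queried with SQL.

```python
import sqlite3

# Hypothetical ground-truth table under the assumed schema
# (company TEXT, year INTEGER, revenue_musd REAL). Not from DTBench.
GROUND_TRUTH = [
    ("Acme", 2022, 120.0),
    ("Acme", 2023, 150.0),
]

# Hypothetical document: the 2023 figure is stated only indirectly,
# so a correct extraction needs one reasoning step (120 * 1.25 = 150).
DOCUMENT = (
    "Acme reported revenue of 120 million USD in 2022. "
    "In 2023 its revenue grew by 25% over the previous year."
)

def cell_accuracy(predicted, truth):
    """Fraction of ground-truth cells reproduced exactly, compared row by row."""
    total = sum(len(row) for row in truth)
    correct = 0
    for pred_row, true_row in zip(predicted, truth):
        correct += sum(p == t for p, t in zip(pred_row, true_row))
    return correct / total

def total_revenue(rows):
    """Load rows into an in-memory SQLite table and run a SQL aggregate."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE revenue (company TEXT, year INTEGER, revenue_musd REAL)"
    )
    conn.executemany("INSERT INTO revenue VALUES (?, ?, ?)", rows)
    (total,) = conn.execute("SELECT SUM(revenue_musd) FROM revenue").fetchone()
    conn.close()
    return total

# Stand-in for a model output that copied the percentage
# instead of computing the 2023 revenue.
naive_prediction = [("Acme", 2022, 120.0), ("Acme", 2023, 25.0)]

if __name__ == "__main__":
    print(cell_accuracy(naive_prediction, GROUND_TRUTH))
    print(total_revenue(GROUND_TRUTH), total_revenue(naive_prediction))
```

A single wrong cell changes the SQL aggregate, which is why exact table structure and cell values matter for downstream analytics. Exact cell matching is used here only for illustration; the benchmark's actual metrics may differ.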
