DTBench: A Synthetic Benchmark for Document-to-Table Extraction

Document-to-table (Doc2Table) extraction derives structured tables from unstructured documents under a target schema, enabling reliable and verifiable SQL-based data analytics. Although large language models (LLMs) have shown promise in flexible information extraction, their ability to produce precisely structured tables remains insufficiently understood, particularly for indirect extraction that requires complex capabilities such as reasoning and conflict resolution. Existing benchmarks neither explicitly distinguish nor comprehensively cover the diverse capabilities that Doc2Table extraction requires. We argue that a capability-aware benchmark is essential for systematic evaluation. However, constructing such a benchmark from human-annotated document-table pairs is costly, difficult to scale, and limited in capability coverage. To address this, we adopt a reverse Table2Doc paradigm and design a multi-agent synthesis workflow that generates documents from ground-truth tables. Based on this approach, we present DTBench, a synthetic benchmark built on our proposed two-level taxonomy of Doc2Table capabilities, which covers 5 major categories and 13 subcategories. We evaluate several mainstream LLMs on DTBench and find substantial performance gaps across models, as well as persistent challenges in reasoning, faithfulness, and conflict resolution. DTBench provides a comprehensive testbed for data generation and evaluation, facilitating future research on Doc2Table extraction. The benchmark is publicly available at https://github.com/ZJU-DAILY/DTBench.
