Enterprise documents, such as forms and reports, embed critical information for downstream applications like data archiving, automated workflows, and analytics. Although generalist Vision Language Models (VLMs) perform well on established document understanding benchmarks, their ability to conduct holistic, fine-grained structured extraction across diverse document types and flexible schemas is not well studied. Existing Key Entity Extraction (KEE), Relation Extraction (RE), and Visual Question Answering (VQA) datasets are limited by narrow entity ontologies, simple queries, or homogeneous document types, often overlooking the need for adaptable and structured extraction. To address these gaps, we introduce ExStrucTiny, a new benchmark dataset for structured Information Extraction (IE) from document images, unifying aspects of KEE, RE, and VQA. Built through a novel pipeline combining manual and synthetic human-validated samples, ExStrucTiny covers more varied document types and extraction scenarios. We analyze open and closed VLMs on this benchmark, highlighting challenges such as schema adaptation, query under-specification, and answer localization. We hope our work provides a bedrock for improving generalist models for structured IE in documents.