Understanding visually rich business documents to extract structured data and automate business workflows has received attention in both academia and industry. Although recent multi-modal language models have achieved impressive results, we find that existing benchmarks do not reflect the complexity of real documents seen in industry. In this work, we identify the desiderata for a more comprehensive benchmark and propose one we call Visually Rich Document Understanding (VRDU). VRDU contains two datasets that represent several challenges: rich schemas that include diverse data types as well as hierarchical entities, complex templates that include tables and multi-column layouts, and a diversity of layouts (templates) within a single document type. We define few-shot and conventional experiment settings, together with a carefully designed matching algorithm for evaluating extraction results. We report the performance of strong baselines and offer three observations: (1) generalizing to new document templates is still very challenging, (2) few-shot performance has a lot of headroom, and (3) models struggle with hierarchical fields such as line items in an invoice. We plan to open source the benchmark and the evaluation toolkit. We hope this helps the community make progress on the challenging task of extracting structured data from visually rich documents.
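To make the idea of matching extracted values against ground truth concrete, the following is a minimal illustrative sketch and not the VRDU matching algorithm itself. It assumes a type-aware normalization step followed by a micro-averaged F1 over (field, value) pairs. The function names, the `"price"` type label, and the normalization rules are all hypothetical.

```python
from collections import Counter


def normalize(value: str, kind: str) -> str:
    """Hypothetical type-aware normalization; these rules are assumptions."""
    text = value.strip()
    if kind == "price":
        # Keep digits and the decimal point only: "$1,200.00" -> "1200.00".
        return "".join(ch for ch in text if ch.isdigit() or ch == ".")
    # Default: case-insensitive comparison with collapsed whitespace.
    return " ".join(text.lower().split())


def micro_f1(predicted, gold, field_types):
    """Score one document; predicted and gold are lists of (field, value) pairs."""
    def as_counter(pairs):
        return Counter(
            (name, normalize(value, field_types.get(name, "string")))
            for name, value in pairs
        )

    pred, ref = as_counter(predicted), as_counter(gold)
    # Multiset intersection counts each gold entity as matched at most once.
    true_positives = sum((pred & ref).values())
    precision = true_positives / sum(pred.values()) if pred else 0.0
    recall = true_positives / sum(ref.values()) if ref else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


predicted = [("total", "$1,200.00"), ("vendor", "ACME  Corp"), ("date", "2020-01-01")]
gold = [("total", "1200.00"), ("vendor", "acme corp"), ("date", "2020-01-02")]
score = micro_f1(predicted, gold, {"total": "price"})
```

In this toy example the total and vendor match after normalization while the date does not, so two of the three predicted pairs count as correct. Hierarchical fields such as invoice line items would need a grouping step before comparison, which this sketch deliberately omits.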