Existing methods for Visual Information Extraction (VIE) from form-like documents typically split the process into separate subtasks, such as key information extraction, key-value pair extraction, and choice group extraction. These approaches often overlook the hierarchical structure of form documents, including hierarchical key-value pairs and hierarchical choice groups. To address these limitations, we reframe VIE as a relation prediction problem and unify the labels of the different tasks into a single label space. This unified formulation allows various relation types to be defined and handles hierarchical relationships in form-like documents. In line with this perspective, we present UniVIE, a unified model that addresses the VIE problem comprehensively. UniVIE follows a coarse-to-fine strategy: a tree proposal network first generates tree proposals, which a relation decoder module then refines into hierarchical trees. To enhance the relation prediction capabilities of UniVIE, we incorporate two novel tree constraints into the relation decoder: a tree attention mask and a tree level embedding. Extensive experiments on our in-house dataset HierForms and the publicly available dataset SIBR show that our method achieves state-of-the-art results, underscoring the effectiveness and potential of our unified approach for VIE.
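To make the two tree constraints more concrete, the following is a minimal illustrative sketch, not the UniVIE implementation. The abstract does not specify how the tree attention mask or the tree level embedding is constructed, so the sketch assumes that the mask restricts attention to elements within the same tree proposal and that the level fed to the embedding is a node's depth in its tree. The names `RELATION_LABELS`, `tree_levels`, and `tree_attention_mask`, as well as the toy form, are hypothetical.

```python
from typing import List, Optional

# Hypothetical unified relation label space; the abstract does not list
# the actual labels used by the model.
RELATION_LABELS = ("none", "key-value", "choice-group", "parent-child")


def tree_levels(parent: List[Optional[int]]) -> List[int]:
    """Return the depth of each node, given parent indices (None marks a root).

    Assumption: a depth like this would index a learned tree level embedding.
    """
    levels = []
    for node in range(len(parent)):
        depth, cur = 0, parent[node]
        while cur is not None:  # walk up to the root, counting edges
            depth += 1
            cur = parent[cur]
        levels.append(depth)
    return levels


def tree_attention_mask(tree_id: List[int]) -> List[List[bool]]:
    """Return mask[i][j] == True when element i may attend to element j.

    Assumption: attention is allowed only inside the same tree proposal.
    """
    n = len(tree_id)
    return [[tree_id[i] == tree_id[j] for j in range(n)] for i in range(n)]


# Toy form with five text elements in two tree proposals:
#   tree 0: "Applicant" -> "Name" -> "Alice"
#   tree 1: "Gender" -> "Female"
parent = [None, 0, 1, None, 3]
tree_id = [0, 0, 0, 1, 1]

levels = tree_levels(parent)
mask = tree_attention_mask(tree_id)
```

In this toy example, "Alice" sits at level 2 of the first tree and can attend to "Applicant" but not to "Gender", which belongs to a different proposal.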