Key information extraction (KIE) from visually rich documents (VRD) has been a challenging task in document intelligence, not only because the complicated and diverse layouts of VRD make models hard to generalize, but also because of the lack of methods that exploit the multimodal features in VRD. In this paper, we propose a lightweight model named GraphRevisedIE that effectively embeds multimodal features such as textual, visual, and layout features from VRD and leverages graph revision and graph convolution to enrich the multimodal embedding with global context. Extensive experiments on multiple real-world datasets show that GraphRevisedIE generalizes to documents of varied layouts and achieves comparable or better performance than previous KIE methods. We also publish a business license dataset that contains both real-life and synthesized documents to facilitate research on document KIE.
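The core idea of enriching node embeddings via graph revision and graph convolution can be illustrated with a minimal, hypothetical sketch. The function names (`revise_graph`, `graph_convolve`), the similarity-based revision rule, and all hyperparameters below are illustrative assumptions, not the paper's actual architecture:

```python
import numpy as np

def revise_graph(node_feats, base_adj, alpha=0.5):
    # Hypothetical graph revision: blend a fixed layout adjacency with a
    # similarity-based adjacency computed from the node features themselves,
    # so edges can be added or reweighted beyond the initial layout graph.
    sim = node_feats @ node_feats.T
    sim = np.exp(sim - sim.max(axis=1, keepdims=True))
    sim /= sim.sum(axis=1, keepdims=True)  # row-wise softmax normalization
    return alpha * base_adj + (1 - alpha) * sim

def graph_convolve(node_feats, adj, weight):
    # One graph-convolution step: mean-aggregate neighbor features over the
    # (revised) adjacency, then apply a learned projection and nonlinearity.
    deg = adj.sum(axis=1, keepdims=True)
    agg = (adj @ node_feats) / np.maximum(deg, 1e-8)
    return np.tanh(agg @ weight)

rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 8))  # 4 text segments, 8-dim multimodal embeddings
adj = np.eye(4)                  # trivial initial layout graph (self-loops only)
w = rng.normal(size=(8, 8))      # projection weight (would be learned)

revised = revise_graph(feats, adj)
enriched = graph_convolve(feats, revised, w)
print(enriched.shape)  # (4, 8): each node embedding now carries global context
```

In this toy setup, each text segment's embedding is updated with information from every other segment via the revised adjacency, which is how graph convolution injects global document context into otherwise local multimodal features.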