XFormParser: A Simple and Effective Multimodal Multilingual Semi-structured Form Parser

Semi-structured form parsing is a core task in document AI. It builds on key information extraction (KIE) techniques and handles inputs ranging from plain text to complex multimodal data that combine images and structural layouts. Pre-trained multimodal models have advanced the extraction of key information from form documents in formats such as PDFs and images. Nonetheless, form parsing still faces notable challenges, including weak multilingual parsing and reduced recall in text-rich and visually rich contexts. In this work, we introduce a simple but effective \textbf{M}ultimodal and \textbf{M}ultilingual semi-structured \textbf{FORM} \textbf{PARSER} (\textbf{XFormParser}). It is built on a comprehensive pre-trained language model and combines semantic entity recognition (SER) and relation extraction (RE) in a unified framework, trained with a novel staged warm-up approach that uses soft labels to improve form parsing accuracy without increasing inference overhead. Furthermore, we develop a new benchmark dataset, InDFormBench, which targets the parsing of multilingual forms in various industrial contexts. In experiments on established multilingual benchmarks and InDFormBench, XFormParser surpasses state-of-the-art (SOTA) models on RE tasks in language-specific settings, improving the F1 score by up to 1.79\%. Compared with existing SOTA models, our framework also shows improved performance across tasks in both multi-language and zero-shot settings. The code is publicly available at https://github.com/zhbuaa0/layoutlmft.
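
For readers unfamiliar with how RE results such as the F1 figure above are typically scored, the following is a minimal sketch of precision, recall, and F1 over predicted key-value links. It is an illustration only, not the evaluation script from the released code: the function name `relation_f1`, the `(key_id, value_id)` link encoding, and the toy data are all assumptions introduced here.

```python
from typing import Iterable, Tuple

# Hypothetical encoding: a relation is a (key_entity_id, value_entity_id) pair.
Link = Tuple[int, int]


def relation_f1(gold: Iterable[Link], pred: Iterable[Link]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for predicted links against gold links."""
    gold_set, pred_set = set(gold), set(pred)
    true_positives = len(gold_set & pred_set)  # links that match exactly

    # Guard against empty sets to avoid division by zero.
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example (made-up data): 4 gold links, 3 predictions, 2 of them correct.
gold_links = [(0, 1), (2, 3), (4, 5), (6, 7)]
pred_links = [(0, 1), (2, 3), (4, 9)]
precision, recall, f1 = relation_f1(gold_links, pred_links)
print(f"P={precision:.4f} R={recall:.4f} F1={f1:.4f}")
```

Under this exact-match assumption, a link counts as correct only when both of its endpoints agree with the gold annotation, so low recall in dense forms lowers F1 even when precision stays high.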
