Key Information Extraction (KIE) is a challenging multimodal task that aims to extract structured value semantic entities from visually rich documents. Despite significant progress, two major challenges remain. First, existing datasets have relatively fixed layouts and a limited number of semantic entity categories, which leaves a significant gap between these datasets and complex real-world scenarios. Second, existing methods follow a two-stage pipeline strategy, which may lead to error propagation, and they are difficult to apply when unseen semantic entity categories emerge. To address the first challenge, we propose a new large-scale human-annotated dataset named Complex Layout form for key information EXtraction (CLEX), which consists of 5,860 images with 1,162 semantic entity categories. To address the second challenge, we introduce the Parallel Pointer-based Network (PPN), an end-to-end model that can be applied in zero-shot and few-shot scenarios. PPN leverages the implicit clues between semantic entities to assist extraction, and its parallel extraction mechanism allows it to extract multiple results simultaneously and efficiently. Experiments on the CLEX dataset demonstrate that PPN outperforms existing state-of-the-art methods while also offering much faster inference.