Named entity recognition in visually-rich documents (VrD-NER) plays a critical role in many real-world scenarios and applications. However, research in VrD-NER faces three major challenges: complex document layouts, incorrect reading orders, and unsuitable task formulations. To address these challenges, we propose a query-aware entity extraction head, UNER, that collaborates with existing multi-modal document transformers to build more robust VrD-NER models. The UNER head formulates VrD-NER as a combination of sequence labeling and reading-order prediction, effectively handling discontinuous entities in documents. Experimental evaluations on diverse datasets demonstrate the effectiveness of UNER in improving entity extraction performance. Moreover, the UNER head enables a supervised pre-training stage on multiple VrD-NER datasets to enhance the document transformer backbone, and it exhibits substantial knowledge transfer from the pre-training stage to the fine-tuning stage. By incorporating universal layout understanding, a pre-trained UNER-based model demonstrates significant advantages in few-shot and cross-lingual scenarios and exhibits zero-shot entity extraction abilities.
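Since the abstract only describes the task formulation at a high level, the following is a minimal illustrative sketch (not the paper's actual architecture) of a head that combines token-level sequence labeling with pairwise reading-order prediction on top of a generic multi-modal document-transformer backbone. All class and argument names here are assumptions made for illustration.

```python
import torch
import torch.nn as nn


class EntityExtractionHead(nn.Module):
    """Sketch of a head combining sequence labeling and reading-order prediction."""

    def __init__(self, hidden_size: int, num_tags: int):
        super().__init__()
        # Sequence-labeling branch: one tag (e.g. BIO) per token.
        self.tag_classifier = nn.Linear(hidden_size, num_tags)
        # Reading-order branch: scores whether token j directly follows token i.
        self.order_query = nn.Linear(hidden_size, hidden_size)
        self.order_key = nn.Linear(hidden_size, hidden_size)

    def forward(self, token_states: torch.Tensor):
        # token_states: (batch, seq_len, hidden) from the document transformer backbone.
        tag_logits = self.tag_classifier(token_states)            # (B, L, num_tags)
        q = self.order_query(token_states)                        # (B, L, H)
        k = self.order_key(token_states)                          # (B, L, H)
        # link_logits[b, i, j]: score that token j is the successor of token i in
        # reading order; decoding along these links can stitch together
        # discontinuous entity spans that a flat tagging scheme would miss.
        link_logits = torch.matmul(q, k.transpose(1, 2)) / q.size(-1) ** 0.5
        return tag_logits, link_logits
```

In such a formulation, decoding would first follow the highest-scoring successor links to recover a reading order, then group the labeled tokens into entities, which is one plausible way to handle the discontinuous entities mentioned above.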