Visually-Rich Document Entity Retrieval (VDER) is a machine learning task that aims to recover, for each entity of interest, the corresponding text spans in a document. VDER has gained significant attention in recent years thanks to its broad applications in enterprise AI. Unfortunately, because document images often contain personally identifiable information (PII), publicly available data are scarce, owing both to privacy constraints and to the cost of acquiring annotations. To make matters worse, each dataset often defines its own set of entities, and the non-overlapping entity spaces between datasets make it difficult to transfer knowledge across documents. In this paper, we propose a method to collect massive-scale, noisy, weakly labeled data from the web to benefit the training of VDER models. The method generates a large amount of document image data to compensate for the lack of training data in many VDER settings. Moreover, the collected dataset, named DocuNet, does not depend on specific document types or entity sets, making it applicable to all VDER tasks. Empowered by DocuNet, we present a lightweight multimodal architecture named UniFormer, which learns a unified representation from text, layout, and image crops without needing extra visual pre-training. We evaluate our methods with popular VDER models in various settings and show improvements when this massive dataset is combined with UniFormer, in both classic entity retrieval and few-shot learning settings.