The recent advent of self-supervised pre-training techniques has led to a surge in the use of multimodal learning in form document understanding. However, existing approaches that extend masked language modeling to other modalities require careful multi-task tuning, complex reconstruction target designs, or additional pre-training data. In FormNetV2, we introduce a centralized multimodal graph contrastive learning strategy that unifies self-supervised pre-training for all modalities in a single loss. The graph contrastive objective maximizes the agreement of multimodal representations, providing a natural interplay among all modalities without special customization. In addition, we extract image features within the bounding box that joins a pair of tokens connected by a graph edge, capturing more targeted visual cues without loading a sophisticated, separately pre-trained image embedder. FormNetV2 establishes new state-of-the-art performance on the FUNSD, CORD, SROIE, and Payment benchmarks with a more compact model size.
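As an illustration only, the two ideas above can be sketched in a few lines of NumPy. The abstract does not give the exact loss or the feature extractor, so everything here is an assumption: `union_box` is a hypothetical helper that computes the box joining two token boxes, and `nt_xent_loss` is a generic normalized-temperature contrastive loss (in the style of SimCLR) standing in for the graph contrastive objective, applied to node embeddings from two corrupted views of the same graph.

```python
import numpy as np


def union_box(box_a, box_b):
    """Smallest axis-aligned box (x0, y0, x1, y1) covering both token boxes.

    Hypothetical helper: the region from which edge-level image features
    could be cropped for a pair of tokens joined by a graph edge.
    """
    return (
        min(box_a[0], box_b[0]),
        min(box_a[1], box_b[1]),
        max(box_a[2], box_b[2]),
        max(box_a[3], box_b[3]),
    )


def nt_xent_loss(z1, z2, temperature=0.1):
    """Generic NT-Xent contrastive loss between two views (an assumed form).

    z1[i] and z2[i] are embeddings of the same node in two corrupted views;
    every other row in the batch acts as a negative.
    """
    n = z1.shape[0]
    z = np.concatenate([z1, z2], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # cosine similarity
    sim = (z @ z.T) / temperature
    np.fill_diagonal(sim, -np.inf)  # a row is never its own positive

    # Row i's positive is the same node in the other view.
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])

    # Numerically stable log-softmax over each row.
    sim_max = sim.max(axis=1, keepdims=True)
    log_prob = sim - sim_max - np.log(
        np.exp(sim - sim_max).sum(axis=1, keepdims=True)
    )
    return float(-log_prob[np.arange(2 * n), pos].mean())


# Usage with made-up token boxes and random stand-in embeddings.
edge_region = union_box((10, 20, 50, 40), (60, 22, 120, 42))

rng = np.random.default_rng(0)
view_a = rng.normal(size=(8, 16))
view_b = view_a + 0.01 * rng.normal(size=(8, 16))  # lightly perturbed view
aligned = nt_xent_loss(view_a, view_b)
mismatched = nt_xent_loss(view_a, view_b[::-1])  # wrong node pairing
```

In this sketch the loss is lower when the two views agree on each node than when nodes are paired incorrectly, which is the sense in which a contrastive objective "maximizes agreement" between representations.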