The recent advent of self-supervised pre-training techniques has led to a surge in the use of multimodal learning in form document understanding. However, existing approaches that extend masked language modeling to other modalities require careful multi-task tuning, complex reconstruction target designs, or additional pre-training data. In FormNetV2, we introduce a centralized multimodal graph contrastive learning strategy that unifies self-supervised pre-training for all modalities in a single loss. The graph contrastive objective maximizes the agreement of multimodal representations, providing a natural interplay among all modalities without special customization. In addition, we extract image features within the bounding box that joins a pair of tokens connected by a graph edge, capturing more targeted visual cues without loading a sophisticated and separately pre-trained image embedder. FormNetV2 establishes new state-of-the-art performance on the FUNSD, CORD, SROIE, and Payment benchmarks with a more compact model size.
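To make the two ideas above concrete, the following is a minimal sketch rather than the FormNetV2 implementation. It assumes token boxes are given as (x0, y0, x1, y1) tuples and that two views of the same graph yield one embedding per node. The abstract does not specify the exact loss form or how views are constructed, so the InfoNCE-style objective, the temperature value, and the names `union_box` and `contrastive_loss` are illustrative assumptions.

```python
import math


def union_box(box_a, box_b):
    """Smallest axis-aligned box covering two token boxes (x0, y0, x1, y1).

    Illustrates "the bounding box that joins a pair of tokens"; image
    features for an edge would be pooled from inside this region.
    """
    return (
        min(box_a[0], box_b[0]),
        min(box_a[1], box_b[1]),
        max(box_a[2], box_b[2]),
        max(box_a[3], box_b[3]),
    )


def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(view_a, view_b, temperature=0.1):
    """InfoNCE-style loss (an assumed form, not taken from the paper).

    Node i in view_a is the positive for node i in view_b; every other
    node in view_b acts as a negative. Lower loss means higher agreement.
    """
    n = len(view_a)
    total = 0.0
    for i in range(n):
        logits = [cosine(view_a[i], view_b[j]) / temperature for j in range(n)]
        # Numerically stable log-sum-exp over all candidates.
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_denominator - logits[i]
    return total / n


# Two neighboring tokens on a form, joined by a graph edge.
edge_region = union_box((10, 20, 50, 30), (60, 22, 90, 34))

# Two nodes whose representations agree across views, then disagree.
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this sketch a single scalar is minimized regardless of which modalities feed the node embeddings, which is the sense in which one loss can cover all modalities; `aligned` comes out much smaller than `swapped` because matching nodes agree across the two views.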