The advent of multimodal learning has brought significant improvements to document AI. Documents are now treated as multimodal entities, incorporating both textual and visual information for downstream analysis. However, work in this space often focuses on the textual aspect, treating the visual modality as auxiliary information. While some works have explored purely vision-based techniques for document image understanding, they either require OCR-identified text as input during inference or do not align with text in their learning procedure. We therefore present a novel image-text alignment technique specially designed to leverage the textual information in document images to improve performance on visual tasks. Our document encoder model, DoPTA, trained with this technique, demonstrates strong performance on a wide range of document image understanding tasks without requiring OCR during inference. Combined with an auxiliary reconstruction objective, DoPTA consistently outperforms larger models while using significantly less pre-training compute. DoPTA also sets new state-of-the-art results on D4LA and FUNSD, two challenging document visual analysis benchmarks.
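The abstract does not specify the form of the alignment objective. As one plausible reading of "image-text alignment used only during pre-training," the sketch below pairs a CLIP-style symmetric contrastive loss between a document image encoder and an OCR-text encoder with an auxiliary patch-reconstruction head; all class names, dimensions, and loss choices are illustrative assumptions, not the authors' actual method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AlignmentPretrainer(nn.Module):
    """Hypothetical pre-training wrapper: contrastive image-text alignment
    plus an auxiliary pixel-reconstruction objective. Only the image encoder
    would be kept for OCR-free inference; the text branch is pre-training-only."""

    def __init__(self, dim=256, patch=16, img=224, vocab=30522):
        super().__init__()
        # toy image encoder: patch embedding + a small transformer
        self.patchify = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        # toy text encoder over OCR token ids (assumed available at pre-training)
        self.txt_emb = nn.Embedding(vocab, dim)
        self.img_proj = nn.Linear(dim, dim)
        self.txt_proj = nn.Linear(dim, dim)
        self.logit_scale = nn.Parameter(torch.tensor(2.6592))  # log(1/0.07), CLIP-style
        # auxiliary reconstruction head: regress raw pixels of each patch
        self.recon = nn.Linear(dim, patch * patch * 3)
        self.patch = patch

    def forward(self, images, ocr_tokens):
        # images: (B, 3, H, W); ocr_tokens: (B, T) token ids from the OCR text
        feats = self.patchify(images).flatten(2).transpose(1, 2)  # (B, N, dim)
        feats = self.encoder(feats)
        img_vec = F.normalize(self.img_proj(feats.mean(dim=1)), dim=-1)
        txt_vec = F.normalize(self.txt_proj(self.txt_emb(ocr_tokens).mean(dim=1)), dim=-1)

        # symmetric InfoNCE: matched (image, OCR-text) pairs sit on the diagonal
        logits = self.logit_scale.exp() * img_vec @ txt_vec.t()
        labels = torch.arange(images.size(0), device=images.device)
        align = (F.cross_entropy(logits, labels)
                 + F.cross_entropy(logits.t(), labels)) / 2

        # auxiliary reconstruction loss: predict the raw pixels of every patch
        target = F.unfold(images, self.patch, stride=self.patch).transpose(1, 2)
        recon = F.mse_loss(self.recon(feats), target)
        return align + recon

# usage: OCR tokens drive the loss at pre-training time only
model = AlignmentPretrainer()
loss = model(torch.randn(4, 3, 224, 224), torch.randint(0, 30522, (4, 32)))
loss.backward()
```

Because the text encoder appears only in the loss, discarding it after pre-training leaves an image-only encoder, which is consistent with the abstract's claim that no OCR is required at inference time.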