Visual document understanding is a complex task that involves analyzing both the text and the visual elements in document images. Existing models often rely on manual feature engineering or domain-specific pipelines, which limit their ability to generalize across document types and languages. In this paper, we propose DUBLIN, which is pretrained on web pages using three novel objectives (the Masked Document Content Generation Task, the Bounding Box Task, and the Rendered Question Answering Task) that leverage both the spatial and the semantic information in document images. Our model achieves competitive or state-of-the-art results on several benchmarks, covering Web-Based Structural Reading Comprehension, Document Visual Question Answering, Key Information Extraction, Diagram Understanding, and Table Question Answering. In particular, we show that DUBLIN is the first pixel-based model to achieve an EM of 77.75 and an F1 of 84.25 on the WebSRC dataset. We also show that our model outperforms the current pixel-based SoTA models on the DocVQA and AI2D datasets by 2% and 21%, respectively. Furthermore, DUBLIN is the first pixel-based model to achieve performance comparable to text-based SoTA methods on the XFUND dataset for Semantic Entity Recognition, showcasing its multilingual capability. Finally, we create new baselines for text-based datasets by rendering them as document images, to promote research in this direction.