Visual document understanding is a complex task that involves analyzing both the text and the visual elements in document images. Existing models often rely on manual feature engineering or domain-specific pipelines, which limit their ability to generalize across document types and languages. In this paper, we propose DUBLIN, a model pretrained on web pages with three novel objectives that leverage both the spatial and semantic information in document images: the Masked Document Text Generation Task, the Bounding Box Task, and the Rendered Question Answering Task. Our model achieves competitive or state-of-the-art results on several benchmarks, covering Web-Based Structural Reading Comprehension, Document Visual Question Answering, Key Information Extraction, Diagram Understanding, and Table Question Answering. In particular, we show that DUBLIN is the first pixel-based model to achieve an EM of 77.75 and an F1 of 84.25 on the WebSRC dataset. It also outperforms the current pixel-based SOTA models on the DocVQA, InfographicsVQA, OCR-VQA, and AI2D datasets by 4.6%, 6.5%, 2.6%, and 21%, respectively, and achieves competitive performance on RVL-CDIP document classification. Moreover, we create new baselines for text-based datasets by rendering them as document images to promote research in this direction.