Prior studies show that pre-training techniques can boost the performance of visual document understanding (VDU), which typically requires models to perceive and reason over both document texts and layouts (e.g., locations of texts and table cells). To this end, we propose visually guided generative text-layout pre-training, named ViTLP. Given a document image, the model optimizes hierarchical language and layout modeling objectives to generate interleaved text and layout sequences. In addition, to address the limitation of Transformers in processing long documents, we introduce a straightforward yet effective multi-segment generative pre-training scheme, enabling ViTLP to process word-intensive documents of arbitrary length. ViTLP can function as a native OCR model to localize and recognize the texts of document images. It can also be effectively applied to various downstream VDU tasks. Extensive experiments show that ViTLP achieves competitive performance against existing baselines on benchmark VDU tasks, including information extraction, document classification, and document question answering.
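To make the notion of an interleaved text-layout target sequence concrete, the following is a minimal sketch of how such a sequence might be constructed from OCR-style word and bounding-box annotations. The token format (`<loc_*>`), the 1000-bin coordinate quantization, and the function names are illustrative assumptions, not the paper's exact scheme.

```python
# Hypothetical sketch: build an interleaved text-layout target sequence.
# Coordinate quantization into 1000 bins and the <loc_*> token format are
# assumptions for illustration, not ViTLP's actual tokenization.

def quantize_bbox(bbox, width, height, bins=1000):
    """Map a pixel bounding box (x0, y0, x1, y1) to discrete layout bins."""
    x0, y0, x1, y1 = bbox
    return [
        min(bins - 1, int(x0 / width * bins)),
        min(bins - 1, int(y0 / height * bins)),
        min(bins - 1, int(x1 / width * bins)),
        min(bins - 1, int(y1 / height * bins)),
    ]

def build_target_sequence(words, bboxes, width, height):
    """Interleave each word with its quantized layout tokens."""
    seq = []
    for word, bbox in zip(words, bboxes):
        seq.append(word)  # text token(s) for the word
        seq.extend(f"<loc_{v}>" for v in quantize_bbox(bbox, width, height))
    return seq

# Example: two words on a 1000x500-pixel page.
seq = build_target_sequence(
    ["Total:", "$42"],
    [(10, 20, 80, 40), (90, 20, 130, 40)],
    width=1000, height=500,
)
```

A generative model trained on such sequences learns to emit a word followed by its location, which is what lets it double as an OCR model that both recognizes and localizes text.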