In this paper, we present StrucTexTv2, an effective document image pre-training framework based on masked visual-textual prediction. It consists of two self-supervised pre-training tasks, masked image modeling and masked language modeling, both built on text region-level image masking. The proposed method randomly masks image regions according to the bounding box coordinates of text words. The pre-training objectives are to reconstruct the pixels of the masked image regions and to predict the corresponding masked tokens simultaneously. As a result, the pre-trained encoder can capture more textual semantics than conventional masked image modeling, which usually predicts masked image patches. Compared with masked multi-modal modeling methods for document image understanding, which rely on both the image and text modalities, StrucTexTv2 takes image-only input and can potentially handle more application scenarios because it needs no OCR pre-processing. Extensive experiments on mainstream document image understanding benchmarks demonstrate the effectiveness of StrucTexTv2. It achieves competitive or even new state-of-the-art performance on various downstream tasks, such as image classification, layout analysis, table structure recognition, document OCR, and information extraction in the end-to-end scenario.
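The text region-level masking step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `mask_text_regions` and `masked_pixel_mse`, the 30% default mask ratio, the constant fill value, and the mean-squared-error pixel loss are all assumptions made for illustration. The sketch picks a random subset of word bounding boxes, blanks the corresponding pixels, and scores a reconstruction only inside the masked regions. The indices of the selected words are returned as well, since the tokens of those words would be the targets of the masked language modeling task.

```python
import numpy as np


def mask_text_regions(image, word_boxes, mask_ratio=0.3, fill_value=0, seed=None):
    """Mask a random subset of word regions in a document image.

    image:      array of shape (H, W, C)
    word_boxes: list of (x0, y0, x1, y1) pixel boxes, one per text word
    mask_ratio: fraction of words to mask (hypothetical default)
    Returns the masked image, a boolean (H, W) mask, and the chosen word indices.
    """
    rng = np.random.default_rng(seed)
    masked = image.copy()
    region_mask = np.zeros(image.shape[:2], dtype=bool)

    n_words = len(word_boxes)
    if n_words == 0:
        return masked, region_mask, np.array([], dtype=int)

    # Mask at least one word, and never more words than are available.
    n_masked = min(n_words, max(1, round(mask_ratio * n_words)))
    chosen = np.sort(rng.choice(n_words, size=n_masked, replace=False))

    for i in chosen:
        x0, y0, x1, y1 = word_boxes[i]
        masked[y0:y1, x0:x1] = fill_value   # hide the word's pixels
        region_mask[y0:y1, x0:x1] = True    # remember where to reconstruct

    return masked, region_mask, chosen


def masked_pixel_mse(reconstruction, target, region_mask):
    """Mean squared pixel error restricted to the masked regions (assumed loss)."""
    if not region_mask.any():
        return 0.0
    diff = reconstruction.astype(np.float64) - target.astype(np.float64)
    return float(np.mean(diff[region_mask] ** 2))


if __name__ == "__main__":
    page = np.full((32, 64, 3), 255, dtype=np.uint8)  # blank white "page"
    boxes = [(2, 2, 10, 8), (12, 2, 20, 8), (22, 2, 30, 8), (32, 2, 40, 8)]
    masked_page, mask, chosen = mask_text_regions(page, boxes, mask_ratio=0.5, seed=0)
    print("masked word indices:", chosen)
    print("loss of an untrained 'reconstruction':", masked_pixel_mse(masked_page, page, mask))
```

In a real pipeline the word boxes would only be needed to build the pre-training targets; consistent with the image-only design described above, the encoder itself would see nothing but the masked image.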