Considerable effort has gone into document image rectification, but how to learn effective representations of distorted document images remains under-explored. In this paper, we present DocMAE, a novel self-supervised framework for document image rectification. Our motivation is to use a masked autoencoder to encode the structural cues in document images, i.e., the document boundaries and text lines, that benefit rectification. Specifically, we first mask random patches of the background-excluded document images and then reconstruct the missing pixels. With this self-supervised learning approach, the network is encouraged to learn the intrinsic structure of deformed documents by restoring document boundaries and missing text lines. Extensive experiments on transfer to the downstream rectification task validate the effectiveness of our method.
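The masking and reconstruction step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the patch size of 16, the mask ratio of 0.75, and the choice to compute the loss only on hidden patches are assumptions borrowed from common masked-autoencoder practice, and the foreground mask is assumed to be given by some upstream segmentation step.

```python
import numpy as np


def mask_random_patches(image, foreground, patch=16, mask_ratio=0.75, seed=0):
    """Exclude the background, then hide a random subset of square patches.

    image:      (H, W, C) float array, H and W divisible by `patch`
    foreground: (H, W) boolean array, True on document pixels
    Returns (masked, hidden, doc): the masked input, a boolean
    (H // patch, W // patch) grid that is True where a patch was hidden,
    and the background-excluded image used as the reconstruction target.
    All parameter defaults are illustrative assumptions.
    """
    h, w, _ = image.shape
    if h % patch or w % patch:
        raise ValueError("image size must be divisible by the patch size")
    rng = np.random.default_rng(seed)

    # Background-excluded image: pixels outside the document are zeroed.
    doc = image * foreground[..., None]

    # Pick which patches to hide, without replacement.
    gh, gw = h // patch, w // patch
    n_hidden = int(round(mask_ratio * gh * gw))
    hidden = np.zeros(gh * gw, dtype=bool)
    hidden[rng.choice(gh * gw, size=n_hidden, replace=False)] = True
    hidden = hidden.reshape(gh, gw)

    # Expand the patch grid to pixel resolution and blank the hidden patches.
    pixel_hidden = np.repeat(np.repeat(hidden, patch, axis=0), patch, axis=1)
    masked = np.where(pixel_hidden[..., None], 0.0, doc)
    return masked, hidden, doc


def masked_reconstruction_loss(pred, target, hidden, patch=16):
    """Mean squared error over the hidden patches only (assumed loss)."""
    pixel_hidden = np.repeat(np.repeat(hidden, patch, axis=0), patch, axis=1)
    err = ((pred - target) ** 2).mean(axis=-1)
    return float(err[pixel_hidden].mean())
```

In a real pipeline, `pred` would be the output of an encoder-decoder network fed with `masked`; the sketch only shows how the input is corrupted and where the reconstruction error is measured.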