In recent years, considerable effort has gone into document image rectification, but existing state-of-the-art algorithms can only process restricted document images, i.e., the input image must contain a complete document. When the captured image covers only a local text region, rectification quality degrades noticeably. Our previously proposed DocTr, a transformer-assisted network for document image rectification, also suffers from this limitation. In this work, we present DocTr++, a novel unified framework for document image rectification that places no restrictions on the input distorted images. Our main technical improvements fall into three areas. First, we upgrade the original architecture by adopting a hierarchical encoder-decoder structure for multi-scale representation extraction and parsing. Second, we reformulate the pixel-wise mapping relationship between unrestricted distorted document images and their distortion-free counterparts. The resulting data is used to train DocTr++ for unrestricted document image rectification. Third, we contribute a real-world test set and metrics suitable for evaluating rectification quality. To the best of our knowledge, this is the first learning-based method for the rectification of unrestricted document images. Extensive experiments are conducted, and the results demonstrate the effectiveness and superiority of our method. We hope DocTr++ will serve as a strong baseline for generic document image rectification, prompting further advancement and application of learning-based algorithms. The source code and the proposed dataset are publicly available at https://github.com/fh2019ustc/DocTr-Plus.
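To make the notion of a pixel-wise mapping concrete, the following is a minimal sketch of how such a mapping could be applied once it is known. It assumes a backward-map convention (each output pixel stores the location to read from in the distorted image) and bilinear interpolation; both are illustrative assumptions, and the names `bilinear_sample`, `rectify`, and `shift_map` are hypothetical and not taken from the DocTr++ code base. In the actual method the mapping would be predicted by the network; here it is written by hand for a toy image.

```python
from typing import List, Tuple

Image = List[List[float]]                        # grayscale image as rows of pixel values
BackwardMap = List[List[Tuple[float, float]]]    # per output pixel: (x, y) in the source


def bilinear_sample(img: Image, x: float, y: float) -> float:
    """Sample img at a fractional (x, y) location, clamping to the image border."""
    h, w = len(img), len(img[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    # Interpolate along x on the two neighbouring rows, then along y.
    top = img[y0][x0] * (1 - fx) + img[y0][x1] * fx
    bottom = img[y1][x0] * (1 - fx) + img[y1][x1] * fx
    return top * (1 - fy) + bottom * fy


def rectify(distorted: Image, backward_map: BackwardMap) -> Image:
    """For each output pixel, look up its source location in the distorted image."""
    return [[bilinear_sample(distorted, x, y) for (x, y) in row]
            for row in backward_map]


# Toy example (assumed): the "distortion" is a one-pixel horizontal shift.
distorted = [[0.0, 1.0, 2.0, 3.0],
             [4.0, 5.0, 6.0, 7.0]]
# Output pixel (row r, col c) reads from source location (c + 1, r).
shift_map = [[(c + 1.0, float(r)) for c in range(3)] for r in range(2)]
rectified = rectify(distorted, shift_map)
```

Because the map is defined per output pixel, the output size is set by the map rather than by the input, so the same routine works whether the input shows a complete document or only a local region.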