In recent years, considerable effort has been devoted to document image rectification, but existing advanced algorithms are limited to restricted document images, i.e., the input image must contain a complete document. When the captured image covers only a local text region, rectification quality degrades and becomes unsatisfactory. Our previously proposed DocTr, a transformer-assisted network for document image rectification, also suffers from this limitation. In this work, we present DocTr++, a novel unified framework for document image rectification that places no restrictions on the input distorted images. Our major technical improvements can be summarized in three aspects. First, we upgrade the original architecture by adopting a hierarchical encoder-decoder structure for multi-scale representation extraction and parsing. Second, we reformulate the pixel-wise mapping relationship between unrestricted distorted document images and their distortion-free counterparts. The resulting data are used to train DocTr++ for unrestricted document image rectification. Third, we contribute a real-world test set and metrics suitable for evaluating rectification quality. To the best of our knowledge, this is the first learning-based method for the rectification of unrestricted document images. Extensive experiments demonstrate the effectiveness and superiority of our method. We hope DocTr++ will serve as a strong baseline for generic document image rectification, promoting the further advancement and application of learning-based algorithms. The source code and the proposed dataset are publicly available at https://github.com/fh2019ustc/DocTr-Plus.
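To make the notion of a pixel-wise mapping between a distorted image and its distortion-free counterpart concrete, the following is a minimal sketch of how such a mapping could be applied once it has been predicted. It is not the DocTr++ implementation: the abstract does not specify the mapping's format, so the sketch assumes a dense backward map that stores, for every pixel of the rectified output, the (row, column) coordinates to sample in the distorted input. The names `bilinear_sample` and `rectify` are hypothetical, and plain Python lists stand in for image tensors.

```python
from typing import List, Sequence, Tuple

Image = Sequence[Sequence[float]]
# Assumed format: backward_map[r][c] = (y, x) position in the distorted image
# whose value should appear at pixel (r, c) of the rectified image.
BackwardMap = Sequence[Sequence[Tuple[float, float]]]


def bilinear_sample(image: Image, y: float, x: float) -> float:
    """Sample a grayscale image at a fractional (y, x) position."""
    height, width = len(image), len(image[0])
    # Clamp to the image bounds so out-of-range coordinates reuse edge pixels.
    y = min(max(y, 0.0), height - 1.0)
    x = min(max(x, 0.0), width - 1.0)
    y0, x0 = int(y), int(x)
    y1, x1 = min(y0 + 1, height - 1), min(x0 + 1, width - 1)
    dy, dx = y - y0, x - x0
    # Interpolate along x on the two neighbouring rows, then along y.
    top = image[y0][x0] * (1 - dx) + image[y0][x1] * dx
    bottom = image[y1][x0] * (1 - dx) + image[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def rectify(distorted: Image, backward_map: BackwardMap) -> List[List[float]]:
    """Build the rectified image by sampling the distorted one at mapped positions."""
    return [
        [bilinear_sample(distorted, y, x) for (y, x) in row]
        for row in backward_map
    ]


if __name__ == "__main__":
    # Toy "distortion": the content is shifted one column to the right.
    distorted = [[0, 1, 2, 3],
                 [0, 4, 5, 6]]
    # The backward map undoes the shift: output (r, c) reads input (r, c + 1).
    shift_map = [[(float(r), c + 1.0) for c in range(3)] for r in range(2)]
    print(rectify(distorted, shift_map))
```

Because the map is defined over the output grid rather than the input, it makes no assumption that the input contains a complete document. In practice, a tensor library's resampling routine would replace these Python loops.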