Research on document image analysis and recognition in photographic scenarios has recently attracted growing interest. However, the lack of labeled datasets for this emerging challenge is a significant obstacle, because manual annotation is time-consuming and often impractical. To tackle this issue, we present DocAligner, a novel method that reduces the manual annotation process to the simple step of taking pictures. DocAligner achieves this by establishing dense correspondence between photographic document images and their clean counterparts. This correspondence allows existing annotations on clean document images to be transferred automatically to the photographic ones, and it also yields labels that cannot be obtained through manual labeling. To address the distinctive characteristics of document images, DocAligner incorporates several innovative features. First, we propose a non-rigid pre-alignment technique based on the document's edges, which effectively eliminates interference from large global shifts and from the repetitive patterns common in document images. Second, to handle large shifts while maintaining high accuracy, we introduce a hierarchical aligning approach that combines global and local correlation layers. Third, because fine-grained elements are important in document images, we present a details recurrent refinement module that refines the output in a high-resolution space. To train DocAligner, we construct a synthetic dataset and introduce a self-supervised learning approach that improves its robustness on real-world data. Extensive experiments demonstrate the effectiveness of DocAligner and of the dataset acquired with it. The datasets and code will be made publicly available.
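To make the annotation-transfer idea concrete, the sketch below shows how a dense correspondence field could be used to carry labels from a clean document image into the photographic frame. The abstract does not state DocAligner's output format, so this assumes a per-pixel backward map of shape (H_photo, W_photo, 2) in which entry (y, x) holds the clean-image coordinate (x_c, y_c) matching photo pixel (x, y). The function names `transfer_mask` and `transfer_points` are hypothetical and are not taken from the DocAligner codebase. Pixel-level labels are pulled through the map with nearest-neighbour sampling, and point annotations (for example, box corners) are located by a brute-force nearest search.

```python
import numpy as np


def transfer_mask(clean_mask: np.ndarray, backward_map: np.ndarray) -> np.ndarray:
    """Warp a per-pixel label map from the clean image into the photo frame.

    Assumption (hypothetical format): backward_map[y, x] = (x_c, y_c) is the
    clean-image coordinate corresponding to photo pixel (x, y).
    Nearest-neighbour sampling keeps integer class labels valid.
    """
    h_c, w_c = clean_mask.shape
    xs = np.rint(backward_map[..., 0]).astype(int)
    ys = np.rint(backward_map[..., 1]).astype(int)
    # Photo pixels that map outside the clean image keep label 0 (background).
    inside = (xs >= 0) & (xs < w_c) & (ys >= 0) & (ys < h_c)
    out = np.zeros(backward_map.shape[:2], dtype=clean_mask.dtype)
    out[inside] = clean_mask[ys[inside], xs[inside]]
    return out


def transfer_points(points: np.ndarray, backward_map: np.ndarray) -> np.ndarray:
    """Map clean-image points of shape (N, 2), given as (x, y), to photo coordinates.

    For each clean point, return the photo pixel whose mapped clean coordinate
    is closest (brute force, so only suitable for small images or few points).
    """
    flat = backward_map.reshape(-1, 2)
    w_p = backward_map.shape[1]
    result = []
    for p in points:
        idx = int(np.argmin(np.sum((flat - p) ** 2, axis=1)))
        result.append((idx % w_p, idx // w_p))  # flat index -> (x, y)
    return np.array(result)


if __name__ == "__main__":
    # Toy example: the photo (4x5) sees the clean page (6x7) shifted by (2, 1).
    xg, yg = np.meshgrid(np.arange(5), np.arange(4))
    bmap = np.stack([xg + 2, yg + 1], axis=-1).astype(float)
    clean = np.arange(42).reshape(6, 7)
    print(transfer_mask(clean, bmap))
    print(transfer_points(np.array([[4, 3]]), bmap))
```

A real correspondence field is sub-pixel and non-rigid; the pure shift here only keeps the example easy to check. Rounding to the nearest pixel is a deliberate choice for categorical labels, while continuous-valued targets would more naturally use bilinear sampling.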