DocAligner: Annotating Real-world Photographic Document Images by Simply Taking Pictures

Recently, there has been growing interest in document image analysis and recognition in photographic scenarios. However, the lack of labeled datasets for this emerging task poses a significant obstacle, as manual annotation is time-consuming and often impractical. To tackle this issue, we present DocAligner, a novel method that reduces the manual annotation process to the simple step of taking pictures. DocAligner achieves this by establishing dense correspondence between photographic document images and their clean counterparts. This allows existing annotations on clean document images to be transferred automatically to photographic ones, and it also helps acquire labels that cannot be obtained through manual labeling. To account for the distinctive characteristics of document images, DocAligner incorporates several innovations. First, we propose a non-rigid pre-alignment technique based on the document's edges, which effectively eliminates interference caused by large global shifts and by the repetitive patterns present in document images. Second, to handle large shifts while ensuring high accuracy, we introduce a hierarchical alignment approach that combines global and local correlation layers. Third, given the importance of fine-grained elements in document images, we present a details recurrent refinement module that enhances the output in a high-resolution space. To train DocAligner, we construct a synthetic dataset and introduce a self-supervised learning approach that improves its robustness on real-world data. Extensive experiments demonstrate the effectiveness of DocAligner and of the acquired dataset. The datasets and code will be made publicly available.
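The annotation-transfer step described above can be illustrated with a minimal sketch. This is not DocAligner's code: it assumes the aligner's output is a hypothetical backward map `backward_map` of shape (H, W, 2) that gives, for every pixel of the photographic image, the (x, y) pixel coordinates of its match in the clean image, and it uses a hypothetical helper `transfer_labels` to carry a per-pixel label map (for example a text or layout mask) from the clean image into the photo.

```python
import numpy as np


def transfer_labels(clean_labels: np.ndarray, backward_map: np.ndarray,
                    fill_value: int = 0) -> np.ndarray:
    """Warp a per-pixel label map from the clean image into the photo's frame.

    ASSUMPTION: the map layout below is hypothetical, not DocAligner's
    documented output format.

    clean_labels: (Hc, Wc) integer class ids defined on the clean document image.
    backward_map: (Hp, Wp, 2) float array; backward_map[y, x] = (u, v) means photo
        pixel (x, y) corresponds to clean-image column u, row v.
    Photo pixels whose match falls outside the clean image get fill_value.
    """
    hc, wc = clean_labels.shape
    # Nearest-neighbour lookup: round the matched coordinates to whole pixels.
    u = np.rint(backward_map[..., 0]).astype(np.int64)
    v = np.rint(backward_map[..., 1]).astype(np.int64)
    # Only pixels that land inside the clean image can inherit a label.
    valid = (u >= 0) & (u < wc) & (v >= 0) & (v < hc)
    out = np.full(backward_map.shape[:2], fill_value, dtype=clean_labels.dtype)
    out[valid] = clean_labels[v[valid], u[valid]]
    return out


if __name__ == "__main__":
    # Toy clean label map: a 2x2 block of class 1 in rows 1-2, columns 1-2.
    clean = np.zeros((4, 4), dtype=np.int64)
    clean[1:3, 1:3] = 1
    # Toy correspondence: each photo pixel matches the clean pixel one column to its right.
    ys, xs = np.mgrid[0:4, 0:4]
    backward_map = np.stack([xs + 1.0, ys.astype(float)], axis=-1)
    # The block appears shifted one column to the left in the photo's frame.
    print(transfer_labels(clean, backward_map))
```

Nearest-neighbour sampling is used here because bilinear interpolation would blend neighbouring class ids into meaningless values; for continuous targets such as the clean image's pixel intensities, interpolation would be the natural choice. Vector annotations such as bounding boxes would need either rasterization into a mask first or a map in the opposite direction.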
