Document layout analysis is a well-known problem in the document research community and has been explored extensively, yielding a multitude of solutions ranging from text mining and recognition to graph-based representation and visual feature extraction. However, most existing works have ignored the scarcity of labeled data. As internet connectivity has spread into personal life, an enormous number of documents have become available in the public domain, which makes data annotation a tedious task. We address this challenge using self-supervision. Unlike the few existing self-supervised document segmentation approaches, which use text mining and textual labels, we use a completely vision-based approach in pre-training, without any ground-truth label or its derivative. Instead, we generate pseudo-layouts from the document images and use them to pre-train an image encoder that learns document object representation and localization in a self-supervised framework, before fine-tuning it with an object detection model. We show that our pipeline sets a new benchmark in this context and performs on par with, if not better than, the existing methods and the supervised counterparts. The code is made publicly available at: https://github.com/MaitySubhajit/SelfDocSeg
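The abstract does not say how the pseudo-layouts are produced, so the following is only a minimal sketch of one plausible label-free route, assuming classical image processing: threshold the page, dilate the ink pixels into blobs, and take the bounding box of each connected blob. The function names `pseudo_layout_mask` and `pseudo_boxes`, the threshold, and the dilation radius are all hypothetical and are not taken from the SelfDocSeg code.

```python
from collections import deque

import numpy as np


def pseudo_layout_mask(gray, ink_threshold=128, radius=2):
    """Binarize a grayscale page and dilate ink pixels into coarse blobs.

    Assumption: dark pixels (below `ink_threshold`) are document content.
    """
    ink = gray < ink_threshold
    h, w = ink.shape
    padded = np.pad(ink, radius, mode="constant", constant_values=False)
    mask = np.zeros_like(ink)
    # Square dilation: a pixel is on if any pixel within `radius` is ink.
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            mask |= padded[dy:dy + h, dx:dx + w]
    return mask


def pseudo_boxes(mask):
    """Return inclusive (x0, y0, x1, y1) boxes of 4-connected components."""
    h, w = mask.shape
    seen = np.zeros((h, w), dtype=bool)
    boxes = []
    for y in range(h):
        for x in range(w):
            if not mask[y, x] or seen[y, x]:
                continue
            # Breadth-first flood fill from an unvisited foreground pixel.
            queue = deque([(y, x)])
            seen[y, x] = True
            x0 = x1 = x
            y0 = y1 = y
            while queue:
                cy, cx = queue.popleft()
                x0, x1 = min(x0, cx), max(x1, cx)
                y0, y1 = min(y0, cy), max(y1, cy)
                for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                               (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < h and 0 <= nx < w
                            and mask[ny, nx] and not seen[ny, nx]):
                        seen[ny, nx] = True
                        queue.append((ny, nx))
            boxes.append((x0, y0, x1, y1))
    return boxes
```

Boxes obtained this way require no annotation, so they could in principle serve as localization targets during pre-training; how the pipeline described above actually uses its pseudo-layouts is not specified here.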