RanLayNet: A Dataset for Document Layout Detection used for Domain Adaptation and Generalization

Large ground-truth datasets and recent advances in deep learning have driven progress in document layout detection. However, because the layout diversity of these datasets is limited, training on them requires a sizable number of annotated instances, which is both expensive and time-consuming. As a result, differences between the source and target domains can significantly degrade model performance. To address this problem, domain adaptation approaches have been developed that use a small quantity of labeled data to adapt the model to the target domain. In this work, we introduce RanLayNet, a synthetic document dataset enriched with automatically assigned labels denoting the spatial positions, ranges, and types of layout elements. The primary aim is to provide a versatile dataset for training models that are robust and adaptable to diverse document formats. Through empirical experiments, we demonstrate that a deep layout identification model trained on our dataset outperforms a model trained solely on real documents. Moreover, we conduct a comparative analysis by fine-tuning inference models with both the PubLayNet and IIIT-AR-13K datasets on the DocLayNet dataset. Our findings indicate that models enriched with our dataset perform best, achieving mAP95 scores of 0.398 and 0.588 for the TABLE class in the scientific document domain.
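To make the idea of "automatically assigned labels denoting spatial positions, ranges, and types" concrete, here is a minimal sketch of how a randomly generated layout can carry its own annotations. It is not the authors' generation pipeline, which the abstract does not describe. The class list, the page size, the `[x, y, w, h]` box convention, and the function names `boxes_overlap` and `random_layout` are all assumptions made for illustration.

```python
import random

# Hypothetical class list; the abstract names only the TABLE class explicitly.
CLASSES = ["text", "title", "list", "table", "figure"]


def boxes_overlap(a, b):
    """Return True if two [x, y, w, h] boxes intersect with positive area."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah


def random_layout(page_w=800, page_h=1000, n_elements=6, seed=None, max_tries=1000):
    """Place up to n_elements non-overlapping boxes at random and label each one.

    Every accepted box gets its position, extent, and element type recorded
    at creation time, so no manual annotation step is needed.
    """
    rng = random.Random(seed)
    annotations = []
    tries = 0
    while len(annotations) < n_elements and tries < max_tries:
        tries += 1
        w = rng.randint(100, page_w // 2)
        h = rng.randint(50, page_h // 4)
        x = rng.randint(0, page_w - w)
        y = rng.randint(0, page_h - h)
        box = [x, y, w, h]
        # Reject candidates that collide with an already placed element.
        if any(boxes_overlap(box, ann["bbox"]) for ann in annotations):
            continue
        annotations.append({
            "id": len(annotations),
            "category": rng.choice(CLASSES),
            "bbox": box,
        })
    return annotations


if __name__ == "__main__":
    for ann in random_layout(seed=0):
        print(ann)
```

In this sketch the labels are a by-product of generation: each element's box and type are known when it is placed, which is the property that makes a synthetic dataset cheap to annotate.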

