RanLayNet: A Dataset for Document Layout Detection used for Domain Adaptation and Generalization

Large ground-truth datasets and recent advances in deep learning have proven useful for layout detection. However, because these datasets offer limited layout diversity, training on them requires a sizable number of annotated instances, which is both expensive and time-consuming. As a result, differences between the source and target domains can significantly degrade model performance. To address this problem, domain adaptation approaches use a small quantity of labeled data to adjust the model to the target domain. In this work, we introduce RanLayNet, a synthetic document dataset enriched with automatically assigned labels denoting the spatial positions, ranges, and types of layout elements. Our primary aim is a versatile dataset for training models that are robust and adaptable to diverse document formats. Through empirical experiments, we demonstrate that a deep layout identification model trained on our dataset outperforms a model trained solely on actual documents. Moreover, we conduct a comparative analysis by fine-tuning inference models, using both the PubLayNet and IIIT-AR-13K datasets, on the DocLayNet dataset. Our findings indicate that models enriched with our dataset perform best, achieving mAP95 scores of 0.398 and 0.588 for the TABLE class in the scientific document domain.
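To make the idea of a synthetic layout with automatically assigned labels concrete, the following is a minimal sketch, not the authors' generation procedure, which the abstract does not describe. It places non-overlapping rectangles on a blank page and records each one's position, extent, and type. The class list, page size, size ranges, and the names `random_layout` and `overlaps` are all illustrative assumptions.

```python
import random

# Hypothetical class list; the abstract names only the TABLE class explicitly.
CLASSES = ["text", "title", "table", "figure"]


def overlaps(a, b):
    """Return True if two (x, y, w, h) boxes intersect."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah


def random_layout(page_w=800, page_h=1000, n_elements=5, max_tries=200, seed=None):
    """Sample up to n_elements non-overlapping labeled boxes on a page.

    Each annotation stores the box as (x, y, width, height) plus a
    category, so the label comes for free from the generation step.
    """
    rng = random.Random(seed)
    annotations = []
    tries = 0
    while len(annotations) < n_elements and tries < max_tries:
        tries += 1
        # Assumed size ranges; chosen only so boxes fit on the page.
        w = rng.randint(100, page_w // 2)
        h = rng.randint(50, page_h // 4)
        x = rng.randint(0, page_w - w)
        y = rng.randint(0, page_h - h)
        box = (x, y, w, h)
        # Rejection sampling: discard candidates that hit an existing box.
        if any(overlaps(box, a["bbox"]) for a in annotations):
            continue
        annotations.append({"bbox": box, "category": rng.choice(CLASSES)})
    return annotations


if __name__ == "__main__":
    for ann in random_layout(seed=0):
        print(ann)
```

In a real pipeline, each box would be filled with a crop of the matching element type before the page is rendered; the sketch covers only the labeling side.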
