Large ground-truth datasets and recent advances in deep learning have driven progress in layout detection. However, because these datasets offer limited layout diversity, training on them requires a sizable number of annotated instances, which are expensive and time-consuming to produce. As a result, differences between the source and target domains can significantly degrade model performance. To address this problem, domain adaptation approaches use a small quantity of labeled data to adapt a model to the target domain. In this work, we introduce RanLayNet, a synthetic document dataset enriched with automatically assigned labels denoting the spatial positions, ranges, and types of layout elements. Our primary aim is to provide a versatile dataset for training models that are robust and adaptable to diverse document formats. Through empirical experiments, we demonstrate that a deep layout identification model trained on our dataset outperforms a model trained solely on actual documents. Moreover, we conduct a comparative analysis by fine-tuning inference models using both PubLayNet and IIIT-AR-13K datasets on the DocLayNet dataset. Our findings show that models enriched with our dataset perform best, achieving mAP95 scores of 0.398 and 0.588 for the TABLE class in the scientific document domain.
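To make the idea of automatically assigned layout labels concrete, the following is a minimal sketch of how a synthetic page layout with labels could be generated. It is not the RanLayNet generation procedure, which the abstract does not specify. The function name `random_layout`, the category list, the page size, and the COCO-style `[x, y, width, height]` box convention are all assumptions made for illustration.

```python
import random

# Hypothetical class list; the abstract only names the TABLE class explicitly.
CATEGORIES = ["text", "title", "list", "table", "figure"]


def overlaps(a, b):
    """Return True if two (x, y, w, h) boxes intersect with positive area."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah


def random_layout(page_w=612, page_h=792, n_elements=6, max_tries=200, seed=None):
    """Place up to n_elements non-overlapping boxes at random on a page.

    Each box is labeled at creation time, so position, extent, and type
    are known without manual annotation. Rejection sampling is an
    assumed placement strategy, chosen here for simplicity.
    """
    rng = random.Random(seed)
    boxes = []
    annotations = []
    tries = 0
    while len(boxes) < n_elements and tries < max_tries:
        tries += 1
        w = rng.randint(page_w // 6, page_w // 2)
        h = rng.randint(page_h // 12, page_h // 4)
        x = rng.randint(0, page_w - w)
        y = rng.randint(0, page_h - h)
        box = (x, y, w, h)
        # Reject candidates that collide with an already placed element.
        if any(overlaps(box, other) for other in boxes):
            continue
        boxes.append(box)
        category_id = rng.randrange(len(CATEGORIES))
        annotations.append({
            "id": len(annotations),
            "category_id": category_id,
            "category_name": CATEGORIES[category_id],
            "bbox": [x, y, w, h],  # COCO-style: top-left corner plus size
            "area": w * h,
        })
    return annotations


if __name__ == "__main__":
    for ann in random_layout(seed=0):
        print(ann["category_name"], ann["bbox"])
```

In a sketch like this, each annotation would be paired with a rendered page image in which the corresponding element is drawn inside its box, so that a detector can be trained on image and label pairs.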