Large ground-truth datasets and recent advances in deep learning have proven useful for layout detection. However, because these datasets offer limited layout diversity, training on them requires a sizable number of annotated instances, which is both expensive and time-consuming. As a result, differences between the source and target domains can significantly degrade model performance. To address this problem, domain adaptation approaches have been developed that use a small quantity of labeled data to adapt the model to the target domain. In this work, we introduce a synthetic document dataset called RanLayNet, enriched with automatically assigned labels denoting the spatial positions, ranges, and types of layout elements. Our primary aim is to develop a versatile dataset for training models that are robust and adaptable to diverse document formats. Through empirical experimentation, we demonstrate that a deep layout identification model trained on our dataset outperforms a model trained solely on actual documents. Moreover, we conduct a comparative analysis by fine-tuning inference models using both the PubLayNet and IIIT-AR-13K datasets on the DocLayNet dataset. Our findings indicate that models enriched with our dataset perform best on such tasks, achieving mAP95 scores of 0.398 and 0.588 for the TABLE class in the scientific document domain.
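The idea of automatically assigned labels can be illustrated with a minimal sketch. It is not the RanLayNet generation procedure, which the abstract does not describe. It only shows how a synthetic layout generator might record the position, extent, and type of each element at the moment it places it. The category names, page size, box-size ranges, and the `[x, y, w, h]` box convention are all assumptions made for illustration.

```python
import random

# Hypothetical category names; the abstract does not list the label set.
CATEGORIES = ["text", "title", "list", "table", "figure"]


def overlaps(a, b):
    """Return True if two [x, y, w, h] boxes share any interior area."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah


def random_layout(page_w=600, page_h=800, n_elements=5, seed=None, max_tries=1000):
    """Place up to n_elements non-overlapping boxes and label each one automatically."""
    rng = random.Random(seed)
    annotations = []
    tries = 0
    while len(annotations) < n_elements and tries < max_tries:
        tries += 1
        # Assumed size ranges, chosen only so that several boxes fit on a page.
        w = rng.randint(80, page_w // 2)
        h = rng.randint(40, page_h // 4)
        x = rng.randint(0, page_w - w)
        y = rng.randint(0, page_h - h)
        box = [x, y, w, h]
        # Reject candidates that collide with an element already placed.
        if any(overlaps(box, ann["bbox"]) for ann in annotations):
            continue
        # The label is known by construction, so no manual annotation is needed.
        annotations.append({"bbox": box, "category": rng.choice(CATEGORIES)})
    return annotations


if __name__ == "__main__":
    for ann in random_layout(seed=0):
        print(ann)
```

Because the generator decides where each element goes, the annotation costs nothing extra. This is the property that makes synthetic layouts attractive when annotated real documents are scarce.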