Document layout analysis is central to automated information extraction and interpretation. In this work, we train an MViTv2 transformer backbone with a Cascade Mask R-CNN on the BaDLAD dataset to extract text boxes, paragraphs, images, and tables from documents. After training on 20,365 document images for 36 epochs in a three-phase cycle, we achieve a training loss of 0.2125 and a mask loss of 0.19. Beyond training, we explore several potential enhancements: the impact of rotation and flip augmentation, the effectiveness of slicing input images before inference, the effect of varying the resolution of the transformer backbone, and the use of dual-pass inference to recover missed text boxes. The outcomes are mixed: some modifications yield tangible performance improvements, while others offer insights for future work.
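As an illustration of what slicing input images before inference can look like, the following minimal sketch computes overlapping tile coordinates for a page image and maps a tile-local box back to full-page coordinates. The function names (`slice_boxes`, `to_full_image`), the tile size of 1024 pixels, and the 20% overlap are assumptions chosen for illustration; the abstract does not state the values or the procedure actually used.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels


def slice_boxes(width: int, height: int, tile: int = 1024,
                overlap: float = 0.2) -> List[Box]:
    """Return overlapping tile boxes that together cover the whole image.

    `tile` and `overlap` are hypothetical defaults, not values from the paper.
    """
    if tile <= 0 or not 0 <= overlap < 1:
        raise ValueError("tile must be positive and overlap must be in [0, 1)")
    stride = max(1, int(tile * (1 - overlap)))

    def starts(length: int) -> List[int]:
        # An image side no longer than one tile needs a single slice.
        if length <= tile:
            return [0]
        positions = list(range(0, length - tile, stride))
        # The final slice sits flush with the image border.
        positions.append(length - tile)
        return positions

    return [(x, y, min(x + tile, width), min(y + tile, height))
            for y in starts(height) for x in starts(width)]


def to_full_image(box: Box, tile_box: Box) -> Box:
    """Shift a box predicted inside a tile back to full-image coordinates."""
    dx, dy = tile_box[0], tile_box[1]
    return (box[0] + dx, box[1] + dy, box[2] + dx, box[3] + dy)


tiles = slice_boxes(2000, 1500)
shifted = to_full_image((10, 20, 110, 220), tiles[-1])
```

In a pipeline of this kind, each tile would be cropped and passed to the detector separately, and the shifted predictions would then be merged, typically with non-maximum suppression to remove duplicates in the overlap regions.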