Transforming documents into machine-processable representations is a challenging task due to their complex structures and variability in formats. Recovering the layout structure and content from PDF files or scanned material has remained a key problem for decades. ICDAR has a long tradition of hosting competitions to benchmark the state of the art and encourage the development of novel solutions for document layout understanding. In this report, we present the results of our \textit{ICDAR 2023 Competition on Robust Layout Segmentation in Corporate Documents}, which challenged participants to accurately segment the page layout across a broad range of document styles and domains, including corporate reports, technical literature and patents. To raise the bar over previous competitions, we engineered a hard competition dataset and proposed the recent DocLayNet dataset for training. We recorded 45 team registrations and received official submissions from 21 teams. In the presented solutions, we recognize interesting combinations of recent computer vision models, data augmentation strategies and ensemble methods that achieve remarkable accuracy on the task we posed. A clear trend towards the adoption of vision-transformer-based methods is evident. The results demonstrate substantial progress towards robust and highly generalizing methods for document layout understanding.