Document layout analysis (DLA) is crucial for understanding the physical layout and logical structure of documents, supporting applications such as information retrieval, document summarization, and knowledge extraction. However, previous studies have typically used separate models to address individual sub-tasks within DLA, including table/figure detection, text region detection, logical role classification, and reading order prediction. In this work, we propose an end-to-end transformer-based approach for document layout analysis, called DLAFormer, which integrates all of these sub-tasks into a single model. To achieve this, we treat the various DLA sub-tasks (such as text region detection, logical role classification, and reading order prediction) as relation prediction problems and consolidate their relation labels into a unified label space, allowing a single relation prediction module to handle multiple tasks concurrently. Additionally, we introduce a novel set of type-wise queries to enhance the physical meaning of the content queries in DETR. Moreover, we adopt a coarse-to-fine strategy to accurately identify graphical page objects. Experimental results demonstrate that DLAFormer outperforms previous approaches that employ multi-branch or multi-stage architectures on two document layout analysis benchmarks, DocLayNet and Comp-HRDoc.
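To illustrate the idea of consolidating relation labels into a unified label space, the sketch below maps task-specific relation labels from several sub-tasks into one flat index space, so that a single prediction head can score all of them jointly. The task names and label sets are purely illustrative assumptions, not DLAFormer's actual label definitions.

```python
# Hypothetical sketch of a unified relation label space.
# Task names and per-task labels below are illustrative, not DLAFormer's.
TASK_LABELS = {
    "intra_region": ["same_region"],                              # text region detection
    "logical_role": ["title", "paragraph", "caption", "footnote"],  # role classification
    "reading_order": ["precedes"],                                # reading order prediction
}

# Flatten into one label space: each (task, label) pair gets a global id,
# so one relation prediction module can handle all sub-tasks at once.
UNIFIED = [(task, lab) for task, labs in TASK_LABELS.items() for lab in labs]
LABEL_ID = {pair: i for i, pair in enumerate(UNIFIED)}

def to_unified(task: str, label: str) -> int:
    """Map a task-specific relation label to its unified id."""
    return LABEL_ID[(task, label)]

def from_unified(idx: int) -> tuple:
    """Recover the (task, label) pair from a unified id."""
    return UNIFIED[idx]
```

With this mapping, relation annotations from different sub-tasks become interchangeable training targets for one classifier head, which is the property that lets a single module serve multiple tasks.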