Document layout analysis (DLA) is the task of detecting the distinct semantic content items within a document and classifying each one into an appropriate category (e.g., text, title, figure). DLA pipelines let users convert documents into structured, machine-readable formats that can then feed many downstream tasks. Most existing state-of-the-art (SOTA) DLA models represent documents as images, discarding the rich metadata available in electronically generated PDFs. Leveraging this metadata directly, we represent each PDF page as a structured graph and frame the DLA problem as a graph segmentation and classification problem. We introduce the Graph-based Layout Analysis Model (GLAM), a lightweight graph neural network that is competitive with SOTA models on two challenging DLA datasets while being an order of magnitude smaller than existing models. In particular, the 4-million-parameter GLAM model outperforms the leading 140M+ parameter computer vision-based model on 5 of the 11 classes of the DocLayNet dataset. A simple ensemble of these two models achieves a new state of the art on DocLayNet, increasing mAP from 76.8 to 80.8. Overall, GLAM is over 5 times more efficient than SOTA models, making GLAM a favorable engineering choice for DLA tasks.
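To make the graph framing concrete, the following is a minimal sketch of how a page could be treated as a graph and then segmented and classified. It is not the GLAM implementation. The `Box` structure, the `build_edges` proximity rule (a stand-in for a learned edge classifier), the per-box `label` field (a stand-in for a learned node classifier), and the `max_gap` threshold are all hypothetical, and the coordinate convention (y grows downward) is an assumption.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """A text box taken from PDF metadata (hypothetical structure)."""
    x0: float
    y0: float
    x1: float
    y1: float
    label: str  # stand-in for a per-node class prediction


def vertical_gap(a: Box, b: Box) -> float:
    """Distance between two boxes along the y axis (0 if they overlap)."""
    return max(0.0, max(a.y0, b.y0) - min(a.y1, b.y1))


def horizontal_overlap(a: Box, b: Box) -> bool:
    """True if the x ranges of the two boxes intersect."""
    return min(a.x1, b.x1) > max(a.x0, b.x0)


def build_edges(boxes, max_gap=5.0):
    """Keep an edge between boxes that overlap horizontally and sit close
    together vertically. Placeholder for a learned edge classifier."""
    edges = []
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            a, b = boxes[i], boxes[j]
            if horizontal_overlap(a, b) and vertical_gap(a, b) <= max_gap:
                edges.append((i, j))
    return edges


def segment(boxes, edges):
    """Group nodes into connected components (union-find), then give each
    component a bounding box and a majority-vote class."""
    parent = list(range(len(boxes)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in edges:
        parent[find(i)] = find(j)

    groups = {}
    for i, box in enumerate(boxes):
        groups.setdefault(find(i), []).append(box)

    regions = []
    for members in groups.values():
        label = Counter(b.label for b in members).most_common(1)[0][0]
        bbox = (min(b.x0 for b in members), min(b.y0 for b in members),
                max(b.x1 for b in members), max(b.y1 for b in members))
        regions.append((label, bbox))
    # Order regions from top to bottom of the page.
    return sorted(regions, key=lambda r: r[1][1])


# Toy page: one title line followed by a three-line paragraph.
page = [
    Box(50, 40, 300, 60, "title"),
    Box(50, 100, 300, 112, "text"),
    Box(50, 114, 300, 126, "text"),
    Box(50, 128, 220, 140, "text"),
]
regions = segment(page, build_edges(page))
```

On this toy page the three paragraph lines merge into a single `text` region, while the title stays separate because its gap to the paragraph exceeds `max_gap`. In a learned model, the proximity rule and the fixed labels would be replaced by predictions over edges and nodes.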