Advances in Visually Rich Document Understanding (VrDU) have enabled information extraction and question answering over documents with complex layouts. Two families of architectures have emerged: transformer-based models inspired by large language models (LLMs), and Graph Neural Networks (GNNs). In this paper, we introduce DocGraphLM, a novel framework that combines pre-trained language models with graph semantics. To achieve this, we propose 1) a joint encoder architecture to represent documents, and 2) a novel link prediction approach to reconstruct document graphs. DocGraphLM predicts both the direction and the distance between nodes using a convergent joint loss function that prioritizes neighborhood restoration and downweights distant node detection. Our experiments on three benchmark datasets show consistent improvement on information extraction (IE) and question answering (QA) tasks when graph features are adopted. Moreover, we report that adopting the graph features accelerates convergence during training, even though these features are constructed solely through link prediction.
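To make the idea of a joint direction-and-distance objective concrete, the following is a minimal sketch, not the DocGraphLM implementation. Everything specific in it is an assumption made for illustration: the eight direction classes, the absolute-error distance term, the mixing factor `lam`, the `1 - distance` weighting, and all function names. It only shows one way a loss could favor nearby node pairs over distant ones.

```python
import math

NUM_DIRECTIONS = 8  # assumption: directions are bucketed into 8 classes


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def direction_loss(logits, target_class):
    """Cross-entropy for the direction prediction of one node pair."""
    return -math.log(softmax(logits)[target_class])


def distance_loss(predicted, target):
    """Absolute error between predicted and true normalized distance."""
    return abs(predicted - target)


def joint_link_loss(pairs, lam=0.5):
    """Average a distance-weighted mix of both losses over node pairs.

    Each pair is a tuple
    (direction_logits, true_direction, predicted_distance, true_distance),
    with distances assumed to be normalized to [0, 1].
    """
    total = 0.0
    for logits, true_dir, pred_dist, true_dist in pairs:
        # Hypothetical weighting: pairs that are truly close count more,
        # pairs that are far apart are downweighted.
        weight = 1.0 - true_dist
        per_pair = (lam * distance_loss(pred_dist, true_dist)
                    + (1.0 - lam) * direction_loss(logits, true_dir))
        total += weight * per_pair
    return total / len(pairs)
```

With identical prediction errors, a pair at normalized distance 0.1 contributes nine times as much to this loss as a pair at distance 0.9, which is the neighborhood-first behavior the abstract describes.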