Visual Relation Extraction (VRE) is a powerful means of discovering relationships between entities within visually rich documents. Existing methods often focus on manipulating entity features to find pairwise relations, yet neglect the more fundamental structural information that links disparate entity pairs together. Without global structure information, a model may struggle to learn long-range relations and is prone to predicting conflicting results. To alleviate these limitations, we propose a \textbf{G}l\textbf{O}bal \textbf{S}tructure knowledge-guided relation \textbf{E}xtraction (\textbf{\model}) framework. {\model} begins by generating preliminary relation predictions on entity pairs extracted from a scanned image of the document. Global structural knowledge is then captured from the predictions of the preceding iteration and incorporated into the entity representations. This ``generate-capture-incorporate'' cycle is repeated multiple times, allowing entity representations and global structure knowledge to reinforce each other. Extensive experiments validate that {\model} not only outperforms existing methods in the standard fine-tuning setting but also exhibits superior cross-lingual learning capabilities, and it even yields stronger data-efficient performance in the low-resource setting. The code for GOSE will be available at https://github.com/chenxn2020/GOSE.
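The ``generate-capture-incorporate'' cycle can be illustrated with a minimal sketch. This is a toy illustration under stated assumptions, not the implementation of the proposed framework: the dot-product scorer, the probability-weighted averaging used as a stand-in for global structure knowledge, the mixing weight `alpha`, and all function names are hypothetical choices made only to show the control flow of the loop.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def generate(reps):
    """Generate: score every ordered entity pair.

    Assumption: a toy scorer (sigmoid of a dot product) stands in for a
    learned relation classifier. Self-pairs are scored 0.
    """
    n = len(reps)
    return [
        [
            0.0 if i == j
            else sigmoid(sum(a * b for a, b in zip(reps[i], reps[j])))
            for j in range(n)
        ]
        for i in range(n)
    ]


def capture(reps, probs):
    """Capture: summarize structure from the current predictions.

    Assumption: each entity's context is the average of the other entities'
    representations, weighted by the predicted relation probabilities.
    """
    n, dim = len(reps), len(reps[0])
    context = []
    for i in range(n):
        total = sum(probs[i]) or 1.0  # guard against division by zero
        context.append(
            [sum(probs[i][j] * reps[j][k] for j in range(n)) / total
             for k in range(dim)]
        )
    return context


def incorporate(reps, context, alpha=0.5):
    """Incorporate: blend the captured context into each representation."""
    return [
        [(1 - alpha) * r + alpha * c for r, c in zip(rep, ctx)]
        for rep, ctx in zip(reps, context)
    ]


def refine(reps, num_iters=3):
    """Repeat the generate-capture-incorporate cycle num_iters times."""
    probs = generate(reps)  # preliminary predictions
    for _ in range(num_iters):
        context = capture(reps, probs)
        reps = incorporate(reps, context)
        probs = generate(reps)  # predictions from the updated representations
    return reps, probs
```

Calling `refine([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], num_iters=2)` returns the updated entity representations together with a 3-by-3 matrix of pairwise relation scores. In this sketch each iteration's predictions depend on representations that already reflect the previous predictions, which is the mutual reinforcement described above.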